Posted to dev@mahout.apache.org by Pat Ferrel <pa...@occamsmachete.com> on 2015/11/27 20:32:52 UTC

using spark-submit to launch CLI jobs

Currently we create a SparkMahoutContext and use “mahout -spark classpath” to create the SparkContext. The SparkConf is also directly accessed. If we move to using spark-submit for launching the Mahout Shell and other drivers we would need to refactor some of this and change the mahout script. It seems desirable to have the driver code create the Spark context and rely on spark-submit for any config overrides and params. This implies the possible removal (not sure about this) of SparkMahoutContext. In general it would be nice if this were done outside of Mahout, or limited to the drivers and shell. Mahout has become a library that is designed to be backend independent. This code was designed before that became a goal, and I don't fully grasp how much work would be involved or what would replace it.

The code refactoring needed is not well understood, by me at least. But intuition says that with a growing number of backends it might be good to clean up the Spark dependencies for context management. This has also been a bit of a problem in creating apps that use Mahout, since typical spark-submit usage cannot be relied on to make config changes; they must be made through environment variables only. This arguably non-standard manipulation of the context puts limitations and hidden assumptions into using Mahout as a library.

Doing all of this implies a fairly large bit of work, I think. The benefit is that it will be clearer how to use Mahout as a library, and some unneeded code will get cleaned up. I'm not sure I have enough time to do all of this myself.

This isn’t so much a proposal as a call for discussion.



Re: using spark-submit to launch CLI jobs

Posted by Dmitriy Lyubimov <dl...@gmail.com>.
PS: all that is needed to run a submitted application is to tell the context not
to look for the standard Mahout jars (assuming Mahout is not installed on every
cluster node) but rather to use the uber jar that the app was started
from (which is trivially resolved, for example, via the Hadoop get-jar util).
I don't remember the details exactly but can probably clarify further if needed.

But the point is, everything needed to support custom classpaths is already in the
API, including situations involving running off the spark-submit API.
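
As a rough illustration of that point, here is a minimal sketch (an assumption-laden example, not a tested recipe) of creating the Mahout context inside an application launched via spark-submit, with the standard Mahout jar lookup turned off so that the uber jar supplies everything. The master URL and app name are placeholders, and the addMahoutJars parameter follows the factory method discussed later in this thread:

    import org.apache.mahout.sparkbindings._

    // Assumed factory signature: mahoutSparkContext(masterUrl, appName, ...)
    // from the sparkbindings package object. addMahoutJars = false skips the
    // MAHOUT_HOME jar resolution and relies on the uber jar shipped with the
    // submitted application.
    implicit val mc = mahoutSparkContext(
      "yarn-cluster",       // placeholder master; whatever spark-submit targets
      "my-submitted-app",   // placeholder application name
      addMahoutJars = false)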

Unfortunately, the whole spark-submit deal is pretty awkward. One usually
needs to have one's app build produce uber jars, including all transitive
dependencies except those already in Spark. And then one either uses the
undocumented submit class or the official way, which is running the script.

I think the undocumented API was rewritten once again in later Spark
versions and has been declared user-level after that, IIRC.

That means there is very little incentive to put Mahout submit code in
the Mahout project itself, since it would only be able to handle submits
that rely exclusively on code that is already in Mahout itself, while the
external code would still have to take care of building uber jars, which is
really the main thing there is to take care of.

Re: using spark-submit to launch CLI jobs

Posted by Andrew Palumbo <ap...@outlook.com>.
Pat, that seems like a good approach; my only ask would be that you keep mahoutSparkContext public.

________________________________________
From: Pat Ferrel <pa...@occamsmachete.com>
Sent: Sunday, November 29, 2015 1:33 PM
To: dev@mahout.apache.org
Subject: Re: using spark-submit to launch CLI jobs

BTW I agree with a later reply from Dmitriy that real use of Mahout will generally employ spark-submit, so the motivation is primarily related to launching app/driver-level things in Mahout. But these have broken several times now, partly due to Mahout not following the spark-submit conventions (ever changing though they may be).

One other motivation is that the Spark bindings' mahoutSparkContext function calls the mahout script to get a classpath and then creates a SparkContext. It might be good to make this private to Mahout (used only in the test suites) so users don't see it as the only or preferred way to create a SparkMahoutContext, which seems better constructed from an existing SparkContext:

    implicit val sc = <Spark Context creation code>
    implicit val mc = SparkDistributedContext( sc )
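
A slightly fuller sketch of that construction, assuming the application (or spark-submit conventions) owns SparkContext creation; the app name and settings are illustrative only:

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.mahout.sparkbindings.SparkDistributedContext

    // Config overrides and params are expected to come from spark-submit;
    // only an illustrative app name is set in code.
    val conf = new SparkConf().setAppName("mahout-driver-example")
    implicit val sc = new SparkContext(conf)
    implicit val mc = new SparkDistributedContext(sc)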

Since the drivers are sometimes used as examples of employing Mahout with Spark, we could change them to use the above method, and for the same reasons employing spark-submit to launch them is the right example to give.

If no one else is particularly interested in this bit of refactoring and there are no contrary opinions to the above, I'm inclined to do it as I have time.


On Nov 28, 2015, at 10:55 AM, Pat Ferrel <pa...@occamsmachete.com> wrote:

I also use spark-submit to launch apps that use Mahout, so I'm not sure what assumptions you are talking about. The first thing is to use spark-submit in our own launch script.

The current code calls the CLI mahout script to get classpath info; this should instead be passed to spark-submit, so if we launch with spark-submit I think the call to the mahout script would be unnecessary. This makes it more straightforward to use with YARN cluster mode, where the client/driver is launched on some cluster machine where there would be no script to call.

If the SparkMahoutContext is a hard requirement that’s fine. As I said, I don’t understand all of those ramifications.

On Nov 27, 2015, at 8:25 PM, Dmitriy Lyubimov <dl...@gmail.com> wrote:

I do submits all the time, don't see any problem. It is part of my standard
stress test harness.

Mahout context is conceptual and cannot be removed, nor is it required to
be removed in order to run submitted jobs. Submission and contexts are two
completely separate concepts. One can submit a job that, for example, doesn't
set up a Spark job at all and instead runs an MR job, or just
manipulates some HDFS directories, or sets up multiple jobs or combinations
of all of the above. All submission means is sending an uber jar to an
application server and launching a main class there, instead of doing the
same locally. Not sure where all these assumptions are coming from.
On Nov 27, 2015 11:33 AM, "Pat Ferrel" <pa...@occamsmachete.com> wrote:

> Currently we create a SparkMahoutContext, and use “mahout -spark
> classpath” to create the SparkContext. The SparkConf is also directly
> accessed. If we move to using spark-submit for launching the Mahout Shell
> and other drivers we would need to refactor some of this and change the
> mahout script. It seems desirable to have the driver code create the Spark
> context and rely on spark-submit for any config overrides and params. This
> implies the possible removal (not sure about this) of SparkMahoutContext.
> In general it would be nice if this were done outside of Mahout, or limited
> to the drivers and shell. Mahout has become a library that is designed to
> be backend independent. This code was designed before this became a goal
> and is beyond my understanding to fully grasp how much work would be
> involved and what would replace it.
>
> The code refactoring needed is not well understood, by me at least. But
> intuition says that with a growing number of backends it might be good to
> clean up the Spark dependencies for context management. This has also been
> a bit of a problem in creating apps that use Mahout since typical
> spark-submit use cannot be relied on to make config changes, they must be
> made in environment variables only. This arguably non-standard
> manipulation of the context puts limitations and hidden assumptions into
> using Mahout as a library.
>
> Doing all of this implies a fairly large bit of work, I think. The benefit
> is that it will be clearer how to use Mahout as a library and in
> cleaning up some unneeded code. I’m not sure I have enough time to do all
> of this myself.
>
> This isn’t so much a proposal as a call for discussion.
>
>
>



Re: using spark-submit to launch CLI jobs

Posted by Dmitriy Lyubimov <dl...@gmail.com>.
PS: but what I see as a major problem with intended spark-submit support in
Mahout is that whatever we do, we will not facilitate spark-submit for
users who write their own algorithms outside of the Mahout code base structure
(which is my case). Nor is such support sorely missed, since all that is needed
is to properly decorate Mahout context creation in a submitted application,
as I have previously explained.

Re: using spark-submit to launch CLI jobs

Posted by Dmitriy Lyubimov <dl...@gmail.com>.
If you want to add a spark-submit option to any of the current CLIs, it is very
EASY to do without changing any of the existing context factory method APIs.

If unsure, ask how.

On Thu, Dec 3, 2015 at 9:17 AM, Pat Ferrel <pa...@occamsmachete.com> wrote:

> Rather than respond to these.
>
> I read votes of -1 and neutral to touching this part of Mahout. So I am
> currently uninclined to mess with it. I’ll concentrate on documenting how
> to use Mahout with external apps.
>
> On Nov 29, 2015, at 9:21 PM, Dmitriy Lyubimov <dl...@gmail.com> wrote:
>
> On Sun, Nov 29, 2015 at 10:33 AM, Pat Ferrel <pa...@occamsmachete.com>
> wrote:
>
> > BTW I agree with a later reply from Dmitriy that real use of Mahout
> > generally will employ spark-submit
>
>
> I never said that. I said I use it for stress tests, to test out certain
> components of algorithms under pressure. For the "real thing" I can
> unfortunately only use coarse-grained, long-living driver-side session
> management, because fine-grained scheduling and (god forbid) submit would
> take the strong-scale properties of Spark under coarse-grained scheduling
> (in terms of iterations) from awful to impossible, due to my algorithm
> specifics and product reqs. Spark submits are inherently evil when it comes
> to exploratory analysis.
>
>
> > so the motivation is primarily related to launching app/driver level
> > things in Mahout. But these have broken several times now partly due to
> > Mahout not following the spark-submit conventions (ever changing though
> > they may be).
> >
> > One other motivation is that the Spark bindings mahoutSparkContext
> > function calls the mahout script to get a classpath and then creates a
> > Spark Context. It might be good to make this private to Mahout (used only
> > in the test suites) so users don’t see this as the only or preferred way
> to
> > create a SparkMahoutContext, which seems better constructed from a Spark
> > Context.
> >
>
>    implicit val sc = <Spark Context creation code>
> >    implicit val mc = SparkDistributedContext( sc )
> >
> >
> Again, I don't see a problem.
>
> If we _carefully_ study the context factory method [1], we will notice that
> it has a parameter addMahoutJars which can be set to false, in which case
> the factory method does none of what you imply. It doesn't require calling
> out to the mahout script; it doesn't even require MAHOUT_HOME. On top of
> that, it allows you to add your own jars (the 'customJars' parameter) or
> even override whatever you like in SparkConf directly (the 'sparkConf'
> parameter). I don't know what more it could possibly expose to let you do
> whatever you want with the context.
>
> If for some reason you don't have control over how the context creation
> parameters are applied, and you absolutely must wrap an existing Spark
> context, this is also possible by just doing 'new SparkDistributedContext(sc)',
> and in fact I am guilty of having done that in a few situations as well.
> But I think there's a good reason to discourage it, because we do want to
> assert certain identities in the context to make sure it is still workable.
> Not many, just the Kryo serialization for now, but what we assert is
> immaterial; what is material is that we do want to retain some control over
> the context parameters, if nothing else than to validate them.
>
> So yes, the factory method is still preferable, even for spark-submit
> applications. Perhaps a tutorial on how to do that would be a good thing,
> but I don't see what could be essentially better than what there is now.
>
> When/if Mahout has a much richer library of standard implementations, so
> that someone actually wants to use command lines to run them, maybe it's
> worth making these have spark-submit options. (As you know, I am sceptical
> about command lines though.)
>
> If you want to implement a submit option for the cross-recommender in
> addition to any CLI that exists, sure, go ahead. It is easy with the
> existing API. Let me know if you need any further information beyond what
> is provided here.
>
> [1]
>
> https://github.com/apache/mahout/blob/master/spark/src/main/scala/org/apache/mahout/sparkbindings/package.scala
> line 58
>
>

Re: using spark-submit to launch CLI jobs

Posted by Pat Ferrel <pa...@occamsmachete.com>.
Rather than respond to these.

I read votes of -1 and neutral to touching this part of Mahout. So I am currently uninclined to mess with it. I’ll concentrate on documenting how to use Mahout with external apps.  

On Nov 29, 2015, at 9:21 PM, Dmitriy Lyubimov <dl...@gmail.com> wrote:

On Sun, Nov 29, 2015 at 10:33 AM, Pat Ferrel <pa...@occamsmachete.com> wrote:

> BTW I agree with a later reply from Dmitriy that real use of Mahout
> generally will employ spark-submit


I never said that. I said I use it for stress tests, to test out certain
components of algorithms under pressure. For the "real thing" I can
unfortunately only use coarse-grained, long-living driver-side session
management, because fine-grained scheduling and (god forbid) submit would
take the strong-scale properties of Spark under coarse-grained scheduling
(in terms of iterations) from awful to impossible, due to my algorithm
specifics and product reqs. Spark submits are inherently evil when it comes
to exploratory analysis.


> so the motivation is primarily related to launching app/driver level
> things in Mahout. But these have broken several times now partly due to
> Mahout not following the spark-submit conventions (ever changing though
> they may be).
> 
> One other motivation is that the Spark bindings mahoutSparkContext
> function calls the mahout script to get a classpath and then creates a
> Spark Context. It might be good to make this private to Mahout (used only
> in the test suites) so users don’t see this as the only or preferred way to
> create a SparkMahoutContext, which seems better constructed from a Spark
> Context.
> 

   implicit val sc = <Spark Context creation code>
>    implicit val mc = SparkDistributedContext( sc )
> 
> 
Again, I don't see a problem.

If we _carefully_ study the context factory method [1], we will notice that it
has a parameter addMahoutJars which can be set to false, in which case the
factory method does none of what you imply. It doesn't require calling out to
the mahout script; it doesn't even require MAHOUT_HOME. On top of that, it
allows you to add your own jars (the 'customJars' parameter) or even override
whatever you like in SparkConf directly (the 'sparkConf' parameter). I don't
know what more it could possibly expose to let you do whatever you want with
the context.

If for some reason you don't have control over how the context creation
parameters are applied, and you absolutely must wrap an existing Spark
context, this is also possible by just doing 'new SparkDistributedContext(sc)',
and in fact I am guilty of having done that in a few situations as well. But I
think there's a good reason to discourage it, because we do want to assert
certain identities in the context to make sure it is still workable. Not many,
just the Kryo serialization for now, but what we assert is immaterial; what is
material is that we do want to retain some control over the context
parameters, if nothing else than to validate them.

So yes, the factory method is still preferable, even for spark-submit
applications. Perhaps a tutorial on how to do that would be a good thing, but
I don't see what could be essentially better than what there is now.

When/if Mahout has a much richer library of standard implementations, so that
someone actually wants to use command lines to run them, maybe it's worth
making these have spark-submit options. (As you know, I am sceptical about
command lines though.)

If you want to implement a submit option for the cross-recommender in
addition to any CLI that exists, sure, go ahead. It is easy with the existing
API. Let me know if you need any further information beyond what is provided
here.

[1]
https://github.com/apache/mahout/blob/master/spark/src/main/scala/org/apache/mahout/sparkbindings/package.scala
line 58


Re: using spark-submit to launch CLI jobs

Posted by Dmitriy Lyubimov <dl...@gmail.com>.
On Sun, Nov 29, 2015 at 10:33 AM, Pat Ferrel <pa...@occamsmachete.com> wrote:

> BTW I agree with a later reply from Dmitriy that real use of Mahout
> generally will employ spark-submit


I never said that. I said I use it for stress tests, to test out certain
components of algorithms under pressure. For the "real thing" I can
unfortunately only use coarse-grained, long-living driver-side session
management, because fine-grained scheduling and (god forbid) submit would
take the strong-scale properties of Spark under coarse-grained scheduling
(in terms of iterations) from awful to impossible, due to my algorithm
specifics and product reqs. Spark submits are inherently evil when it comes
to exploratory analysis.


> so the motivation is primarily related to launching app/driver level
> things in Mahout. But these have broken several times now partly due to
> Mahout not following the spark-submit conventions (ever changing though
> they may be).
>
> One other motivation is that the Spark bindings mahoutSparkContext
> function calls the mahout script to get a classpath and then creates a
> Spark Context. It might be good to make this private to Mahout (used only
> in the test suites) so users don’t see this as the only or preferred way to
> create a SparkMahoutContext, which seems better constructed from a Spark
> Context.
>

    implicit val sc = <Spark Context creation code>
>     implicit val mc = SparkDistributedContext( sc )
>
>
Again, I don't see a problem.

If we _carefully_ study the context factory method [1], we will notice that it
has a parameter addMahoutJars which can be set to false, in which case the
factory method does none of what you imply. It doesn't require calling out to
the mahout script; it doesn't even require MAHOUT_HOME. On top of that, it
allows you to add your own jars (the 'customJars' parameter) or even override
whatever you like in SparkConf directly (the 'sparkConf' parameter). I don't
know what more it could possibly expose to let you do whatever you want with
the context.
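
For illustration only, a hedged sketch of exercising those parameters; the master URL, app name, jar path, and config value are placeholders rather than real artifacts:

    import org.apache.spark.SparkConf
    import org.apache.mahout.sparkbindings._

    // Bring your own SparkConf and let the factory method apply it.
    val conf = new SparkConf().set("spark.executor.memory", "4g")

    implicit val mc = mahoutSparkContext(
      "spark://master:7077",                           // placeholder master URL
      "custom-jars-example",                           // placeholder app name
      customJars = Seq("/path/to/my-algorithms.jar"),  // your own jars, no MAHOUT_HOME needed
      sparkConf = conf,                                // override SparkConf directly
      addMahoutJars = false)                           // skip the standard Mahout jar lookup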

If for some reason you don't have control over how the context creation
parameters are applied, and you absolutely must wrap an existing Spark
context, this is also possible by just doing 'new SparkDistributedContext(sc)',
and in fact I am guilty of having done that in a few situations as well. But I
think there's a good reason to discourage it, because we do want to assert
certain identities in the context to make sure it is still workable. Not many,
just the Kryo serialization for now, but what we assert is immaterial; what is
material is that we do want to retain some control over the context
parameters, if nothing else than to validate them.
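
A short sketch of what that control amounts to when wrapping manually: the caller then owns settings the factory method would otherwise assert, such as Kryo serialization (the property key below is a standard Spark one; registering Mahout's Kryo registrator would also be needed in practice and is omitted here):

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.mahout.sparkbindings.SparkDistributedContext

    val conf = new SparkConf()
      .setAppName("wrapped-context-example")
      // The setting the factory method would otherwise take care of / validate:
      .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")

    implicit val mc = new SparkDistributedContext(new SparkContext(conf))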

So yes, the factory method is still preferable, even for spark-submit
applications. Perhaps a tutorial on how to do that would be a good thing, but
I don't see what could be essentially better than what there is now.

When/if Mahout has a much richer library of standard implementations, so that
someone actually wants to use command lines to run them, maybe it's worth
making these have spark-submit options. (As you know, I am sceptical about
command lines though.)

If you want to implement a submit option for the cross-recommender in
addition to any CLI that exists, sure, go ahead. It is easy with the existing
API. Let me know if you need any further information beyond what is provided
here.

[1]
https://github.com/apache/mahout/blob/master/spark/src/main/scala/org/apache/mahout/sparkbindings/package.scala
line 58

Re: using spark-submit to launch CLI jobs

Posted by Pat Ferrel <pa...@occamsmachete.com>.
BTW I agree with a later reply from Dmitriy that real use of Mahout will generally employ spark-submit, so the motivation is primarily related to launching app/driver-level things in Mahout. But these have broken several times now, partly due to Mahout not following the spark-submit conventions (ever changing though they may be).

One other motivation is that the Spark bindings' mahoutSparkContext function calls the mahout script to get a classpath and then creates a SparkContext. It might be good to make this private to Mahout (used only in the test suites) so users don't see it as the only or preferred way to create a SparkMahoutContext, which seems better constructed from an existing SparkContext:

    implicit val sc = <Spark Context creation code>
    implicit val mc = SparkDistributedContext( sc )

Since the drivers are sometimes used as examples of employing Mahout with Spark, we could change them to use the above method, and for the same reasons employing spark-submit to launch them is the right example to give.

If no one else is particularly interested in this bit of refactoring and there are no contrary opinions to the above, I'm inclined to do it as I have time.


On Nov 28, 2015, at 10:55 AM, Pat Ferrel <pa...@occamsmachete.com> wrote:

I also use spark-submit to launch apps that use Mahout, so I'm not sure what assumptions you are talking about. The first thing is to use spark-submit in our own launch script.

The current code calls the CLI mahout script to get classpath info; this should instead be passed to spark-submit, so if we launch with spark-submit I think the call to the mahout script would be unnecessary. This makes it more straightforward to use with YARN cluster mode, where the client/driver is launched on some cluster machine where there would be no script to call.

If the SparkMahoutContext is a hard requirement that’s fine. As I said, I don’t understand all of those ramifications. 

On Nov 27, 2015, at 8:25 PM, Dmitriy Lyubimov <dl...@gmail.com> wrote:

I do submits all the time, don't see any problem. It is part of my standard
stress test harness.

Mahout context is conceptual and cannot be removed, nor is it required to
be removed in order to run submitted jobs. Submission and contexts are two
completely separate concepts. One can submit a job that, for example, doesn't
set up a Spark job at all and instead runs an MR job, or just
manipulates some HDFS directories, or sets up multiple jobs or combinations
of all of the above. All submission means is sending an uber jar to an
application server and launching a main class there, instead of doing the
same locally. Not sure where all these assumptions are coming from.
On Nov 27, 2015 11:33 AM, "Pat Ferrel" <pa...@occamsmachete.com> wrote:

> Currently we create a SparkMahoutContext, and use “mahout -spark
> classpath” to create the SparkContext. The SparkConf is also directly
> accessed. If we move to using spark-submit for launching the Mahout Shell
> and other drivers we would need to refactor some of this and change the
> mahout script. It seems desirable to have the driver code create the Spark
> context and rely on spark-submit for any config overrides and params. This
> implies the possible removal (not sure about this) of SparkMahoutContext.
> In general it would be nice if this were done outside of Mahout, or limited
> to the drivers and shell. Mahout has become a library that is designed to
> be backend independent. This code was designed before this became a goal
> and is beyond my understanding to fully grasp how much work would be
> involved and what would replace it.
> 
> The code refactoring needed is not well understood, by me at least. But
> intuition says that with a growing number of backends it might be good to
> clean up the Spark dependencies for context management. This has also been
> a bit of a problem in creating apps that use Mahout since typical
> spark-submit use cannot be relied on to make config changes, they must be
> made in environment variables only. This arguably non-standard
> manipulation of the context puts limitations and hidden assumptions into
> using Mahout as a library.
> 
> Doing all of this implies a fairly large bit of work, I think. The benefit
> is that it will be clearer how to use Mahout as a library and in
> cleaning up some unneeded code. I’m not sure I have enough time to do all
> of this myself.
> 
> This isn’t so much a proposal as a call for discussion.
> 
> 
> 



Re: using spark-submit to launch CLI jobs

Posted by Dmitriy Lyubimov <dl...@gmail.com>.
On Sat, Nov 28, 2015 at 10:55 AM, Pat Ferrel <pa...@occamsmachete.com> wrote:

> I use spark-submit also to launch apps that use Mahout so not sure what
> assumptions you are talking about.


OK, so if it works, what's the problem? I am lost.
I am talking about the assumption that anything dealing with the context
needs to be changed or even removed.


> The first thing is to use spark-submit in our own launch script.
>
What script would that be?


> The current code calls the CLI mahout script to get classpath info, this
> should be passed in to the


Which code? Mahout context creation? As I said, you can customize that
behavior. You can tell it not to look for the standard jars and to get your
own jars onto the classpath. It should be flexible enough to handle any
startup situation.


> spark-submit so if we launch with spark-submit I think the call of the
> mahout script would be unnecessary. This makes it more straightforward to
> use with Yarn cluster mode where the client/driver is launched on some
> cluster machine where there would be no script to call.
>

Again, see the comment above. Yes, I have done submits to YARN, standalone,
you name it. It is all good.

>
> If the SparkMahoutContext is a hard requirement that’s fine.


Every single operation uses the context (which essentially wraps the backend
context). It is not passed in; it is implied by a dataset parameter. No
physical operator can work without it.

For the most part, the context is required because the backend engines
require a session equivalent of it (SparkContext in Spark's case). This is
more a hard requirement on the backend's part.
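
A hedged illustration of that point using the DRM API (imports and signatures from memory, so treat them as approximate): the context is needed to create a distributed dataset, and downstream operators then derive it from their operands instead of taking it as an argument.

    import org.apache.mahout.math.scalabindings._
    import org.apache.mahout.math.drm._
    import org.apache.mahout.math.drm.RLikeDrmOps._

    // drmParallelize needs an implicit DistributedContext (mc) in scope to
    // create the DRM; A.t %*% A then picks the context up from its operand.
    val drmA = drmParallelize(dense((1, 2), (3, 4)), numPartitions = 2)
    val drmAtA = drmA.t %*% drmA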


> As I said, I don’t understand all of those ramifications.
>
> On Nov 27, 2015, at 8:25 PM, Dmitriy Lyubimov <dl...@gmail.com> wrote:
>
> I do submits all the time, don't see any problem. It is part of my standard
> stress test harness.
>
> Mahout context is conceptual and cannot be removed, nor is it required to
> be removed in order to run submitted jobs. Submission and contexts are two
> completely separate concepts. One can submit a job that, for example, doesn't
> set up a Spark job at all and instead runs an MR job, or just
> manipulates some HDFS directories, or sets up multiple jobs or combinations
> of all of the above. All submission means is sending an uber jar to an
> application server and launching a main class there, instead of doing the
> same locally. Not sure where all these assumptions are coming from.
> On Nov 27, 2015 11:33 AM, "Pat Ferrel" <pa...@occamsmachete.com> wrote:
>
> > Currently we create a SparkMahoutContext, and use “mahout -spark
> > classpath” to create the SparkContext. The SparkConf is also directly
> > accessed. If we move to using spark-submit for launching the Mahout Shell
> > and other drivers we would need to refactor some of this and change the
> > mahout script. It seems desirable to have the driver code create the
> Spark
> > context and rely on spark-submit for any config overrides and params.
> This
> > implies the possible removal (not sure about this) of SparkMahoutContext.
> > In general it would be nice if this were done outside of Mahout, or
> limited
> > to the drivers and shell. Mahout has become a library that is designed to
> > be backend independent. This code was designed before this became a goal
> > and is beyond my understanding to fully grasp how much work would be
> > involved and what would replace it.
> >
> > The code refactoring needed is not well understood, by me at least. But
> > intuition says that with a growing number of backends it might be good to
> > clean up the Spark dependencies for context management. This has also
> been
> > a bit of a problem in creating apps that use Mahout since typical
> > spark-submit use cannot be relied on to make config changes, they must be
> > made in environment variables only. This arguably non-standard
> > manipulation of the context puts limitations and hidden assumptions into
> > using Mahout as a library.
> >
> > Doing all of this implies a fairly large bit of work, I think. The
> benefit
> > is that it will be clearer how to use Mahout as a library and in
> > cleaning up some unneeded code. I’m not sure I have enough time to do all
> > of this myself.
> >
> > This isn’t so much a proposal as a call for discussion.
> >
> >
> >
>
>

Re: using spark-submit to launch CLI jobs

Posted by Pat Ferrel <pa...@occamsmachete.com>.
I also use spark-submit to launch apps that use Mahout, so I'm not sure what assumptions you are talking about. The first thing is to use spark-submit in our own launch script.

The current code calls the CLI mahout script to get classpath info; this should instead be passed to spark-submit, so if we launch with spark-submit I think the call to the mahout script would be unnecessary. This makes it more straightforward to use with YARN cluster mode, where the client/driver is launched on some cluster machine where there would be no script to call.

If the SparkMahoutContext is a hard requirement that’s fine. As I said, I don’t understand all of those ramifications. 

On Nov 27, 2015, at 8:25 PM, Dmitriy Lyubimov <dl...@gmail.com> wrote:

I do submits all the time, don't see any problem. It is part of my standard
stress test harness.

Mahout context is conceptual and cannot be removed, nor is it required to
be removed in order to run submitted jobs. Submission and contexts are two
completely separate concepts. One can submit a job that, for example, doesn't
set up a Spark job at all and instead runs an MR job, or just
manipulates some HDFS directories, or sets up multiple jobs or combinations
of all of the above. All submission means is sending an uber jar to an
application server and launching a main class there, instead of doing the
same locally. Not sure where all these assumptions are coming from.
On Nov 27, 2015 11:33 AM, "Pat Ferrel" <pa...@occamsmachete.com> wrote:

> Currently we create a SparkMahoutContext, and use “mahout -spark
> classpath” to create the SparkContext. The SparkConf is also directly
> accessed. If we move to using spark-submit for launching the Mahout Shell
> and other drivers we would need to refactor some of this and change the
> mahout script. It seems desirable to have the driver code create the Spark
> context and rely on spark-submit for any config overrides and params. This
> implies the possible removal (not sure about this) of SparkMahoutContext.
> In general it would be nice if this were done outside of Mahout, or limited
> to the drivers and shell. Mahout has become a library that is designed to
> be backend independent. This code was designed before this became a goal
> and is beyond my understanding to fully grasp how much work would be
> involved and what would replace it.
> 
> The code refactoring needed is not well understood, by me at least. But
> intuition says that with a growing number of backends it might be good to
> clean up the Spark dependencies for context management. This has also been
> a bit of a problem in creating apps that use Mahout since typical
> spark-submit use cannot be relied on to make config changes, they must be
> made in environment variables only. This arguably non-standard
> manipulation of the context puts limitations and hidden assumptions into
> using Mahout as a library.
> 
> Doing all of this implies a fairly large bit of work, I think. The benefit
> is that it will be clearer how to use Mahout as a library and in
> cleaning up some unneeded code. I’m not sure I have enough time to do all
> of this myself.
> 
> This isn’t so much a proposal as a call for discussion.
> 
> 
> 


Re: using spark-submit to launch CLI jobs

Posted by Dmitriy Lyubimov <dl...@gmail.com>.
I do submits all the time, don't see any problem. It is part of my standard
stress test harness.

Mahout context is conceptual and cannot be removed, nor is it required to
be removed in order to run submitted jobs. Submission and contexts are two
completely separate concepts. One can submit a job that, for example, doesn't
set up a Spark job at all and instead runs an MR job, or just
manipulates some HDFS directories, or sets up multiple jobs or combinations
of all of the above. All submission means is sending an uber jar to an
application server and launching a main class there, instead of doing the
same locally. Not sure where all these assumptions are coming from.
On Nov 27, 2015 11:33 AM, "Pat Ferrel" <pa...@occamsmachete.com> wrote:

> Currently we create a SparkMahoutContext, and use “mahout -spark
> classpath” to create the SparkContext. The SparkConf is also directly
> accessed. If we move to using spark-submit for launching the Mahout Shell
> and other drivers we would need to refactor some of this and change the
> mahout script. It seems desirable to have the driver code create the Spark
> context and rely on spark-submit for any config overrides and params. This
> implies the possible removal (not sure about this) of SparkMahoutContext.
> In general it would be nice if this were done outside of Mahout, or limited
> to the drivers and shell. Mahout has become a library that is designed to
> be backend independent. This code was designed before this became a goal
> and is beyond my understanding to fully grasp how much work would be
> involved and what would replace it.
>
> The code refactoring needed is not well understood, by me at least. But
> intuition says that with a growing number of backends it might be good to
> clean up the Spark dependencies for context management. This has also been
> a bit of a problem in creating apps that use Mahout since typical
> spark-submit use cannot be relied on to make config changes, they must be
> made in environment variables only. This arguably non-standard
> manipulation of the context puts limitations and hidden assumptions into
> using Mahout as a library.
>
> Doing all of this implies a fairly large bit of work, I think. The benefit
> is that it will be clearer how to use Mahout as a library and in
> cleaning up some unneeded code. I’m not sure I have enough time to do all
> of this myself.
>
> This isn’t so much a proposal as a call for discussion.
>
>
>