Posted to dev@mahout.apache.org by Pat Ferrel <pa...@occamsmachete.com> on 2014/11/05 18:48:59 UTC

Spark options

Spark has a launch script, just as Hadoop does. We use the Hadoop launcher script but not the Spark one. When you start up your Spark cluster there is a spark-env.sh script that can set a number of environment variables. In our own mahoutSparkContext function, which takes the place of the Spark submit script and launcher, we don't account for most of those environment variables.

Unless I missed something, this means most of the documented options will be ignored unless a Mahout user parses and sets them in their own SparkConf. The Mahout CLI drivers don't do this for all possible options; they support only a few, such as the job name and spark.executor.memory.

The question is how best to handle these Spark options. There seem to be two choices:
1) use Spark's launch mechanism for the drivers but allow some options to be overridden on the CLI
2) parse the env for options and use those variables to set up the SparkConf defaults in mahoutSparkContext

The downside of #2 is that as the variables change we'll have to reflect those changes in our code. I forget why #1 is not an option, but Dmitriy has been consistently against it; in any case it would mean a fair bit of refactoring, I believe.
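
For concreteness, a minimal sketch of what #2 might look like; the environment variables and the mapping below are only an illustrative subset of what spark-env.sh documents, not the full list:

    import org.apache.spark.SparkConf

    // Illustrative mapping from a couple of documented spark-env.sh variables
    // to SparkConf keys; a real implementation would have to track Spark's
    // docs release by release.
    val envToConfKey = Map(
      "SPARK_EXECUTOR_MEMORY" -> "spark.executor.memory",
      "SPARK_LOCAL_DIRS"      -> "spark.local.dir")

    def confDefaultsFromEnv(conf: SparkConf = new SparkConf()): SparkConf = {
      for ((envVar, confKey) <- envToConfKey; value <- sys.env.get(envVar))
        if (!conf.contains(confKey)) conf.set(confKey, value) // env supplies defaults only
      conf
    }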

Any opinions or corrections?

RE: Spark options

Posted by Andrew Palumbo <ap...@outlook.com>.
Sounds good. I hope to be working on one for naive Bayes; it's been a bit hectic lately, so hopefully sooner rather than later. I'll have a better understanding of the CLI code then.


Re: Spark options

Posted by Pat Ferrel <pa...@occamsmachete.com>.
Andrew, when you get to creating a driver, maybe we should take another look at how to launch them. I'll add the -Dxxx=yyy option for now.

 

Re: Spark options

Posted by Dmitriy Lyubimov <dl...@gmail.com>.
I do not object to the driver CLI using that. I was only skeptical about the shell
startup. And I also want these to be part of the official, documented Spark API
(are these classes part of it?). If they are not a stable API, we'd have trouble
doing a major dependency update. If we only depend on the RDD API, the updates
are easier.

But... if anyone wants to engineer and verify a patch that uses these to
launch the Mahout shell, and it works, I don't have a really strong basis for
objection aside from the API stability concern.


Re: Spark options

Posted by Pat Ferrel <pa...@occamsmachete.com>.
Yes, the drivers support executor memory directly too.

What was the reason you didn't want to use the Spark submit process for executing drivers? I understand we have to find our jars and set up Kryo.



Re: Spark options

Posted by Dmitriy Lyubimov <dl...@gmail.com>.
Which is why I explicitly configure executor memory on the client. Although
even that interpretation depends a LOT on the resource manager, it seems.
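
For anyone reading along, the explicit client-side setting amounts to a single call on the conf; the value here is only an example:

    import org.apache.spark.SparkConf

    // Explicitly pin executor memory on the client side; "4g" is just an example value.
    val conf = new SparkConf().set("spark.executor.memory", "4g")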


Re: Spark options

Posted by Pat Ferrel <pa...@occamsmachete.com>.
The submit code is the only place that documents which options are needed by clients, AFAICT. It is pretty complicated and heavily laden with checks for which cluster manager is being used. I'd feel a lot better if we were using it. There is no way any of us are going to be able to test on all of those configurations.

spark-env.sh is mostly for launching the cluster, not the client, but there seem to be exceptions like executor memory.




Re: Spark options

Posted by Dmitriy Lyubimov <dl...@gmail.com>.
These files, if I read them correctly, are for spawning yet another process. I
don't see how that would work for the shell.

I am also not convinced that spark-env is important for the client.



Re: Spark options

Posted by Pat Ferrel <pa...@occamsmachete.com>.
I was thinking -Dx=y too; it seems like a good idea.

But we should also support setting them the way Spark documents, in spark-env.sh, and the two links Andrew found may solve that in a maintainable way. Maybe we get the SparkConf from a new mahoutSparkConf function, which handles all env-supplied setup. For the drivers it can be done in the base class, allowing CLI overrides later. Then the SparkConf is finally passed in to mahoutSparkContext, where as little as possible is changed in the conf.

I'll look at this for the drivers. It should be easy to add to the shell.
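
A rough sketch of how such a mahoutSparkConf could layer the sources (env-supplied defaults, then -Dspark.* system properties, then explicit CLI overrides). The function name and the CLI-override map are hypothetical here, the env mapping is only an example, and the real mahoutSparkContext signature may differ:

    import org.apache.spark.SparkConf

    // Hypothetical factory: env variables supply defaults, a default SparkConf
    // already loads -Dspark.* system properties, and explicit CLI overrides win.
    def mahoutSparkConf(cliOverrides: Map[String, String] = Map.empty): SparkConf = {
      val conf = new SparkConf() // loadDefaults = true picks up -Dspark.* properties
      // Example env mapping only; a real version would cover the documented variables.
      sys.env.get("SPARK_EXECUTOR_MEMORY").foreach { mem =>
        if (!conf.contains("spark.executor.memory")) conf.set("spark.executor.memory", mem)
      }
      cliOverrides.foreach { case (k, v) => conf.set(k, v) } // CLI wins over everything else
      conf
    }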



Re: Spark options

Posted by Dmitriy Lyubimov <dl...@gmail.com>.
IMO you just need to modify `mahout spark-shell` to propagate -Dx=y
parameters to the Java startup call, and all should be fine.
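
One point worth noting about this approach: once the launcher passes -Dspark.* properties through to the JVM, the Scala side needs nothing special, because a default-constructed SparkConf reads them. A minimal illustration (the property and value are just examples):

    // If the launcher runs, say:  java -Dspark.executor.memory=4g ... <shell main class>
    // then a default-constructed SparkConf picks the property up automatically.
    import org.apache.spark.SparkConf

    val conf = new SparkConf() // loadDefaults = true reads any -Dspark.* system property
    println(conf.get("spark.executor.memory", "unset")) // prints "4g" only if the -D flag above was passed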


RE: Spark options

Posted by Andrew Palumbo <ap...@outlook.com>.


I've run into this problem starting $ mahout shell-script, i.e. needing to set spark.kryoserializer.buffer.mb and spark.akka.frameSize. I've been temporarily hard coding them for now while developing.

I'm just getting familiar with what you've done with the CLI drivers. For #2, could we borrow option parsing code/methods from Spark [1] [2] at each (Spark) release and somehow add this to MahoutOptionParser.parseSparkOptions?

I'll hopefully be doing some CLI work soon and have a better understanding.

[1] https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/deploy/SparkSubmitDriverBootstrapper.scala
[2] https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/deploy/SparkSubmit.scala
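
In the meantime, in case it is useful: the two settings above are ordinary SparkConf keys, so instead of hard coding them they could be given overridable defaults. A small sketch (the default values are arbitrary placeholders):

    import org.apache.spark.SparkConf

    // setIfMissing never clobbers a value that is already present, so a
    // -Dspark.kryoserializer.buffer.mb=... or -Dspark.akka.frameSize=... passed
    // on the java command line (and loaded by the default SparkConf) still wins.
    val conf = new SparkConf()
      .setIfMissing("spark.kryoserializer.buffer.mb", "32") // arbitrary example default
      .setIfMissing("spark.akka.frameSize", "20")           // arbitrary example default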
