Posted to dev@spark.apache.org by Shane Huang <sh...@gmail.com> on 2013/09/16 07:03:38 UTC

Propose to Re-organize the scripts and configurations

Having worked with a few customers, we found the current organization of
the scripts and configuration a bit confusing and inconvenient, so we
propose to reorganize them. Below we describe the specific reasons and a
rough proposal; please kindly provide your opinions and suggestions. :)

Specific reasons for re-organization:
1) Usually the application developers/users and platform administrators
belong to two teams, so it's better to separate the scripts used by
administrators from those used by application users, e.g. by putting them
in sbin and bin folders respectively.
2) User-level options and admin-level options need to be separated. For
example, an application user may never know how many spindles there are in
the cluster nodes, so it's often the administrator's duty to specify
spark.local.dir.
3) If there are multiple ways to specify an option, a clear overriding rule
should be present and should not be error-prone.
4) Currently the options are set and read through system properties, which
is hard to manage and inconvenient for users. It would be better to gather
the options into one file using a format like XML or JSON.

Some previous work:
1) SPARK-544 contains a discussion about providing a configuration class
for Spark: https://spark-project.atlassian.net/browse/SPARK-544
2) Ankur has gathered Spark options into a JSON format:
https://gist.github.com/ankurcha/5655646


Our rough proposal:

   - Scripts


   1. Make an "sbin" folder containing all the scripts for administrators,
   specifically:
      - all service administration scripts, i.e. start-*, stop-*,
      slaves.sh, *-daemons, *-daemon scripts
      - low-level or internally used utility scripts, i.e.
      compute-classpath, spark-config, spark-class, spark-executor
   2. Make a "bin" folder containing all the scripts for application
   developers/users, specifically:
      - user-level app-running scripts, i.e. pyspark, spark-shell; we also
      propose to add a "spark" script for users to run applications (much
      like spark-class, but possibly with more control or convenience
      utilities)
      - scripts for status checking, e.g. checking Spark and Hadoop
      versions, listing running applications, etc. This could be a separate
      script or functionality added to the "spark" script.
   3. No wandering scripts outside the sbin and bin folders


   -  Configurations/Options and overriding rule


   1. Define a Configuration class which contains all the options available
   to a Spark application. A Configuration instance can be de-/serialized
   from/to a JSON-formatted file.
   2. Each application (SparkContext) has one Configuration instance, and it
   is initialized by the application that creates it (either read from a
   file, passed from command-line options, or taken from the SPARK_JAVA_OPTS
   env variable).
   3. When launching an Executor on a node, the Configuration is first
   initialized from the node-local configuration file as the default; the
   Configuration passed from the application driver context then overrides
   any options specified in that default (a minimal sketch of this merge is
   shown below).
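
For concreteness, here is a minimal sketch of the merge behaviour described
in point 3, assuming a simple immutable key-value Configuration; the class,
method, and key values below are illustrative only, not an existing Spark
API (JSON de-/serialization is omitted for brevity):

case class Configuration(settings: Map[String, String]) {
  def get(key: String): Option[String] = settings.get(key)

  // Values in `other` take precedence over values in `this`.
  def overriddenBy(other: Configuration): Configuration =
    Configuration(settings ++ other.settings)
}

object ConfigurationMergeExample {
  def main(args: Array[String]): Unit = {
    // Node-local defaults, e.g. loaded from a per-node configuration file.
    val nodeLocal = Configuration(Map(
      "spark.local.dir"  -> "/mnt/disk1,/mnt/disk2",
      "spark.serializer" -> "org.apache.spark.serializer.JavaSerializer"))

    // Configuration shipped from the application driver context.
    val fromDriver = Configuration(Map(
      "spark.serializer" -> "org.apache.spark.serializer.KryoSerializer"))

    // Executor view: node-local defaults, overridden by the driver's settings.
    val effective = nodeLocal.overriddenBy(fromDriver)
    println(effective.get("spark.serializer")) // Some(...KryoSerializer)
    println(effective.get("spark.local.dir"))  // Some(/mnt/disk1,/mnt/disk2)
  }
}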


Any comments are welcome.

-- 
shannie.huang@gmail.com
*Shane Huang *
*Intel Asia-Pacific R&D Ltd.*
*Email: shengsheng.huang@intel.com*

Re: Propose to Re-organize the scripts and configurations

Posted by "shannie.huang" <sh...@gmail.com>.
I like the idea of using Typesafe Config. 

Nick, we'd be glad to work with you once we've gathered enough opinions and come to a consensus on the approach. 

On 2013-9-16, at 16:52, Nick Pentreath <ni...@gmail.com> wrote:

> There was another discussion on the old dev list about this:
> https://groups.google.com/forum/#!msg/spark-developers/GL2_DwAeh5s/9rwQ3iDa2t4J
> 
> I tend to agree with having configuration sitting in JSON (or properties
> files) and using the Typesafe Config library which can parse both.
> 
> Something I've used in my apps is along these lines:
> https://gist.github.com/MLnick/6578146
> 
> It's then easy to have default config overridden with CLI for example:
> val conf = cliConf.withFallback(defaultConf)
> 
> I'd be happy to be involved in working on this if there is a consensus
> about best approach
> 
> N
> 
> 
> 
> 
> 
> On Mon, Sep 16, 2013 at 9:29 AM, Mike <sp...@good-with-numbers.com> wrote:
> 
>> Shane Huang wrote:
>>> we found the current organization of the scripts and configuration a
>>> bit confusing and inconvenient
>> 
>> ditto
>> 
>>> - Scripts
>> 
>> I wonder why the work of these scripts wasn't mostly done in Scala.
>> Seems roundabout to use Bash (or Python, in spark-perf) to calculate
>> shell environment variables that are then read back into Scala code.
>> 
>>> 1. Define a Configuration class which contains all the options
>>> available for Spark application. A Configuration instance can be
>>> de-/serialized from/to a json formatted file.
>>> 2. Each application (SparkContext) has one Configuration instance and
>>> it is initialized by the application which creates it (either read
>>> from file or passed from command line options or env SPARK_JAVA_OPTS).
>> 
>> Reminiscent of what Hibernate's been doing for the past decade.  Would
>> be nice if the Configuration was also exposed through an MBean or such
>> so that one can check it's values with certainty.
>> 

Re: Propose to Re-organize the scripts and configurations

Posted by Nick Pentreath <ni...@gmail.com>.
There was another discussion on the old dev list about this:
https://groups.google.com/forum/#!msg/spark-developers/GL2_DwAeh5s/9rwQ3iDa2t4J

I tend to agree with having configuration sitting in JSON (or properties
files) and using the Typesafe Config library which can parse both.

Something I've used in my apps is along these lines:
https://gist.github.com/MLnick/6578146

It's then easy to have default config overridden with CLI for example:
val conf = cliConf.withFallback(defaultConf)
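
A minimal, self-contained sketch of that fallback chain using the Typesafe
Config API; the file name "spark-defaults.conf" and the example key are
assumptions for illustration, not anything Spark defines today:

import com.typesafe.config.{Config, ConfigFactory}
import java.io.File

object ConfigPrecedenceExample {
  def main(args: Array[String]): Unit = {
    // Lowest priority: defaults shipped with the application or node.
    val defaultConf: Config = ConfigFactory.parseFile(new File("spark-defaults.conf"))

    // Middle priority: -Dspark.* java system properties.
    val sysPropConf: Config = ConfigFactory.systemProperties()

    // Highest priority: settings passed on the command line,
    // e.g. "spark.executor.memory=2g" (HOCON accepts key=value lines).
    val cliConf: Config = ConfigFactory.parseString(args.mkString("\n"))

    // Each layer falls back to the one below it.
    val conf: Config = cliConf.withFallback(sysPropConf).withFallback(defaultConf)

    if (conf.hasPath("spark.executor.memory"))
      println(conf.getString("spark.executor.memory"))
  }
}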

I'd be happy to be involved in working on this if there is a consensus
about the best approach.

N





On Mon, Sep 16, 2013 at 9:29 AM, Mike <sp...@good-with-numbers.com> wrote:

> Shane Huang wrote:
> > we found the current organization of the scripts and configuration a
> > bit confusing and inconvenient
>
> ditto
>
> > - Scripts
>
> I wonder why the work of these scripts wasn't mostly done in Scala.
> Seems roundabout to use Bash (or Python, in spark-perf) to calculate
> shell environment variables that are then read back into Scala code.
>
> > 1. Define a Configuration class which contains all the options
> > available for Spark application. A Configuration instance can be
> > de-/serialized from/to a json formatted file.
> > 2. Each application (SparkContext) has one Configuration instance and
> > it is initialized by the application which creates it (either read
> > from file or passed from command line options or env SPARK_JAVA_OPTS).
>
> Reminiscent of what Hibernate's been doing for the past decade.  Would
> be nice if the Configuration was also exposed through an MBean or such
> so that one can check it's values with certainty.
>

Re: Propose to Re-organize the scripts and configurations

Posted by Mike <sp...@good-with-numbers.com>.
> I wonder why the work of these scripts wasn't mostly done in Scala.  

After some sleep, I guess the answer's obvious: to set the "java" 
command line.

Re: Propose to Re-organize the scripts and configurations

Posted by Mike <sp...@good-with-numbers.com>.
Shane Huang wrote:
> we found the current organization of the scripts and configuration a 
> bit confusing and inconvenient

ditto

> - Scripts

I wonder why the work of these scripts wasn't mostly done in Scala.  
Seems roundabout to use Bash (or Python, in spark-perf) to calculate 
shell environment variables that are then read back into Scala code.

> 1. Define a Configuration class which contains all the options 
> available for Spark application. A Configuration instance can be 
> de-/serialized from/to a json formatted file.
> 2. Each application (SparkContext) has one Configuration instance and 
> it is initialized by the application which creates it (either read 
> from file or passed from command line options or env SPARK_JAVA_OPTS).

Reminiscent of what Hibernate's been doing for the past decade.  Would 
be nice if the Configuration was also exposed through an MBean or such 
so that one can check its values with certainty.
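
As a rough illustration of that idea, the sketch below registers a
read-only JMX view of a configuration map; the class names and the
ObjectName are assumptions, not part of any existing Spark code:

import java.lang.management.ManagementFactory
import javax.management.ObjectName

// Standard MBean convention: the interface name is the class name plus "MBean".
trait SparkConfigViewMBean {
  def getSettings: String
}

class SparkConfigView(settings: Map[String, String]) extends SparkConfigViewMBean {
  // Render the effective configuration as one key=value pair per line.
  override def getSettings: String =
    settings.map { case (k, v) => s"$k=$v" }.mkString("\n")
}

object ConfigMBeanExample {
  def main(args: Array[String]): Unit = {
    val mbs = ManagementFactory.getPlatformMBeanServer
    val view = new SparkConfigView(Map(
      "spark.serializer" -> "org.apache.spark.serializer.KryoSerializer"))
    mbs.registerMBean(view, new ObjectName("org.apache.spark:type=Configuration"))
    // jconsole or any other JMX client can now inspect the values with certainty.
    Thread.sleep(60000) // keep the JVM alive long enough to connect
  }
}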

RE: Propose to Re-organize the scripts and configurations

Posted by "Xia, Junluan" <ju...@intel.com>.
Shane and I will focus on these two features (configuration and scripts) over the next few days, and look forward to merging them into Spark ASAP :)

-----Original Message-----
From: Shane Huang [mailto:shannie.huang@gmail.com] 
Sent: Sunday, September 22, 2013 12:13 PM
To: dev@spark.incubator.apache.org
Subject: Re: Propose to Re-organize the scripts and configurations

Done


On Sun, Sep 22, 2013 at 12:05 PM, Reynold Xin <rx...@cs.berkeley.edu> wrote:

> Thanks, Shane. Can you also link to this mailing list discussion from 
> the JIRA ticket?
>
>
> --
> Reynold Xin, AMPLab, UC Berkeley
> http://rxin.org
>
>
>
> On Sat, Sep 21, 2013 at 9:01 PM, Shane Huang <shannie.huang@gmail.com
> >wrote:
>
> > I summarized the opinions about Config in this post and added a 
> > comment
> on
> > SPARK-544.
> > Also post here below:
> >
> > 1) Define a Configuration class which contains all the options 
> > available for Spark application. A Configuration instance can be 
> > de-/serialized from/to a formatted file. Most of us tend to agree 
> > that Typesafe Config library is a good choice for the Configuration class.
> > 2) Each application (SparkContext) has one Configuration instance 
> > and it
> is
> > initialized by the application which creates it (either coded in app
> (apps
> > could explicitly read from io stream or command line arguments), or
> system
> > properties, or env vars).
> > 3) For an application the overriding rule should be code > system 
> > properties > env vars. Over time we will deprecate the env vars and 
> > maybe even system properties.
> > 4) When launching an Executor on a slave node, the Configuration is
> firstly
> > initialized using the node-local configuration file as default 
> > (instead
> of
> > the env vars at present), and then the Configuration passed from 
> > application driver context will override specific options specified 
> > in default. Certain options in app's Configuration will always 
> > override
> those
> > in node-local, because these options need to be the consistent 
> > across all the slave nodes, e.g. spark.serializer. In this case if 
> > any such options
> is
> > not set in app's Config, a value will be provided by the system. On 
> > the other hand, some options in app's Config will never override 
> > those in node-local. as they're not meat to be set in app, e.g. 
> > spark.local.dir
> >
> >
> > On Wed, Sep 18, 2013 at 1:42 AM, Matei Zaharia 
> > <matei.zaharia@gmail.com
> > >wrote:
> >
> > > Hi Shane,
> > >
> > > I agree with all these points. Improving the configuration system 
> > > is
> one
> > > of the main things I'd like to have in the next release.
> > >
> > > > 1) Usually the application developers/users and platform
> administrators
> > > > belongs to two teams. So it's better to separate the scripts 
> > > > used by administrators and application users, e.g. put them in 
> > > > sbin and bin
> > > folders
> > > > respectively
> > >
> > > Yup, right now we don't have any attempt to install on standard 
> > > system paths.
> > >
> > > > 3) If there are multiple ways to specify an option, an 
> > > > overriding
> rule
> > > > should be present and should not be error-prone.
> > >
> > > Yes, I think this should always be Configuration class in code > 
> > > system properties > env vars. Over time we will deprecate the env 
> > > vars and
> maybe
> > > even system properties.
> > >
> > > > 4) Currently the options are set and get using System property. 
> > > > It's
> > hard
> > > > to manage and inconvenient for users. It's good to gather the 
> > > > options
> > > into
> > > > one file using format like xml or json.
> > >
> > > I think this is the main thing to do first -- pick one 
> > > configuration
> > class
> > > and change the code to use this.
> > >
> > > > Our rough proposal:
> > > >
> > > >   - Scripts
> > > >
> > > >   1. make an "sbin" folder containing all the scripts for
> > administrators,
> > > >   specifically,
> > > >      - all service administration scripts, i.e. start-*, stop-*,
> > > >      slaves.sh, *-daemons, *-daemon scripts
> > > >      - low-level or internally used utility scripts, i.e.
> > > >      compute-classpath, spark-config, spark-class, spark-executor
> > > >   2. make a "bin" folder containing all the scripts for application
> > > >   developers/users, specifically,
> > > >      - user level app  running scripts, i.e. pyspark, 
> > > > spark-shell,
> and
> > we
> > > >      propose to add a script "spark" for users to run 
> > > > applications
> > (very
> > > much
> > > >      like spark-class but may add some more control or 
> > > > convenient
> > > utilities)
> > > >      - scripts for status checking, e.g. spark and hadoop version
> > > >      checking, running applications checking, etc. We can make 
> > > > this a
> > > separate
> > > >      script or add functionality to "spark" script.
> > > >   3. No wandering scripts outside the sbin and bin folders
> > >
> > > Makes sense.
> > >
> > > >   -  Configurations/Options and overriding rule
> > > >
> > > >   1. Define a Configuration class which contains all the options
> > > available
> > > >   for Spark application. A Configuration instance can be
> de-/serialized
> > > >   from/to a json formatted file.
> > > >   2. Each application (SparkContext) has one Configuration 
> > > > instance
> and
> > > it
> > > >   is initialized by the application which creates it (either 
> > > > read
> from
> > > file
> > > >   or passed from command line options or env SPARK_JAVA_OPTS).
> > > >   3. When launching an Executor on a node, the Configuration is
> firstly
> > > >   initialized using the node-local configuration file as default. The
> > > >   Configuration passed from application driver context will 
> > > > override
> > any
> > > >   options specified in default.
> > >
> > > This sounds great to me! The one thing I'll add is that we might 
> > > want
> to
> > > prevent applications from overriding certain settings on each 
> > > node,
> such
> > as
> > > work directories. The best way is to probably just ignore the 
> > > app's
> > version
> > > of those settings in the Executor.
> > >
> > > If you guys would like, feel free to write up this design on 
> > > SPARK-544
> > and
> > > start working on it. I think it looks good.
> > >
> > > Matei
> >
> >
> >
> >
> > --
> > *Shane Huang *
> > *Intel Asia-Pacific R&D Ltd.*
> > *Email: shengsheng.huang@intel.com*
> >
>



--
*Shane Huang *
*Intel Asia-Pacific R&D Ltd.*
*Email: shengsheng.huang@intel.com*

Re: Propose to Re-organize the scripts and configurations

Posted by Shane Huang <sh...@gmail.com>.
Done


On Sun, Sep 22, 2013 at 12:05 PM, Reynold Xin <rx...@cs.berkeley.edu> wrote:

> Thanks, Shane. Can you also link to this mailing list discussion from the
> JIRA ticket?
>
>
> --
> Reynold Xin, AMPLab, UC Berkeley
> http://rxin.org
>
>
>
> On Sat, Sep 21, 2013 at 9:01 PM, Shane Huang <shannie.huang@gmail.com
> >wrote:
>
> > I summarized the opinions about Config in this post and added a comment
> on
> > SPARK-544.
> > Also post here below:
> >
> > 1) Define a Configuration class which contains all the options available
> > for Spark application. A Configuration instance can be de-/serialized
> > from/to a formatted file. Most of us tend to agree that Typesafe Config
> > library is a good choice for the Configuration class.
> > 2) Each application (SparkContext) has one Configuration instance and it
> is
> > initialized by the application which creates it (either coded in app
> (apps
> > could explicitly read from io stream or command line arguments), or
> system
> > properties, or env vars).
> > 3) For an application the overriding rule should be code > system
> > properties > env vars. Over time we will deprecate the env vars and maybe
> > even system properties.
> > 4) When launching an Executor on a slave node, the Configuration is
> firstly
> > initialized using the node-local configuration file as default (instead
> of
> > the env vars at present), and then the Configuration passed from
> > application driver context will override specific options specified in
> > default. Certain options in app's Configuration will always override
> those
> > in node-local, because these options need to be the consistent across all
> > the slave nodes, e.g. spark.serializer. In this case if any such options
> is
> > not set in app's Config, a value will be provided by the system. On the
> > other hand, some options in app's Config will never override those in
> > node-local. as they're not meat to be set in app, e.g. spark.local.dir
> >
> >
> > On Wed, Sep 18, 2013 at 1:42 AM, Matei Zaharia <matei.zaharia@gmail.com
> > >wrote:
> >
> > > Hi Shane,
> > >
> > > I agree with all these points. Improving the configuration system is
> one
> > > of the main things I'd like to have in the next release.
> > >
> > > > 1) Usually the application developers/users and platform
> administrators
> > > > belongs to two teams. So it's better to separate the scripts used by
> > > > administrators and application users, e.g. put them in sbin and bin
> > > folders
> > > > respectively
> > >
> > > Yup, right now we don't have any attempt to install on standard system
> > > paths.
> > >
> > > > 3) If there are multiple ways to specify an option, an overriding
> rule
> > > > should be present and should not be error-prone.
> > >
> > > Yes, I think this should always be Configuration class in code > system
> > > properties > env vars. Over time we will deprecate the env vars and
> maybe
> > > even system properties.
> > >
> > > > 4) Currently the options are set and get using System property. It's
> > hard
> > > > to manage and inconvenient for users. It's good to gather the options
> > > into
> > > > one file using format like xml or json.
> > >
> > > I think this is the main thing to do first -- pick one configuration
> > class
> > > and change the code to use this.
> > >
> > > > Our rough proposal:
> > > >
> > > >   - Scripts
> > > >
> > > >   1. make an "sbin" folder containing all the scripts for
> > administrators,
> > > >   specifically,
> > > >      - all service administration scripts, i.e. start-*, stop-*,
> > > >      slaves.sh, *-daemons, *-daemon scripts
> > > >      - low-level or internally used utility scripts, i.e.
> > > >      compute-classpath, spark-config, spark-class, spark-executor
> > > >   2. make a "bin" folder containing all the scripts for application
> > > >   developers/users, specifically,
> > > >      - user level app  running scripts, i.e. pyspark, spark-shell,
> and
> > we
> > > >      propose to add a script "spark" for users to run applications
> > (very
> > > much
> > > >      like spark-class but may add some more control or convenient
> > > utilities)
> > > >      - scripts for status checking, e.g. spark and hadoop version
> > > >      checking, running applications checking, etc. We can make this a
> > > separate
> > > >      script or add functionality to "spark" script.
> > > >   3. No wandering scripts outside the sbin and bin folders
> > >
> > > Makes sense.
> > >
> > > >   -  Configurations/Options and overriding rule
> > > >
> > > >   1. Define a Configuration class which contains all the options
> > > available
> > > >   for Spark application. A Configuration instance can be
> de-/serialized
> > > >   from/to a json formatted file.
> > > >   2. Each application (SparkContext) has one Configuration instance
> and
> > > it
> > > >   is initialized by the application which creates it (either read
> from
> > > file
> > > >   or passed from command line options or env SPARK_JAVA_OPTS).
> > > >   3. When launching an Executor on a node, the Configuration is
> firstly
> > > >   initialized using the node-local configuration file as default. The
> > > >   Configuration passed from application driver context will override
> > any
> > > >   options specified in default.
> > >
> > > This sounds great to me! The one thing I'll add is that we might want
> to
> > > prevent applications from overriding certain settings on each node,
> such
> > as
> > > work directories. The best way is to probably just ignore the app's
> > version
> > > of those settings in the Executor.
> > >
> > > If you guys would like, feel free to write up this design on SPARK-544
> > and
> > > start working on it. I think it looks good.
> > >
> > > Matei
> >
> >
> >
> >
> > --
> > *Shane Huang *
> > *Intel Asia-Pacific R&D Ltd.*
> > *Email: shengsheng.huang@intel.com*
> >
>



-- 
*Shane Huang *
*Intel Asia-Pacific R&D Ltd.*
*Email: shengsheng.huang@intel.com*

Re: Propose to Re-organize the scripts and configurations

Posted by Reynold Xin <rx...@cs.berkeley.edu>.
Thanks, Shane. Can you also link to this mailing list discussion from the
JIRA ticket?


--
Reynold Xin, AMPLab, UC Berkeley
http://rxin.org



On Sat, Sep 21, 2013 at 9:01 PM, Shane Huang <sh...@gmail.com>wrote:

> I summarized the opinions about Config in this post and added a comment on
> SPARK-544.
> Also post here below:
>
> 1) Define a Configuration class which contains all the options available
> for Spark application. A Configuration instance can be de-/serialized
> from/to a formatted file. Most of us tend to agree that Typesafe Config
> library is a good choice for the Configuration class.
> 2) Each application (SparkContext) has one Configuration instance and it is
> initialized by the application which creates it (either coded in app (apps
> could explicitly read from io stream or command line arguments), or system
> properties, or env vars).
> 3) For an application the overriding rule should be code > system
> properties > env vars. Over time we will deprecate the env vars and maybe
> even system properties.
> 4) When launching an Executor on a slave node, the Configuration is firstly
> initialized using the node-local configuration file as default (instead of
> the env vars at present), and then the Configuration passed from
> application driver context will override specific options specified in
> default. Certain options in app's Configuration will always override those
> in node-local, because these options need to be the consistent across all
> the slave nodes, e.g. spark.serializer. In this case if any such options is
> not set in app's Config, a value will be provided by the system. On the
> other hand, some options in app's Config will never override those in
> node-local. as they're not meat to be set in app, e.g. spark.local.dir
>
>
> On Wed, Sep 18, 2013 at 1:42 AM, Matei Zaharia <matei.zaharia@gmail.com
> >wrote:
>
> > Hi Shane,
> >
> > I agree with all these points. Improving the configuration system is one
> > of the main things I'd like to have in the next release.
> >
> > > 1) Usually the application developers/users and platform administrators
> > > belongs to two teams. So it's better to separate the scripts used by
> > > administrators and application users, e.g. put them in sbin and bin
> > folders
> > > respectively
> >
> > Yup, right now we don't have any attempt to install on standard system
> > paths.
> >
> > > 3) If there are multiple ways to specify an option, an overriding rule
> > > should be present and should not be error-prone.
> >
> > Yes, I think this should always be Configuration class in code > system
> > properties > env vars. Over time we will deprecate the env vars and maybe
> > even system properties.
> >
> > > 4) Currently the options are set and get using System property. It's
> hard
> > > to manage and inconvenient for users. It's good to gather the options
> > into
> > > one file using format like xml or json.
> >
> > I think this is the main thing to do first -- pick one configuration
> class
> > and change the code to use this.
> >
> > > Our rough proposal:
> > >
> > >   - Scripts
> > >
> > >   1. make an "sbin" folder containing all the scripts for
> administrators,
> > >   specifically,
> > >      - all service administration scripts, i.e. start-*, stop-*,
> > >      slaves.sh, *-daemons, *-daemon scripts
> > >      - low-level or internally used utility scripts, i.e.
> > >      compute-classpath, spark-config, spark-class, spark-executor
> > >   2. make a "bin" folder containing all the scripts for application
> > >   developers/users, specifically,
> > >      - user level app  running scripts, i.e. pyspark, spark-shell, and
> we
> > >      propose to add a script "spark" for users to run applications
> (very
> > much
> > >      like spark-class but may add some more control or convenient
> > utilities)
> > >      - scripts for status checking, e.g. spark and hadoop version
> > >      checking, running applications checking, etc. We can make this a
> > separate
> > >      script or add functionality to "spark" script.
> > >   3. No wandering scripts outside the sbin and bin folders
> >
> > Makes sense.
> >
> > >   -  Configurations/Options and overriding rule
> > >
> > >   1. Define a Configuration class which contains all the options
> > available
> > >   for Spark application. A Configuration instance can be de-/serialized
> > >   from/to a json formatted file.
> > >   2. Each application (SparkContext) has one Configuration instance and
> > it
> > >   is initialized by the application which creates it (either read from
> > file
> > >   or passed from command line options or env SPARK_JAVA_OPTS).
> > >   3. When launching an Executor on a node, the Configuration is firstly
> > >   initialized using the node-local configuration file as default. The
> > >   Configuration passed from application driver context will override
> any
> > >   options specified in default.
> >
> > This sounds great to me! The one thing I'll add is that we might want to
> > prevent applications from overriding certain settings on each node, such
> as
> > work directories. The best way is to probably just ignore the app's
> version
> > of those settings in the Executor.
> >
> > If you guys would like, feel free to write up this design on SPARK-544
> and
> > start working on it. I think it looks good.
> >
> > Matei
>
>
>
>
> --
> *Shane Huang *
> *Intel Asia-Pacific R&D Ltd.*
> *Email: shengsheng.huang@intel.com*
>

Re: Propose to Re-organize the scripts and configurations

Posted by Shane Huang <sh...@gmail.com>.
I summarized the opinions about Config in this thread and added a comment on
SPARK-544.
Also posted here below:

1) Define a Configuration class which contains all the options available
to a Spark application. A Configuration instance can be de-/serialized
from/to a formatted file. Most of us tend to agree that the Typesafe Config
library is a good choice for the Configuration class.
2) Each application (SparkContext) has one Configuration instance, and it is
initialized by the application which creates it (either coded in the app
(apps could explicitly read from an I/O stream or command-line arguments),
or via system properties, or env vars).
3) For an application the overriding rule should be code > system
properties > env vars. Over time we will deprecate the env vars and maybe
even system properties.
4) When launching an Executor on a slave node, the Configuration is first
initialized using the node-local configuration file as the default (instead
of the env vars used at present), and then the Configuration passed from the
application driver context overrides specific options in that default.
Certain options in the app's Configuration will always override those in
node-local, because these options need to be consistent across all the slave
nodes, e.g. spark.serializer; if any such option is not set in the app's
Config, a value will be provided by the system. On the other hand, some
options in the app's Config will never override those in node-local, as
they're not meant to be set by the app, e.g. spark.local.dir. (A minimal
sketch of this rule follows below.)
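
Here is a minimal sketch of that executor-side merge rule, using a plain
Map for the configuration; the object name, the key lists, and the system
default value are assumptions made only for illustration:

object ExecutorConfigMerge {
  // Options that must be consistent cluster-wide: the app's value always wins,
  // and the system supplies a default if the app is silent.
  val alwaysFromApp = Set("spark.serializer")
  // Options that are node-specific: the app's value is always ignored.
  val neverFromApp = Set("spark.local.dir")

  def merge(nodeLocal: Map[String, String],
            fromApp: Map[String, String],
            systemDefaults: Map[String, String]): Map[String, String] = {
    // Start from the node-local defaults and let the app override them,
    // except for the protected node-specific keys.
    val merged = nodeLocal ++ (fromApp -- neverFromApp)
    // For cluster-consistent keys, take the app's value or fall back to the
    // system-provided default (assumed to be present in systemDefaults).
    alwaysFromApp.foldLeft(merged) { (conf, key) =>
      conf.updated(key, fromApp.getOrElse(key, systemDefaults(key)))
    }
  }
}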


On Wed, Sep 18, 2013 at 1:42 AM, Matei Zaharia <ma...@gmail.com>wrote:

> Hi Shane,
>
> I agree with all these points. Improving the configuration system is one
> of the main things I'd like to have in the next release.
>
> > 1) Usually the application developers/users and platform administrators
> > belongs to two teams. So it's better to separate the scripts used by
> > administrators and application users, e.g. put them in sbin and bin
> folders
> > respectively
>
> Yup, right now we don't have any attempt to install on standard system
> paths.
>
> > 3) If there are multiple ways to specify an option, an overriding rule
> > should be present and should not be error-prone.
>
> Yes, I think this should always be Configuration class in code > system
> properties > env vars. Over time we will deprecate the env vars and maybe
> even system properties.
>
> > 4) Currently the options are set and get using System property. It's hard
> > to manage and inconvenient for users. It's good to gather the options
> into
> > one file using format like xml or json.
>
> I think this is the main thing to do first -- pick one configuration class
> and change the code to use this.
>
> > Our rough proposal:
> >
> >   - Scripts
> >
> >   1. make an "sbin" folder containing all the scripts for administrators,
> >   specifically,
> >      - all service administration scripts, i.e. start-*, stop-*,
> >      slaves.sh, *-daemons, *-daemon scripts
> >      - low-level or internally used utility scripts, i.e.
> >      compute-classpath, spark-config, spark-class, spark-executor
> >   2. make a "bin" folder containing all the scripts for application
> >   developers/users, specifically,
> >      - user level app  running scripts, i.e. pyspark, spark-shell, and we
> >      propose to add a script "spark" for users to run applications (very
> much
> >      like spark-class but may add some more control or convenient
> utilities)
> >      - scripts for status checking, e.g. spark and hadoop version
> >      checking, running applications checking, etc. We can make this a
> separate
> >      script or add functionality to "spark" script.
> >   3. No wandering scripts outside the sbin and bin folders
>
> Makes sense.
>
> >   -  Configurations/Options and overriding rule
> >
> >   1. Define a Configuration class which contains all the options
> available
> >   for Spark application. A Configuration instance can be de-/serialized
> >   from/to a json formatted file.
> >   2. Each application (SparkContext) has one Configuration instance and
> it
> >   is initialized by the application which creates it (either read from
> file
> >   or passed from command line options or env SPARK_JAVA_OPTS).
> >   3. When launching an Executor on a node, the Configuration is firstly
> >   initialized using the node-local configuration file as default. The
> >   Configuration passed from application driver context will override any
> >   options specified in default.
>
> This sounds great to me! The one thing I'll add is that we might want to
> prevent applications from overriding certain settings on each node, such as
> work directories. The best way is to probably just ignore the app's version
> of those settings in the Executor.
>
> If you guys would like, feel free to write up this design on SPARK-544 and
> start working on it. I think it looks good.
>
> Matei




-- 
*Shane Huang *
*Intel Asia-Pacific R&D Ltd.*
*Email: shengsheng.huang@intel.com*

RE: Propose to Re-organize the scripts and configurations

Posted by "Xia, Junluan" <ju...@intel.com>.
Hi Matei

Shane is on vacation now. I will take charge of this pull request.

-----Original Message-----
From: Matei Zaharia [mailto:matei.zaharia@gmail.com] 
Sent: Thursday, October 10, 2013 1:36 AM
To: dev@spark.incubator.apache.org
Cc: Shane Huang
Subject: Re: Propose to Re-organize the scripts and configurations

Hey Shane, I don't know if you saw my message on GitHub, but I did review this a few days ago: https://github.com/apache/incubator-spark/pull/21. Make sure you're allowing emails from GitHub to get comments. It looks good overall but I had some suggestions in there.

Matei

On Sep 26, 2013, at 7:24 PM, Shane Huang <sh...@gmail.com> wrote:

> I have created a pull request to address the basic needs of our 
> customer for separating the admin and user scripts. Link here 
> https://github.com/apache/incubator-spark/pull/21. Please kindly review.
> And we can also discuss if there's more functionality needed.
> 
> 
> On Sun, Sep 22, 2013 at 12:07 PM, Shane Huang <sh...@gmail.com>wrote:
> 
>> And I created a new issue SPARK-915 to track the re-org of scripts as
>> SPARK-544 only talks about Config.
>> https://spark-project.atlassian.net/browse/SPARK-915
>> 
>> 
>> On Wed, Sep 18, 2013 at 1:42 AM, Matei Zaharia <ma...@gmail.com>wrote:
>> 
>>> Hi Shane,
>>> 
>>> I agree with all these points. Improving the configuration system is 
>>> one of the main things I'd like to have in the next release.
>>> 
>>>> 1) Usually the application developers/users and platform 
>>>> administrators belongs to two teams. So it's better to separate the 
>>>> scripts used by administrators and application users, e.g. put them 
>>>> in sbin and bin
>>> folders
>>>> respectively
>>> 
>>> Yup, right now we don't have any attempt to install on standard 
>>> system paths.
>>> 
>>>> 3) If there are multiple ways to specify an option, an overriding 
>>>> rule should be present and should not be error-prone.
>>> 
>>> Yes, I think this should always be Configuration class in code > 
>>> system properties > env vars. Over time we will deprecate the env 
>>> vars and maybe even system properties.
>>> 
>>>> 4) Currently the options are set and get using System property. 
>>>> It's
>>> hard
>>>> to manage and inconvenient for users. It's good to gather the 
>>>> options
>>> into
>>>> one file using format like xml or json.
>>> 
>>> I think this is the main thing to do first -- pick one configuration 
>>> class and change the code to use this.
>>> 
>>>> Our rough proposal:
>>>> 
>>>>  - Scripts
>>>> 
>>>>  1. make an "sbin" folder containing all the scripts for
>>> administrators,
>>>>  specifically,
>>>>     - all service administration scripts, i.e. start-*, stop-*,
>>>>     slaves.sh, *-daemons, *-daemon scripts
>>>>     - low-level or internally used utility scripts, i.e.
>>>>     compute-classpath, spark-config, spark-class, spark-executor  
>>>> 2. make a "bin" folder containing all the scripts for application  
>>>> developers/users, specifically,
>>>>     - user level app  running scripts, i.e. pyspark, spark-shell, 
>>>> and
>>> we
>>>>     propose to add a script "spark" for users to run applications
>>> (very much
>>>>     like spark-class but may add some more control or convenient
>>> utilities)
>>>>     - scripts for status checking, e.g. spark and hadoop version
>>>>     checking, running applications checking, etc. We can make this 
>>>> a
>>> separate
>>>>     script or add functionality to "spark" script.
>>>>  3. No wandering scripts outside the sbin and bin folders
>>> 
>>> Makes sense.
>>> 
>>>>  -  Configurations/Options and overriding rule
>>>> 
>>>>  1. Define a Configuration class which contains all the options
>>> available
>>>>  for Spark application. A Configuration instance can be 
>>>> de-/serialized  from/to a json formatted file.
>>>>  2. Each application (SparkContext) has one Configuration instance 
>>>> and
>>> it
>>>>  is initialized by the application which creates it (either read 
>>>> from
>>> file
>>>>  or passed from command line options or env SPARK_JAVA_OPTS).
>>>>  3. When launching an Executor on a node, the Configuration is 
>>>> firstly  initialized using the node-local configuration file as 
>>>> default. The  Configuration passed from application driver context 
>>>> will override any  options specified in default.
>>> 
>>> This sounds great to me! The one thing I'll add is that we might 
>>> want to prevent applications from overriding certain settings on 
>>> each node, such as work directories. The best way is to probably 
>>> just ignore the app's version of those settings in the Executor.
>>> 
>>> If you guys would like, feel free to write up this design on 
>>> SPARK-544 and start working on it. I think it looks good.
>>> 
>>> Matei
>> 
>> 
>> 
>> 
>> --
>> *Shane Huang *
>> *Intel Asia-Pacific R&D Ltd.*
>> *Email: shengsheng.huang@intel.com*
>> 
>> 
> 
> 
> --
> *Shane Huang *
> *Intel Asia-Pacific R&D Ltd.*
> *Email: shengsheng.huang@intel.com*


Re: Propose to Re-organize the scripts and configurations

Posted by Matei Zaharia <ma...@gmail.com>.
Hey Shane, I don't know if you saw my message on GitHub, but I did review this a few days ago: https://github.com/apache/incubator-spark/pull/21. Make sure you're allowing emails from GitHub to get comments. It looks good overall but I had some suggestions in there.

Matei

On Sep 26, 2013, at 7:24 PM, Shane Huang <sh...@gmail.com> wrote:

> I have created a pull request to address the basic needs of our customer
> for separating the admin and user scripts. Link here
> https://github.com/apache/incubator-spark/pull/21. Please kindly review.
> And we can also discuss if there's more functionality needed.
> 
> 
> On Sun, Sep 22, 2013 at 12:07 PM, Shane Huang <sh...@gmail.com>wrote:
> 
>> And I created a new issue SPARK-915 to track the re-org of scripts as
>> SPARK-544 only talks about Config.
>> https://spark-project.atlassian.net/browse/SPARK-915
>> 
>> 
>> On Wed, Sep 18, 2013 at 1:42 AM, Matei Zaharia <ma...@gmail.com>wrote:
>> 
>>> Hi Shane,
>>> 
>>> I agree with all these points. Improving the configuration system is one
>>> of the main things I'd like to have in the next release.
>>> 
>>>> 1) Usually the application developers/users and platform administrators
>>>> belongs to two teams. So it's better to separate the scripts used by
>>>> administrators and application users, e.g. put them in sbin and bin
>>> folders
>>>> respectively
>>> 
>>> Yup, right now we don't have any attempt to install on standard system
>>> paths.
>>> 
>>>> 3) If there are multiple ways to specify an option, an overriding rule
>>>> should be present and should not be error-prone.
>>> 
>>> Yes, I think this should always be Configuration class in code > system
>>> properties > env vars. Over time we will deprecate the env vars and maybe
>>> even system properties.
>>> 
>>>> 4) Currently the options are set and get using System property. It's
>>> hard
>>>> to manage and inconvenient for users. It's good to gather the options
>>> into
>>>> one file using format like xml or json.
>>> 
>>> I think this is the main thing to do first -- pick one configuration
>>> class and change the code to use this.
>>> 
>>>> Our rough proposal:
>>>> 
>>>>  - Scripts
>>>> 
>>>>  1. make an "sbin" folder containing all the scripts for
>>> administrators,
>>>>  specifically,
>>>>     - all service administration scripts, i.e. start-*, stop-*,
>>>>     slaves.sh, *-daemons, *-daemon scripts
>>>>     - low-level or internally used utility scripts, i.e.
>>>>     compute-classpath, spark-config, spark-class, spark-executor
>>>>  2. make a "bin" folder containing all the scripts for application
>>>>  developers/users, specifically,
>>>>     - user level app  running scripts, i.e. pyspark, spark-shell, and
>>> we
>>>>     propose to add a script "spark" for users to run applications
>>> (very much
>>>>     like spark-class but may add some more control or convenient
>>> utilities)
>>>>     - scripts for status checking, e.g. spark and hadoop version
>>>>     checking, running applications checking, etc. We can make this a
>>> separate
>>>>     script or add functionality to "spark" script.
>>>>  3. No wandering scripts outside the sbin and bin folders
>>> 
>>> Makes sense.
>>> 
>>>>  -  Configurations/Options and overriding rule
>>>> 
>>>>  1. Define a Configuration class which contains all the options
>>> available
>>>>  for Spark application. A Configuration instance can be de-/serialized
>>>>  from/to a json formatted file.
>>>>  2. Each application (SparkContext) has one Configuration instance and
>>> it
>>>>  is initialized by the application which creates it (either read from
>>> file
>>>>  or passed from command line options or env SPARK_JAVA_OPTS).
>>>>  3. When launching an Executor on a node, the Configuration is firstly
>>>>  initialized using the node-local configuration file as default. The
>>>>  Configuration passed from application driver context will override any
>>>>  options specified in default.
>>> 
>>> This sounds great to me! The one thing I'll add is that we might want to
>>> prevent applications from overriding certain settings on each node, such as
>>> work directories. The best way is to probably just ignore the app's version
>>> of those settings in the Executor.
>>> 
>>> If you guys would like, feel free to write up this design on SPARK-544
>>> and start working on it. I think it looks good.
>>> 
>>> Matei
>> 
>> 
>> 
>> 
>> --
>> *Shane Huang *
>> *Intel Asia-Pacific R&D Ltd.*
>> *Email: shengsheng.huang@intel.com*
>> 
>> 
> 
> 
> -- 
> *Shane Huang *
> *Intel Asia-Pacific R&D Ltd.*
> *Email: shengsheng.huang@intel.com*


Re: Propose to Re-organize the scripts and configurations

Posted by Shane Huang <sh...@gmail.com>.
I have created a pull request to address our customers' basic need for
separating the admin and user scripts. Link here:
https://github.com/apache/incubator-spark/pull/21. Please kindly review,
and we can also discuss whether more functionality is needed.


On Sun, Sep 22, 2013 at 12:07 PM, Shane Huang <sh...@gmail.com>wrote:

> And I created a new issue SPARK-915 to track the re-org of scripts as
> SPARK-544 only talks about Config.
> https://spark-project.atlassian.net/browse/SPARK-915
>
>
> On Wed, Sep 18, 2013 at 1:42 AM, Matei Zaharia <ma...@gmail.com>wrote:
>
>> Hi Shane,
>>
>> I agree with all these points. Improving the configuration system is one
>> of the main things I'd like to have in the next release.
>>
>> > 1) Usually the application developers/users and platform administrators
>> > belongs to two teams. So it's better to separate the scripts used by
>> > administrators and application users, e.g. put them in sbin and bin
>> folders
>> > respectively
>>
>> Yup, right now we don't have any attempt to install on standard system
>> paths.
>>
>> > 3) If there are multiple ways to specify an option, an overriding rule
>> > should be present and should not be error-prone.
>>
>> Yes, I think this should always be Configuration class in code > system
>> properties > env vars. Over time we will deprecate the env vars and maybe
>> even system properties.
>>
>> > 4) Currently the options are set and get using System property. It's
>> hard
>> > to manage and inconvenient for users. It's good to gather the options
>> into
>> > one file using format like xml or json.
>>
>> I think this is the main thing to do first -- pick one configuration
>> class and change the code to use this.
>>
>> > Our rough proposal:
>> >
>> >   - Scripts
>> >
>> >   1. make an "sbin" folder containing all the scripts for
>> administrators,
>> >   specifically,
>> >      - all service administration scripts, i.e. start-*, stop-*,
>> >      slaves.sh, *-daemons, *-daemon scripts
>> >      - low-level or internally used utility scripts, i.e.
>> >      compute-classpath, spark-config, spark-class, spark-executor
>> >   2. make a "bin" folder containing all the scripts for application
>> >   developers/users, specifically,
>> >      - user level app  running scripts, i.e. pyspark, spark-shell, and
>> we
>> >      propose to add a script "spark" for users to run applications
>> (very much
>> >      like spark-class but may add some more control or convenient
>> utilities)
>> >      - scripts for status checking, e.g. spark and hadoop version
>> >      checking, running applications checking, etc. We can make this a
>> separate
>> >      script or add functionality to "spark" script.
>> >   3. No wandering scripts outside the sbin and bin folders
>>
>> Makes sense.
>>
>> >   -  Configurations/Options and overriding rule
>> >
>> >   1. Define a Configuration class which contains all the options
>> available
>> >   for Spark application. A Configuration instance can be de-/serialized
>> >   from/to a json formatted file.
>> >   2. Each application (SparkContext) has one Configuration instance and
>> it
>> >   is initialized by the application which creates it (either read from
>> file
>> >   or passed from command line options or env SPARK_JAVA_OPTS).
>> >   3. When launching an Executor on a node, the Configuration is firstly
>> >   initialized using the node-local configuration file as default. The
>> >   Configuration passed from application driver context will override any
>> >   options specified in default.
>>
>> This sounds great to me! The one thing I'll add is that we might want to
>> prevent applications from overriding certain settings on each node, such as
>> work directories. The best way is to probably just ignore the app's version
>> of those settings in the Executor.
>>
>> If you guys would like, feel free to write up this design on SPARK-544
>> and start working on it. I think it looks good.
>>
>> Matei
>
>
>
>
> --
> *Shane Huang *
> *Intel Asia-Pacific R&D Ltd.*
> *Email: shengsheng.huang@intel.com*
>
>


-- 
*Shane Huang *
*Intel Asia-Pacific R&D Ltd.*
*Email: shengsheng.huang@intel.com*

Re: Propose to Re-organize the scripts and configurations

Posted by Chester <ch...@yahoo.com>.
+1 for Typesafe ConfigFactory and Config

Sent from my iPad

On Sep 25, 2013, at 11:42 PM, Evan Chan <ev...@ooyala.com> wrote:

> Shane, and others,
> 
> Let's work together on the configuration thing.   I had proposed in a
> separate thread to use Typesafe Config to hold all configuration
> (essentially a configuration class, but which can read from both JSON files
> as well as -D java command line args).
> 
> Typesafe Config works much much better than a simple config class, and also
> better than Hadoop configs.  It also has advantages over JSON (more
> readable, comments).   It would also be the easiest to transition from the
> current scheme, since the current java system properties can be seamlessly
> integrated.
> 
> I would be happy to contribute this back soon because it is also a big pain
> point for us.  I also have extensive experience with both Typesafe Config
> and other config systems.
> 
> I would definitely start with SparkContext and work our way out from there.
>   In fact I can submit a patch for everyone to test out fairly quickly
> just for SparkContext.
> 
> -Evan
> 
> 
> 
> On Tue, Sep 24, 2013 at 10:26 PM, Shane Huang <sh...@gmail.com>wrote:
> 
>> I think it's good to have Bigtop to package Spark. But in this track we're
>> just targeting enhancing the usability of Spark itself without Bigtop.
>> After all, few of our customers used Bigtop.
>> 
>> 
>> On Wed, Sep 25, 2013 at 1:20 PM, Shane Huang <shannie.huang@gmail.com
>>> wrote:
>> 
>>> I think it's good to have Bigtop to package Spark. But I
>>> 
>>> 
>>> On Wed, Sep 25, 2013 at 1:16 PM, Konstantin Boudnik <cos@apache.org
>>> wrote:
>>> 
>>>> Late to the game, but... Bigtop is packaging Spark now as a part of the
>>>> standard distribution - our release 0.7.0 is around the corner. And we
>> do
>>>> it
>>>> in the same way that has been done for Hadoop. Perhaps it worth looking
>>>> into...
>>>> 
>>>> Lemme know if you have any questions,
>>>>  Cos
>>>> 
>>>> On Sun, Sep 22, 2013 at 12:07PM, Shane Huang wrote:
>>>>> And I created a new issue SPARK-915 to track the re-org of scripts as
>>>>> SPARK-544 only talks about Config.
>>>>> https://spark-project.atlassian.net/browse/SPARK-915
>>>>> 
>>>>> 
>>>>> On Wed, Sep 18, 2013 at 1:42 AM, Matei Zaharia <
>> matei.zaharia@gmail.com
>>>>> wrote:
>>>>> 
>>>>>> Hi Shane,
>>>>>> 
>>>>>> I agree with all these points. Improving the configuration system is
>>>> one
>>>>>> of the main things I'd like to have in the next release.
>>>>>> 
>>>>>>> 1) Usually the application developers/users and platform
>>>> administrators
>>>>>>> belongs to two teams. So it's better to separate the scripts used
>> by
>>>>>>> administrators and application users, e.g. put them in sbin and
>> bin
>>>>>> folders
>>>>>>> respectively
>>>>>> 
>>>>>> Yup, right now we don't have any attempt to install on standard
>> system
>>>>>> paths.
>>>>>> 
>>>>>>> 3) If there are multiple ways to specify an option, an overriding
>>>> rule
>>>>>>> should be present and should not be error-prone.
>>>>>> 
>>>>>> Yes, I think this should always be Configuration class in code >
>>>> system
>>>>>> properties > env vars. Over time we will deprecate the env vars and
>>>> maybe
>>>>>> even system properties.
>>>>>> 
>>>>>>> 4) Currently the options are set and get using System property.
>>>> It's hard
>>>>>>> to manage and inconvenient for users. It's good to gather the
>>>> options
>>>>>> into
>>>>>>> one file using format like xml or json.
>>>>>> 
>>>>>> I think this is the main thing to do first -- pick one configuration
>>>> class
>>>>>> and change the code to use this.
>>>>>> 
>>>>>>> Our rough proposal:
>>>>>>> 
>>>>>>>  - Scripts
>>>>>>> 
>>>>>>>  1. make an "sbin" folder containing all the scripts for
>>>> administrators,
>>>>>>>  specifically,
>>>>>>>     - all service administration scripts, i.e. start-*, stop-*,
>>>>>>>     slaves.sh, *-daemons, *-daemon scripts
>>>>>>>     - low-level or internally used utility scripts, i.e.
>>>>>>>     compute-classpath, spark-config, spark-class, spark-executor
>>>>>>>  2. make a "bin" folder containing all the scripts for
>> application
>>>>>>>  developers/users, specifically,
>>>>>>>     - user level app  running scripts, i.e. pyspark, spark-shell,
>>>> and we
>>>>>>>     propose to add a script "spark" for users to run applications
>>>> (very
>>>>>> much
>>>>>>>     like spark-class but may add some more control or convenient
>>>>>> utilities)
>>>>>>>     - scripts for status checking, e.g. spark and hadoop version
>>>>>>>     checking, running applications checking, etc. We can make
>> this
>>>> a
>>>>>> separate
>>>>>>>     script or add functionality to "spark" script.
>>>>>>>  3. No wandering scripts outside the sbin and bin folders
>>>>>> 
>>>>>> Makes sense.
>>>>>> 
>>>>>>>  -  Configurations/Options and overriding rule
>>>>>>> 
>>>>>>>  1. Define a Configuration class which contains all the options
>>>>>> available
>>>>>>>  for Spark application. A Configuration instance can be
>>>> de-/serialized
>>>>>>>  from/to a json formatted file.
>>>>>>>  2. Each application (SparkContext) has one Configuration
>> instance
>>>> and
>>>>>> it
>>>>>>>  is initialized by the application which creates it (either read
>>>> from
>>>>>> file
>>>>>>>  or passed from command line options or env SPARK_JAVA_OPTS).
>>>>>>>  3. When launching an Executor on a node, the Configuration is
>>>> firstly
>>>>>>>  initialized using the node-local configuration file as default.
>>>> The
>>>>>>>  Configuration passed from application driver context will
>>>> override any
>>>>>>>  options specified in default.
>>>>>> 
>>>>>> This sounds great to me! The one thing I'll add is that we might
>> want
>>>> to
>>>>>> prevent applications from overriding certain settings on each node,
>>>> such as
>>>>>> work directories. The best way is to probably just ignore the app's
>>>> version
>>>>>> of those settings in the Executor.
>>>>>> 
>>>>>> If you guys would like, feel free to write up this design on
>>>> SPARK-544 and
>>>>>> start working on it. I think it looks good.
>>>>>> 
>>>>>> Matei
>>>>> 
>>>>> 
>>>>> 
>>>>> 
>>>>> --
>>>>> *Shane Huang *
>>>>> *Intel Asia-Pacific R&D Ltd.*
>>>>> *Email: shengsheng.huang@intel.com*
>>>> 
>>> 
>>> 
>>> 
>>> --
>>> *Shane Huang *
>>> *Intel Asia-Pacific R&D Ltd.*
>>> *Email: shengsheng.huang@intel.com*
>>> 
>>> 
>> 
>> 
>> --
>> *Shane Huang *
>> *Intel Asia-Pacific R&D Ltd.*
>> *Email: shengsheng.huang@intel.com*
>> 
> 
> 
> 
> -- 
> --
> Evan Chan
> Staff Engineer
> ev@ooyala.com  |
> 
> <http://www.ooyala.com/>
> <http://www.facebook.com/ooyala><http://www.linkedin.com/company/ooyala><http://www.twitter.com/ooyala>

Re: Propose to Re-organize the scripts and configurations

Posted by Evan Chan <ev...@ooyala.com>.
Hi Shane, Junluan,

Definitely, let's cooperate.  Should we chat offline?

-Evan



On Thu, Sep 26, 2013 at 1:24 AM, Xia, Junluan <ju...@intel.com> wrote:

> Hi Chan
>
> Shane and I happened to try to contribute configure feature to spark,
> could we cooperate to implement it?
>
> -----Original Message-----
> From: Evan Chan [mailto:ev@ooyala.com]
> Sent: Thursday, September 26, 2013 2:43 PM
> To: dev@spark.incubator.apache.org
> Subject: Re: Propose to Re-organize the scripts and configurations
>
> Shane, and others,
>
> Let's work together on the configuration thing.   I had proposed in a
> separate thread to use Typesafe Config to hold all configuration
> (essentially a configuration class, but which can read from both JSON files
> as well as -D java command line args).
>
> Typesafe Config works much much better than a simple config class, and
> also better than Hadoop configs.  It also has advantages over JSON (more
> readable, comments).   It would also be the easiest to transition from the
> current scheme, since the current java system properties can be seamlessly
> integrated.
>
> I would be happy to contribute this back soon because it is also a big
> pain point for us.  I also have extensive experience with both Typesafe
> Config and other config systems.
>
> I would definitely start with SparkContext and work our way out from there.
>    In fact I can submit a patch for everyone to test out fairly quickly
> just for SparkContext.
>
> -Evan
>
>
>
> On Tue, Sep 24, 2013 at 10:26 PM, Shane Huang <shannie.huang@gmail.com
> >wrote:
>
> > I think it's good to have Bigtop to package Spark. But in this track
> > we're just targeting enhancing the usability of Spark itself without
> Bigtop.
> >  After all, few of our customers used Bigtop.
> >
> >
> > On Wed, Sep 25, 2013 at 1:20 PM, Shane Huang <shannie.huang@gmail.com
> > >wrote:
> >
> > > I think it's good to have Bigtop to package Spark. But I
> > >
> > >
> > > On Wed, Sep 25, 2013 at 1:16 PM, Konstantin Boudnik <cos@apache.org
> > >wrote:
> > >
> > >> Late to the game, but... Bigtop is packaging Spark now as a part of
> > >> the standard distribution - our release 0.7.0 is around the corner.
> > >> And we
> > do
> > >> it
> > >> in the same way that has been done for Hadoop. Perhaps it worth
> > >> looking into...
> > >>
> > >> Lemme know if you have any questions,
> > >>   Cos
> > >>
> > >> On Sun, Sep 22, 2013 at 12:07PM, Shane Huang wrote:
> > >> > And I created a new issue SPARK-915 to track the re-org of
> > >> > scripts as
> > >> > SPARK-544 only talks about Config.
> > >> > https://spark-project.atlassian.net/browse/SPARK-915
> > >> >
> > >> >
> > >> > On Wed, Sep 18, 2013 at 1:42 AM, Matei Zaharia <
> > matei.zaharia@gmail.com
> > >> >wrote:
> > >> >
> > >> > > Hi Shane,
> > >> > >
> > >> > > I agree with all these points. Improving the configuration
> > >> > > system is
> > >> one
> > >> > > of the main things I'd like to have in the next release.
> > >> > >
> > >> > > > 1) Usually the application developers/users and platform
> > >> administrators
> > >> > > > belongs to two teams. So it's better to separate the scripts
> > >> > > > used
> > by
> > >> > > > administrators and application users, e.g. put them in sbin
> > >> > > > and
> > bin
> > >> > > folders
> > >> > > > respectively
> > >> > >
> > >> > > Yup, right now we don't have any attempt to install on standard
> > system
> > >> > > paths.
> > >> > >
> > >> > > > 3) If there are multiple ways to specify an option, an
> > >> > > > overriding
> > >> rule
> > >> > > > should be present and should not be error-prone.
> > >> > >
> > >> > > Yes, I think this should always be Configuration class in code
> > >> > > >
> > >> system
> > >> > > properties > env vars. Over time we will deprecate the env vars
> > >> > > and
> > >> maybe
> > >> > > even system properties.
> > >> > >
> > >> > > > 4) Currently the options are set and get using System property.
> > >> It's hard
> > >> > > > to manage and inconvenient for users. It's good to gather the
> > >> options
> > >> > > into
> > >> > > > one file using format like xml or json.
> > >> > >
> > >> > > I think this is the main thing to do first -- pick one
> > >> > > configuration
> > >> class
> > >> > > and change the code to use this.
> > >> > >
> > >> > > > Our rough proposal:
> > >> > > >
> > >> > > >   - Scripts
> > >> > > >
> > >> > > >   1. make an "sbin" folder containing all the scripts for
> > >> administrators,
> > >> > > >   specifically,
> > >> > > >      - all service administration scripts, i.e. start-*, stop-*,
> > >> > > >      slaves.sh, *-daemons, *-daemon scripts
> > >> > > >      - low-level or internally used utility scripts, i.e.
> > >> > > >      compute-classpath, spark-config, spark-class,
> spark-executor
> > >> > > >   2. make a "bin" folder containing all the scripts for
> > application
> > >> > > >   developers/users, specifically,
> > >> > > >      - user level app  running scripts, i.e. pyspark,
> > >> > > > spark-shell,
> > >> and we
> > >> > > >      propose to add a script "spark" for users to run
> > >> > > > applications
> > >> (very
> > >> > > much
> > >> > > >      like spark-class but may add some more control or
> > >> > > > convenient
> > >> > > utilities)
> > >> > > >      - scripts for status checking, e.g. spark and hadoop
> version
> > >> > > >      checking, running applications checking, etc. We can
> > >> > > > make
> > this
> > >> a
> > >> > > separate
> > >> > > >      script or add functionality to "spark" script.
> > >> > > >   3. No wandering scripts outside the sbin and bin folders
> > >> > >
> > >> > > Makes sense.
> > >> > >
> > >> > > >   -  Configurations/Options and overriding rule
> > >> > > >
> > >> > > >   1. Define a Configuration class which contains all the
> > >> > > > options
> > >> > > available
> > >> > > >   for Spark application. A Configuration instance can be
> > >> de-/serialized
> > >> > > >   from/to a json formatted file.
> > >> > > >   2. Each application (SparkContext) has one Configuration
> > instance
> > >> and
> > >> > > it
> > >> > > >   is initialized by the application which creates it (either
> > >> > > > read
> > >> from
> > >> > > file
> > >> > > >   or passed from command line options or env SPARK_JAVA_OPTS).
> > >> > > >   3. When launching an Executor on a node, the Configuration
> > >> > > > is
> > >> firstly
> > >> > > >   initialized using the node-local configuration file as
> default.
> > >> The
> > >> > > >   Configuration passed from application driver context will
> > >> override any
> > >> > > >   options specified in default.
> > >> > >
> > >> > > This sounds great to me! The one thing I'll add is that we
> > >> > > might
> > want
> > >> to
> > >> > > prevent applications from overriding certain settings on each
> > >> > > node,
> > >> such as
> > >> > > work directories. The best way is to probably just ignore the
> > >> > > app's
> > >> version
> > >> > > of those settings in the Executor.
> > >> > >
> > >> > > If you guys would like, feel free to write up this design on
> > >> SPARK-544 and
> > >> > > start working on it. I think it looks good.
> > >> > >
> > >> > > Matei
> > >> >
> > >> >
> > >> >
> > >> >
> > >> > --
> > >> > *Shane Huang *
> > >> > *Intel Asia-Pacific R&D Ltd.*
> > >> > *Email: shengsheng.huang@intel.com*
> > >>
> > >
> > >
> > >
> > > --
> > > *Shane Huang *
> > > *Intel Asia-Pacific R&D Ltd.*
> > > *Email: shengsheng.huang@intel.com*
> > >
> > >
> >
> >
> > --
> > *Shane Huang *
> > *Intel Asia-Pacific R&D Ltd.*
> > *Email: shengsheng.huang@intel.com*
> >
>
>
>
> --
> --
> Evan Chan
> Staff Engineer
> ev@ooyala.com  |
>
> <http://www.ooyala.com/>
> <http://www.facebook.com/ooyala><http://www.linkedin.com/company/ooyala><
> http://www.twitter.com/ooyala>
>



-- 
--
Evan Chan
Staff Engineer
ev@ooyala.com

RE: Propose to Re-organize the scripts and configurations

Posted by "Xia, Junluan" <ju...@intel.com>.
Hi Chan

Shane and I happen to be working on contributing a configuration feature to Spark as well; could we cooperate on implementing it?

-----Original Message-----
From: Evan Chan [mailto:ev@ooyala.com] 
Sent: Thursday, September 26, 2013 2:43 PM
To: dev@spark.incubator.apache.org
Subject: Re: Propose to Re-organize the scripts and configurations

Shane, and others,

Let's work together on the configuration thing.   I had proposed in a
separate thread to use Typesafe Config to hold all configuration (essentially a configuration class, but which can read from both JSON files as well as -D java command line args).

Typesafe Config works much much better than a simple config class, and also better than Hadoop configs.  It also has advantages over JSON (more
readable, comments).   It would also be the easiest to transition from the
current scheme, since the current java system properties can be seamlessly integrated.

I would be happy to contribute this back soon because it is also a big pain point for us.  I also have extensive experience with both Typesafe Config and other config systems.

I would definitely start with SparkContext and work our way out from there.
   In fact I can submit a patch for everyone to test out fairly quickly just for SparkContext.

-Evan



On Tue, Sep 24, 2013 at 10:26 PM, Shane Huang <sh...@gmail.com>wrote:

> I think it's good to have Bigtop to package Spark. But in this track 
> we're just targeting enhancing the usability of Spark itself without Bigtop.
>  After all, few of our customers used Bigtop.
>
>
> On Wed, Sep 25, 2013 at 1:20 PM, Shane Huang <shannie.huang@gmail.com
> >wrote:
>
> > I think it's good to have Bigtop to package Spark. But I
> >
> >
> > On Wed, Sep 25, 2013 at 1:16 PM, Konstantin Boudnik <cos@apache.org
> >wrote:
> >
> >> Late to the game, but... Bigtop is packaging Spark now as a part of 
> >> the standard distribution - our release 0.7.0 is around the corner. 
> >> And we
> do
> >> it
> >> in the same way that has been done for Hadoop. Perhaps it worth 
> >> looking into...
> >>
> >> Lemme know if you have any questions,
> >>   Cos
> >>
> >> On Sun, Sep 22, 2013 at 12:07PM, Shane Huang wrote:
> >> > And I created a new issue SPARK-915 to track the re-org of 
> >> > scripts as
> >> > SPARK-544 only talks about Config.
> >> > https://spark-project.atlassian.net/browse/SPARK-915
> >> >
> >> >
> >> > On Wed, Sep 18, 2013 at 1:42 AM, Matei Zaharia <
> matei.zaharia@gmail.com
> >> >wrote:
> >> >
> >> > > Hi Shane,
> >> > >
> >> > > I agree with all these points. Improving the configuration 
> >> > > system is
> >> one
> >> > > of the main things I'd like to have in the next release.
> >> > >
> >> > > > 1) Usually the application developers/users and platform
> >> administrators
> >> > > > belongs to two teams. So it's better to separate the scripts 
> >> > > > used
> by
> >> > > > administrators and application users, e.g. put them in sbin 
> >> > > > and
> bin
> >> > > folders
> >> > > > respectively
> >> > >
> >> > > Yup, right now we don't have any attempt to install on standard
> system
> >> > > paths.
> >> > >
> >> > > > 3) If there are multiple ways to specify an option, an 
> >> > > > overriding
> >> rule
> >> > > > should be present and should not be error-prone.
> >> > >
> >> > > Yes, I think this should always be Configuration class in code 
> >> > > >
> >> system
> >> > > properties > env vars. Over time we will deprecate the env vars 
> >> > > and
> >> maybe
> >> > > even system properties.
> >> > >
> >> > > > 4) Currently the options are set and get using System property.
> >> It's hard
> >> > > > to manage and inconvenient for users. It's good to gather the
> >> options
> >> > > into
> >> > > > one file using format like xml or json.
> >> > >
> >> > > I think this is the main thing to do first -- pick one 
> >> > > configuration
> >> class
> >> > > and change the code to use this.
> >> > >
> >> > > > Our rough proposal:
> >> > > >
> >> > > >   - Scripts
> >> > > >
> >> > > >   1. make an "sbin" folder containing all the scripts for
> >> administrators,
> >> > > >   specifically,
> >> > > >      - all service administration scripts, i.e. start-*, stop-*,
> >> > > >      slaves.sh, *-daemons, *-daemon scripts
> >> > > >      - low-level or internally used utility scripts, i.e.
> >> > > >      compute-classpath, spark-config, spark-class, spark-executor
> >> > > >   2. make a "bin" folder containing all the scripts for
> application
> >> > > >   developers/users, specifically,
> >> > > >      - user level app  running scripts, i.e. pyspark, 
> >> > > > spark-shell,
> >> and we
> >> > > >      propose to add a script "spark" for users to run 
> >> > > > applications
> >> (very
> >> > > much
> >> > > >      like spark-class but may add some more control or 
> >> > > > convenient
> >> > > utilities)
> >> > > >      - scripts for status checking, e.g. spark and hadoop version
> >> > > >      checking, running applications checking, etc. We can 
> >> > > > make
> this
> >> a
> >> > > separate
> >> > > >      script or add functionality to "spark" script.
> >> > > >   3. No wandering scripts outside the sbin and bin folders
> >> > >
> >> > > Makes sense.
> >> > >
> >> > > >   -  Configurations/Options and overriding rule
> >> > > >
> >> > > >   1. Define a Configuration class which contains all the 
> >> > > > options
> >> > > available
> >> > > >   for Spark application. A Configuration instance can be
> >> de-/serialized
> >> > > >   from/to a json formatted file.
> >> > > >   2. Each application (SparkContext) has one Configuration
> instance
> >> and
> >> > > it
> >> > > >   is initialized by the application which creates it (either 
> >> > > > read
> >> from
> >> > > file
> >> > > >   or passed from command line options or env SPARK_JAVA_OPTS).
> >> > > >   3. When launching an Executor on a node, the Configuration 
> >> > > > is
> >> firstly
> >> > > >   initialized using the node-local configuration file as default.
> >> The
> >> > > >   Configuration passed from application driver context will
> >> override any
> >> > > >   options specified in default.
> >> > >
> >> > > This sounds great to me! The one thing I'll add is that we 
> >> > > might
> want
> >> to
> >> > > prevent applications from overriding certain settings on each 
> >> > > node,
> >> such as
> >> > > work directories. The best way is to probably just ignore the 
> >> > > app's
> >> version
> >> > > of those settings in the Executor.
> >> > >
> >> > > If you guys would like, feel free to write up this design on
> >> SPARK-544 and
> >> > > start working on it. I think it looks good.
> >> > >
> >> > > Matei
> >> >
> >> >
> >> >
> >> >
> >> > --
> >> > *Shane Huang *
> >> > *Intel Asia-Pacific R&D Ltd.*
> >> > *Email: shengsheng.huang@intel.com*
> >>
> >
> >
> >
> > --
> > *Shane Huang *
> > *Intel Asia-Pacific R&D Ltd.*
> > *Email: shengsheng.huang@intel.com*
> >
> >
>
>
> --
> *Shane Huang *
> *Intel Asia-Pacific R&D Ltd.*
> *Email: shengsheng.huang@intel.com*
>



--
--
Evan Chan
Staff Engineer
ev@ooyala.com

Re: Propose to Re-organize the scripts and configurations

Posted by Evan Chan <ev...@ooyala.com>.
Shane, and others,

Let's work together on the configuration thing.   I had proposed in a
separate thread to use Typesafe Config to hold all configuration
(essentially a configuration class, but which can read from both JSON files
as well as -D java command line args).

Typesafe Config works much much better than a simple config class, and also
better than Hadoop configs.  It also has advantages over JSON (more
readable, comments).   It would also be the easiest to transition from the
current scheme, since the current java system properties can be seamlessly
integrated.
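
A minimal sketch of that layering (the file name and keys below are only
illustrative):

    import java.io.File
    import com.typesafe.config.{Config, ConfigFactory}

    // defaults shipped in a file; JSON is valid HOCON, so a .json file parses as-is
    val defaults: Config = ConfigFactory.parseFile(new File("conf/spark-defaults.json"))

    // -D java options (e.g. -Dspark.local.dir=/mnt/spark) take precedence over the file
    val conf: Config = ConfigFactory.systemProperties()
      .withFallback(defaults)
      .resolve()

    val localDir = conf.getString("spark.local.dir")

Since -D options are exactly the system properties Spark already reads today,
existing setups would keep working while the file becomes the documented
default.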

I would be happy to contribute this back soon because it is also a big pain
point for us.  I also have extensive experience with both Typesafe Config
and other config systems.

I would definitely start with SparkContext and work our way out from there.
In fact I can submit a patch for everyone to test out fairly quickly
just for SparkContext.

-Evan



On Tue, Sep 24, 2013 at 10:26 PM, Shane Huang <sh...@gmail.com>wrote:

> I think it's good to have Bigtop to package Spark. But in this track we're
> just targeting enhancing the usability of Spark itself without Bigtop.
>  After all, few of our customers used Bigtop.
>
>
> On Wed, Sep 25, 2013 at 1:20 PM, Shane Huang <shannie.huang@gmail.com
> >wrote:
>
> > I think it's good to have Bigtop to package Spark. But I
> >
> >
> > On Wed, Sep 25, 2013 at 1:16 PM, Konstantin Boudnik <cos@apache.org
> >wrote:
> >
> >> Late to the game, but... Bigtop is packaging Spark now as a part of the
> >> standard distribution - our release 0.7.0 is around the corner. And we
> do
> >> it
> >> in the same way that has been done for Hadoop. Perhaps it worth looking
> >> into...
> >>
> >> Lemme know if you have any questions,
> >>   Cos
> >>
> >> On Sun, Sep 22, 2013 at 12:07PM, Shane Huang wrote:
> >> > And I created a new issue SPARK-915 to track the re-org of scripts as
> >> > SPARK-544 only talks about Config.
> >> > https://spark-project.atlassian.net/browse/SPARK-915
> >> >
> >> >
> >> > On Wed, Sep 18, 2013 at 1:42 AM, Matei Zaharia <
> matei.zaharia@gmail.com
> >> >wrote:
> >> >
> >> > > Hi Shane,
> >> > >
> >> > > I agree with all these points. Improving the configuration system is
> >> one
> >> > > of the main things I'd like to have in the next release.
> >> > >
> >> > > > 1) Usually the application developers/users and platform
> >> administrators
> >> > > > belongs to two teams. So it's better to separate the scripts used
> by
> >> > > > administrators and application users, e.g. put them in sbin and
> bin
> >> > > folders
> >> > > > respectively
> >> > >
> >> > > Yup, right now we don't have any attempt to install on standard
> system
> >> > > paths.
> >> > >
> >> > > > 3) If there are multiple ways to specify an option, an overriding
> >> rule
> >> > > > should be present and should not be error-prone.
> >> > >
> >> > > Yes, I think this should always be Configuration class in code >
> >> system
> >> > > properties > env vars. Over time we will deprecate the env vars and
> >> maybe
> >> > > even system properties.
> >> > >
> >> > > > 4) Currently the options are set and get using System property.
> >> It's hard
> >> > > > to manage and inconvenient for users. It's good to gather the
> >> options
> >> > > into
> >> > > > one file using format like xml or json.
> >> > >
> >> > > I think this is the main thing to do first -- pick one configuration
> >> class
> >> > > and change the code to use this.
> >> > >
> >> > > > Our rough proposal:
> >> > > >
> >> > > >   - Scripts
> >> > > >
> >> > > >   1. make an "sbin" folder containing all the scripts for
> >> administrators,
> >> > > >   specifically,
> >> > > >      - all service administration scripts, i.e. start-*, stop-*,
> >> > > >      slaves.sh, *-daemons, *-daemon scripts
> >> > > >      - low-level or internally used utility scripts, i.e.
> >> > > >      compute-classpath, spark-config, spark-class, spark-executor
> >> > > >   2. make a "bin" folder containing all the scripts for
> application
> >> > > >   developers/users, specifically,
> >> > > >      - user level app  running scripts, i.e. pyspark, spark-shell,
> >> and we
> >> > > >      propose to add a script "spark" for users to run applications
> >> (very
> >> > > much
> >> > > >      like spark-class but may add some more control or convenient
> >> > > utilities)
> >> > > >      - scripts for status checking, e.g. spark and hadoop version
> >> > > >      checking, running applications checking, etc. We can make
> this
> >> a
> >> > > separate
> >> > > >      script or add functionality to "spark" script.
> >> > > >   3. No wandering scripts outside the sbin and bin folders
> >> > >
> >> > > Makes sense.
> >> > >
> >> > > >   -  Configurations/Options and overriding rule
> >> > > >
> >> > > >   1. Define a Configuration class which contains all the options
> >> > > available
> >> > > >   for Spark application. A Configuration instance can be
> >> de-/serialized
> >> > > >   from/to a json formatted file.
> >> > > >   2. Each application (SparkContext) has one Configuration
> instance
> >> and
> >> > > it
> >> > > >   is initialized by the application which creates it (either read
> >> from
> >> > > file
> >> > > >   or passed from command line options or env SPARK_JAVA_OPTS).
> >> > > >   3. When launching an Executor on a node, the Configuration is
> >> firstly
> >> > > >   initialized using the node-local configuration file as default.
> >> The
> >> > > >   Configuration passed from application driver context will
> >> override any
> >> > > >   options specified in default.
> >> > >
> >> > > This sounds great to me! The one thing I'll add is that we might
> want
> >> to
> >> > > prevent applications from overriding certain settings on each node,
> >> such as
> >> > > work directories. The best way is to probably just ignore the app's
> >> version
> >> > > of those settings in the Executor.
> >> > >
> >> > > If you guys would like, feel free to write up this design on
> >> SPARK-544 and
> >> > > start working on it. I think it looks good.
> >> > >
> >> > > Matei
> >> >
> >> >
> >> >
> >> >
> >> > --
> >> > *Shane Huang *
> >> > *Intel Asia-Pacific R&D Ltd.*
> >> > *Email: shengsheng.huang@intel.com*
> >>
> >
> >
> >
> > --
> > *Shane Huang *
> > *Intel Asia-Pacific R&D Ltd.*
> > *Email: shengsheng.huang@intel.com*
> >
> >
>
>
> --
> *Shane Huang *
> *Intel Asia-Pacific R&D Ltd.*
> *Email: shengsheng.huang@intel.com*
>



-- 
--
Evan Chan
Staff Engineer
ev@ooyala.com  |


Re: Propose to Re-organize the scripts and configurations

Posted by Konstantin Boudnik <co...@apache.org>.
On Wed, Sep 25, 2013 at 01:26PM, Shane Huang wrote:
> I think it's good to have Bigtop to package Spark. But in this track we're
> just targeting enhancing the usability of Spark itself without Bigtop.
>  After all, few of our customers used Bigtop.

Bigtop - in this particular application - is just a way of building
packages.

But what I am saying is that Bigtop has this structure in place already - it
can just be copied.

Cos

> On Wed, Sep 25, 2013 at 1:20 PM, Shane Huang <sh...@gmail.com>wrote:
> 
> > I think it's good to have Bigtop to package Spark. But I
> >
> >
> > On Wed, Sep 25, 2013 at 1:16 PM, Konstantin Boudnik <co...@apache.org>wrote:
> >
> >> Late to the game, but... Bigtop is packaging Spark now as a part of the
> >> standard distribution - our release 0.7.0 is around the corner. And we do
> >> it
> >> in the same way that has been done for Hadoop. Perhaps it worth looking
> >> into...
> >>
> >> Lemme know if you have any questions,
> >>   Cos
> >>
> >> On Sun, Sep 22, 2013 at 12:07PM, Shane Huang wrote:
> >> > And I created a new issue SPARK-915 to track the re-org of scripts as
> >> > SPARK-544 only talks about Config.
> >> > https://spark-project.atlassian.net/browse/SPARK-915
> >> >
> >> >
> >> > On Wed, Sep 18, 2013 at 1:42 AM, Matei Zaharia <matei.zaharia@gmail.com
> >> >wrote:
> >> >
> >> > > Hi Shane,
> >> > >
> >> > > I agree with all these points. Improving the configuration system is
> >> one
> >> > > of the main things I'd like to have in the next release.
> >> > >
> >> > > > 1) Usually the application developers/users and platform
> >> administrators
> >> > > > belongs to two teams. So it's better to separate the scripts used by
> >> > > > administrators and application users, e.g. put them in sbin and bin
> >> > > folders
> >> > > > respectively
> >> > >
> >> > > Yup, right now we don't have any attempt to install on standard system
> >> > > paths.
> >> > >
> >> > > > 3) If there are multiple ways to specify an option, an overriding
> >> rule
> >> > > > should be present and should not be error-prone.
> >> > >
> >> > > Yes, I think this should always be Configuration class in code >
> >> system
> >> > > properties > env vars. Over time we will deprecate the env vars and
> >> maybe
> >> > > even system properties.
> >> > >
> >> > > > 4) Currently the options are set and get using System property.
> >> It's hard
> >> > > > to manage and inconvenient for users. It's good to gather the
> >> options
> >> > > into
> >> > > > one file using format like xml or json.
> >> > >
> >> > > I think this is the main thing to do first -- pick one configuration
> >> class
> >> > > and change the code to use this.
> >> > >
> >> > > > Our rough proposal:
> >> > > >
> >> > > >   - Scripts
> >> > > >
> >> > > >   1. make an "sbin" folder containing all the scripts for
> >> administrators,
> >> > > >   specifically,
> >> > > >      - all service administration scripts, i.e. start-*, stop-*,
> >> > > >      slaves.sh, *-daemons, *-daemon scripts
> >> > > >      - low-level or internally used utility scripts, i.e.
> >> > > >      compute-classpath, spark-config, spark-class, spark-executor
> >> > > >   2. make a "bin" folder containing all the scripts for application
> >> > > >   developers/users, specifically,
> >> > > >      - user level app  running scripts, i.e. pyspark, spark-shell,
> >> and we
> >> > > >      propose to add a script "spark" for users to run applications
> >> (very
> >> > > much
> >> > > >      like spark-class but may add some more control or convenient
> >> > > utilities)
> >> > > >      - scripts for status checking, e.g. spark and hadoop version
> >> > > >      checking, running applications checking, etc. We can make this
> >> a
> >> > > separate
> >> > > >      script or add functionality to "spark" script.
> >> > > >   3. No wandering scripts outside the sbin and bin folders
> >> > >
> >> > > Makes sense.
> >> > >
> >> > > >   -  Configurations/Options and overriding rule
> >> > > >
> >> > > >   1. Define a Configuration class which contains all the options
> >> > > available
> >> > > >   for Spark application. A Configuration instance can be
> >> de-/serialized
> >> > > >   from/to a json formatted file.
> >> > > >   2. Each application (SparkContext) has one Configuration instance
> >> and
> >> > > it
> >> > > >   is initialized by the application which creates it (either read
> >> from
> >> > > file
> >> > > >   or passed from command line options or env SPARK_JAVA_OPTS).
> >> > > >   3. When launching an Executor on a node, the Configuration is
> >> firstly
> >> > > >   initialized using the node-local configuration file as default.
> >> The
> >> > > >   Configuration passed from application driver context will
> >> override any
> >> > > >   options specified in default.
> >> > >
> >> > > This sounds great to me! The one thing I'll add is that we might want
> >> to
> >> > > prevent applications from overriding certain settings on each node,
> >> such as
> >> > > work directories. The best way is to probably just ignore the app's
> >> version
> >> > > of those settings in the Executor.
> >> > >
> >> > > If you guys would like, feel free to write up this design on
> >> SPARK-544 and
> >> > > start working on it. I think it looks good.
> >> > >
> >> > > Matei
> >> >
> >> >
> >> >
> >> >
> >> > --
> >> > *Shane Huang *
> >> > *Intel Asia-Pacific R&D Ltd.*
> >> > *Email: shengsheng.huang@intel.com*
> >>
> >
> >
> >
> > --
> > *Shane Huang *
> > *Intel Asia-Pacific R&D Ltd.*
> > *Email: shengsheng.huang@intel.com*
> >
> >
> 
> 
> -- 
> *Shane Huang *
> *Intel Asia-Pacific R&D Ltd.*
> *Email: shengsheng.huang@intel.com*

Re: Propose to Re-organize the scripts and configurations

Posted by Shane Huang <sh...@gmail.com>.
I think it's good to have Bigtop package Spark. But in this track we're
just targeting enhancing the usability of Spark itself, without Bigtop.
After all, few of our customers use Bigtop.


On Wed, Sep 25, 2013 at 1:20 PM, Shane Huang <sh...@gmail.com>wrote:

> I think it's good to have Bigtop to package Spark. But I
>
>
> On Wed, Sep 25, 2013 at 1:16 PM, Konstantin Boudnik <co...@apache.org>wrote:
>
>> Late to the game, but... Bigtop is packaging Spark now as a part of the
>> standard distribution - our release 0.7.0 is around the corner. And we do
>> it
>> in the same way that has been done for Hadoop. Perhaps it worth looking
>> into...
>>
>> Lemme know if you have any questions,
>>   Cos
>>
>> On Sun, Sep 22, 2013 at 12:07PM, Shane Huang wrote:
>> > And I created a new issue SPARK-915 to track the re-org of scripts as
>> > SPARK-544 only talks about Config.
>> > https://spark-project.atlassian.net/browse/SPARK-915
>> >
>> >
>> > On Wed, Sep 18, 2013 at 1:42 AM, Matei Zaharia <matei.zaharia@gmail.com
>> >wrote:
>> >
>> > > Hi Shane,
>> > >
>> > > I agree with all these points. Improving the configuration system is
>> one
>> > > of the main things I'd like to have in the next release.
>> > >
>> > > > 1) Usually the application developers/users and platform
>> administrators
>> > > > belongs to two teams. So it's better to separate the scripts used by
>> > > > administrators and application users, e.g. put them in sbin and bin
>> > > folders
>> > > > respectively
>> > >
>> > > Yup, right now we don't have any attempt to install on standard system
>> > > paths.
>> > >
>> > > > 3) If there are multiple ways to specify an option, an overriding
>> rule
>> > > > should be present and should not be error-prone.
>> > >
>> > > Yes, I think this should always be Configuration class in code >
>> system
>> > > properties > env vars. Over time we will deprecate the env vars and
>> maybe
>> > > even system properties.
>> > >
>> > > > 4) Currently the options are set and get using System property.
>> It's hard
>> > > > to manage and inconvenient for users. It's good to gather the
>> options
>> > > into
>> > > > one file using format like xml or json.
>> > >
>> > > I think this is the main thing to do first -- pick one configuration
>> class
>> > > and change the code to use this.
>> > >
>> > > > Our rough proposal:
>> > > >
>> > > >   - Scripts
>> > > >
>> > > >   1. make an "sbin" folder containing all the scripts for
>> administrators,
>> > > >   specifically,
>> > > >      - all service administration scripts, i.e. start-*, stop-*,
>> > > >      slaves.sh, *-daemons, *-daemon scripts
>> > > >      - low-level or internally used utility scripts, i.e.
>> > > >      compute-classpath, spark-config, spark-class, spark-executor
>> > > >   2. make a "bin" folder containing all the scripts for application
>> > > >   developers/users, specifically,
>> > > >      - user level app  running scripts, i.e. pyspark, spark-shell,
>> and we
>> > > >      propose to add a script "spark" for users to run applications
>> (very
>> > > much
>> > > >      like spark-class but may add some more control or convenient
>> > > utilities)
>> > > >      - scripts for status checking, e.g. spark and hadoop version
>> > > >      checking, running applications checking, etc. We can make this
>> a
>> > > separate
>> > > >      script or add functionality to "spark" script.
>> > > >   3. No wandering scripts outside the sbin and bin folders
>> > >
>> > > Makes sense.
>> > >
>> > > >   -  Configurations/Options and overriding rule
>> > > >
>> > > >   1. Define a Configuration class which contains all the options
>> > > available
>> > > >   for Spark application. A Configuration instance can be
>> de-/serialized
>> > > >   from/to a json formatted file.
>> > > >   2. Each application (SparkContext) has one Configuration instance
>> and
>> > > it
>> > > >   is initialized by the application which creates it (either read
>> from
>> > > file
>> > > >   or passed from command line options or env SPARK_JAVA_OPTS).
>> > > >   3. When launching an Executor on a node, the Configuration is
>> firstly
>> > > >   initialized using the node-local configuration file as default.
>> The
>> > > >   Configuration passed from application driver context will
>> override any
>> > > >   options specified in default.
>> > >
>> > > This sounds great to me! The one thing I'll add is that we might want
>> to
>> > > prevent applications from overriding certain settings on each node,
>> such as
>> > > work directories. The best way is to probably just ignore the app's
>> version
>> > > of those settings in the Executor.
>> > >
>> > > If you guys would like, feel free to write up this design on
>> SPARK-544 and
>> > > start working on it. I think it looks good.
>> > >
>> > > Matei
>> >
>> >
>> >
>> >
>> > --
>> > *Shane Huang *
>> > *Intel Asia-Pacific R&D Ltd.*
>> > *Email: shengsheng.huang@intel.com*
>>
>
>
>
> --
> *Shane Huang *
> *Intel Asia-Pacific R&D Ltd.*
> *Email: shengsheng.huang@intel.com*
>
>


-- 
*Shane Huang *
*Intel Asia-Pacific R&D Ltd.*
*Email: shengsheng.huang@intel.com*

Re: Propose to Re-organize the scripts and configurations

Posted by Shane Huang <sh...@gmail.com>.
I think it's good to have Bigtop to package Spark. But I


On Wed, Sep 25, 2013 at 1:16 PM, Konstantin Boudnik <co...@apache.org> wrote:

> Late to the game, but... Bigtop is packaging Spark now as a part of the
> standard distribution - our release 0.7.0 is around the corner. And we do
> it
> in the same way that has been done for Hadoop. Perhaps it worth looking
> into...
>
> Lemme know if you have any questions,
>   Cos
>
> On Sun, Sep 22, 2013 at 12:07PM, Shane Huang wrote:
> > And I created a new issue SPARK-915 to track the re-org of scripts as
> > SPARK-544 only talks about Config.
> > https://spark-project.atlassian.net/browse/SPARK-915
> >
> >
> > On Wed, Sep 18, 2013 at 1:42 AM, Matei Zaharia <matei.zaharia@gmail.com
> >wrote:
> >
> > > Hi Shane,
> > >
> > > I agree with all these points. Improving the configuration system is
> one
> > > of the main things I'd like to have in the next release.
> > >
> > > > 1) Usually the application developers/users and platform
> administrators
> > > > belongs to two teams. So it's better to separate the scripts used by
> > > > administrators and application users, e.g. put them in sbin and bin
> > > folders
> > > > respectively
> > >
> > > Yup, right now we don't have any attempt to install on standard system
> > > paths.
> > >
> > > > 3) If there are multiple ways to specify an option, an overriding
> rule
> > > > should be present and should not be error-prone.
> > >
> > > Yes, I think this should always be Configuration class in code > system
> > > properties > env vars. Over time we will deprecate the env vars and
> maybe
> > > even system properties.
> > >
> > > > 4) Currently the options are set and get using System property. It's
> hard
> > > > to manage and inconvenient for users. It's good to gather the options
> > > into
> > > > one file using format like xml or json.
> > >
> > > I think this is the main thing to do first -- pick one configuration
> class
> > > and change the code to use this.
> > >
> > > > Our rough proposal:
> > > >
> > > >   - Scripts
> > > >
> > > >   1. make an "sbin" folder containing all the scripts for
> administrators,
> > > >   specifically,
> > > >      - all service administration scripts, i.e. start-*, stop-*,
> > > >      slaves.sh, *-daemons, *-daemon scripts
> > > >      - low-level or internally used utility scripts, i.e.
> > > >      compute-classpath, spark-config, spark-class, spark-executor
> > > >   2. make a "bin" folder containing all the scripts for application
> > > >   developers/users, specifically,
> > > >      - user level app  running scripts, i.e. pyspark, spark-shell,
> and we
> > > >      propose to add a script "spark" for users to run applications
> (very
> > > much
> > > >      like spark-class but may add some more control or convenient
> > > utilities)
> > > >      - scripts for status checking, e.g. spark and hadoop version
> > > >      checking, running applications checking, etc. We can make this a
> > > separate
> > > >      script or add functionality to "spark" script.
> > > >   3. No wandering scripts outside the sbin and bin folders
> > >
> > > Makes sense.
> > >
> > > >   -  Configurations/Options and overriding rule
> > > >
> > > >   1. Define a Configuration class which contains all the options
> > > available
> > > >   for Spark application. A Configuration instance can be
> de-/serialized
> > > >   from/to a json formatted file.
> > > >   2. Each application (SparkContext) has one Configuration instance
> and
> > > it
> > > >   is initialized by the application which creates it (either read
> from
> > > file
> > > >   or passed from command line options or env SPARK_JAVA_OPTS).
> > > >   3. When launching an Executor on a node, the Configuration is
> firstly
> > > >   initialized using the node-local configuration file as default. The
> > > >   Configuration passed from application driver context will override
> any
> > > >   options specified in default.
> > >
> > > This sounds great to me! The one thing I'll add is that we might want
> to
> > > prevent applications from overriding certain settings on each node,
> such as
> > > work directories. The best way is to probably just ignore the app's
> version
> > > of those settings in the Executor.
> > >
> > > If you guys would like, feel free to write up this design on SPARK-544
> and
> > > start working on it. I think it looks good.
> > >
> > > Matei
> >
> >
> >
> >
> > --
> > *Shane Huang *
> > *Intel Asia-Pacific R&D Ltd.*
> > *Email: shengsheng.huang@intel.com*
>



-- 
*Shane Huang *
*Intel Asia-Pacific R&D Ltd.*
*Email: shengsheng.huang@intel.com*

Re: Propose to Re-organize the scripts and configurations

Posted by Konstantin Boudnik <co...@apache.org>.
Late to the game, but... Bigtop is packaging Spark now as a part of the
standard distribution - our release 0.7.0 is around the corner. And we do it
in the same way that has been done for Hadoop. Perhaps it's worth looking
into...

Lemme know if you have any questions,
  Cos

On Sun, Sep 22, 2013 at 12:07PM, Shane Huang wrote:
> And I created a new issue SPARK-915 to track the re-org of scripts as
> SPARK-544 only talks about Config.
> https://spark-project.atlassian.net/browse/SPARK-915
> 
> 
> On Wed, Sep 18, 2013 at 1:42 AM, Matei Zaharia <ma...@gmail.com>wrote:
> 
> > Hi Shane,
> >
> > I agree with all these points. Improving the configuration system is one
> > of the main things I'd like to have in the next release.
> >
> > > 1) Usually the application developers/users and platform administrators
> > > belongs to two teams. So it's better to separate the scripts used by
> > > administrators and application users, e.g. put them in sbin and bin
> > folders
> > > respectively
> >
> > Yup, right now we don't have any attempt to install on standard system
> > paths.
> >
> > > 3) If there are multiple ways to specify an option, an overriding rule
> > > should be present and should not be error-prone.
> >
> > Yes, I think this should always be Configuration class in code > system
> > properties > env vars. Over time we will deprecate the env vars and maybe
> > even system properties.
> >
> > > 4) Currently the options are set and get using System property. It's hard
> > > to manage and inconvenient for users. It's good to gather the options
> > into
> > > one file using format like xml or json.
> >
> > I think this is the main thing to do first -- pick one configuration class
> > and change the code to use this.
> >
> > > Our rough proposal:
> > >
> > >   - Scripts
> > >
> > >   1. make an "sbin" folder containing all the scripts for administrators,
> > >   specifically,
> > >      - all service administration scripts, i.e. start-*, stop-*,
> > >      slaves.sh, *-daemons, *-daemon scripts
> > >      - low-level or internally used utility scripts, i.e.
> > >      compute-classpath, spark-config, spark-class, spark-executor
> > >   2. make a "bin" folder containing all the scripts for application
> > >   developers/users, specifically,
> > >      - user level app  running scripts, i.e. pyspark, spark-shell, and we
> > >      propose to add a script "spark" for users to run applications (very
> > much
> > >      like spark-class but may add some more control or convenient
> > utilities)
> > >      - scripts for status checking, e.g. spark and hadoop version
> > >      checking, running applications checking, etc. We can make this a
> > separate
> > >      script or add functionality to "spark" script.
> > >   3. No wandering scripts outside the sbin and bin folders
> >
> > Makes sense.
> >
> > >   -  Configurations/Options and overriding rule
> > >
> > >   1. Define a Configuration class which contains all the options
> > available
> > >   for Spark application. A Configuration instance can be de-/serialized
> > >   from/to a json formatted file.
> > >   2. Each application (SparkContext) has one Configuration instance and
> > it
> > >   is initialized by the application which creates it (either read from
> > file
> > >   or passed from command line options or env SPARK_JAVA_OPTS).
> > >   3. When launching an Executor on a node, the Configuration is firstly
> > >   initialized using the node-local configuration file as default. The
> > >   Configuration passed from application driver context will override any
> > >   options specified in default.
> >
> > This sounds great to me! The one thing I'll add is that we might want to
> > prevent applications from overriding certain settings on each node, such as
> > work directories. The best way is to probably just ignore the app's version
> > of those settings in the Executor.
> >
> > If you guys would like, feel free to write up this design on SPARK-544 and
> > start working on it. I think it looks good.
> >
> > Matei
> 
> 
> 
> 
> -- 
> *Shane Huang *
> *Intel Asia-Pacific R&D Ltd.*
> *Email: shengsheng.huang@intel.com*

Re: Propose to Re-organize the scripts and configurations

Posted by Shane Huang <sh...@gmail.com>.
And I created a new issue SPARK-915 to track the re-org of scripts as
SPARK-544 only talks about Config.
https://spark-project.atlassian.net/browse/SPARK-915


On Wed, Sep 18, 2013 at 1:42 AM, Matei Zaharia <ma...@gmail.com>wrote:

> Hi Shane,
>
> I agree with all these points. Improving the configuration system is one
> of the main things I'd like to have in the next release.
>
> > 1) Usually the application developers/users and platform administrators
> > belongs to two teams. So it's better to separate the scripts used by
> > administrators and application users, e.g. put them in sbin and bin
> folders
> > respectively
>
> Yup, right now we don't have any attempt to install on standard system
> paths.
>
> > 3) If there are multiple ways to specify an option, an overriding rule
> > should be present and should not be error-prone.
>
> Yes, I think this should always be Configuration class in code > system
> properties > env vars. Over time we will deprecate the env vars and maybe
> even system properties.
>
> > 4) Currently the options are set and get using System property. It's hard
> > to manage and inconvenient for users. It's good to gather the options
> into
> > one file using format like xml or json.
>
> I think this is the main thing to do first -- pick one configuration class
> and change the code to use this.
>
> > Our rough proposal:
> >
> >   - Scripts
> >
> >   1. make an "sbin" folder containing all the scripts for administrators,
> >   specifically,
> >      - all service administration scripts, i.e. start-*, stop-*,
> >      slaves.sh, *-daemons, *-daemon scripts
> >      - low-level or internally used utility scripts, i.e.
> >      compute-classpath, spark-config, spark-class, spark-executor
> >   2. make a "bin" folder containing all the scripts for application
> >   developers/users, specifically,
> >      - user level app  running scripts, i.e. pyspark, spark-shell, and we
> >      propose to add a script "spark" for users to run applications (very
> much
> >      like spark-class but may add some more control or convenient
> utilities)
> >      - scripts for status checking, e.g. spark and hadoop version
> >      checking, running applications checking, etc. We can make this a
> separate
> >      script or add functionality to "spark" script.
> >   3. No wandering scripts outside the sbin and bin folders
>
> Makes sense.
>
> >   -  Configurations/Options and overriding rule
> >
> >   1. Define a Configuration class which contains all the options
> available
> >   for Spark application. A Configuration instance can be de-/serialized
> >   from/to a json formatted file.
> >   2. Each application (SparkContext) has one Configuration instance and
> it
> >   is initialized by the application which creates it (either read from
> file
> >   or passed from command line options or env SPARK_JAVA_OPTS).
> >   3. When launching an Executor on a node, the Configuration is firstly
> >   initialized using the node-local configuration file as default. The
> >   Configuration passed from application driver context will override any
> >   options specified in default.
>
> This sounds great to me! The one thing I'll add is that we might want to
> prevent applications from overriding certain settings on each node, such as
> work directories. The best way is to probably just ignore the app's version
> of those settings in the Executor.
>
> If you guys would like, feel free to write up this design on SPARK-544 and
> start working on it. I think it looks good.
>
> Matei




-- 
*Shane Huang *
*Intel Asia-Pacific R&D Ltd.*
*Email: shengsheng.huang@intel.com*

Re: Propose to Re-organize the scripts and configurations

Posted by Matei Zaharia <ma...@gmail.com>.
Hi Shane,

I agree with all these points. Improving the configuration system is one of the main things I'd like to have in the next release.

> 1) Usually the application developers/users and platform administrators
> belongs to two teams. So it's better to separate the scripts used by
> administrators and application users, e.g. put them in sbin and bin folders
> respectively

Yup, right now we don't have any attempt to install on standard system paths.

> 3) If there are multiple ways to specify an option, an overriding rule
> should be present and should not be error-prone.

Yes, I think this should always be Configuration class in code > system properties > env vars. Over time we will deprecate the env vars and maybe even system properties.
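
Concretely the lookup would be something like the sketch below (the env var
naming is just a guess at a convention):

    // in-code Configuration wins, then -D system properties, then environment variables
    def lookup(key: String, inCode: Map[String, String]): Option[String] =
      inCode.get(key)
        .orElse(sys.props.get(key))
        .orElse(sys.env.get(key.toUpperCase.replace('.', '_')))

    // e.g. lookup("spark.local.dir", inCode) checks the in-code map,
    // then -Dspark.local.dir, then SPARK_LOCAL_DIR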

> 4) Currently the options are set and get using System property. It's hard
> to manage and inconvenient for users. It's good to gather the options into
> one file using format like xml or json.

I think this is the main thing to do first -- pick one configuration class and change the code to use this.

> Our rough proposal:
> 
>   - Scripts
> 
>   1. make an "sbin" folder containing all the scripts for administrators,
>   specifically,
>      - all service administration scripts, i.e. start-*, stop-*,
>      slaves.sh, *-daemons, *-daemon scripts
>      - low-level or internally used utility scripts, i.e.
>      compute-classpath, spark-config, spark-class, spark-executor
>   2. make a "bin" folder containing all the scripts for application
>   developers/users, specifically,
>      - user level app  running scripts, i.e. pyspark, spark-shell, and we
>      propose to add a script "spark" for users to run applications (very much
>      like spark-class but may add some more control or convenient utilities)
>      - scripts for status checking, e.g. spark and hadoop version
>      checking, running applications checking, etc. We can make this a separate
>      script or add functionality to "spark" script.
>   3. No wandering scripts outside the sbin and bin folders

Makes sense.

>   -  Configurations/Options and overriding rule
> 
>   1. Define a Configuration class which contains all the options available
>   for Spark application. A Configuration instance can be de-/serialized
>   from/to a json formatted file.
>   2. Each application (SparkContext) has one Configuration instance and it
>   is initialized by the application which creates it (either read from file
>   or passed from command line options or env SPARK_JAVA_OPTS).
>   3. When launching an Executor on a node, the Configuration is firstly
>   initialized using the node-local configuration file as default. The
>   Configuration passed from application driver context will override any
>   options specified in default.

This sounds great to me! The one thing I'll add is that we might want to prevent applications from overriding certain settings on each node, such as work directories. The best way is probably to just ignore the app's version of those settings in the Executor.
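
As a rough sketch of that merge on the Executor side (the protected keys here
are only examples):

    // node-local values win for administrator-controlled keys; the app's copies are dropped
    val protectedKeys = Set("spark.local.dir", "spark.worker.dir")

    def executorConfig(nodeLocal: Map[String, String],
                       fromDriver: Map[String, String]): Map[String, String] =
      nodeLocal ++ fromDriver.filter { case (k, _) => !protectedKeys.contains(k) }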

If you guys would like, feel free to write up this design on SPARK-544 and start working on it. I think it looks good.

Matei

Re: Propose to Re-organize the scripts and configurations

Posted by "shannie.huang" <sh...@gmail.com>.
Yeah, I tend to agree that, executable or not, these common utility scripts may not need to be exposed to end users in the sbin and bin folders. But it seems we must still make some of these scripts executable, as they are not only called from other scripts but also invoked from the Scala source code.
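
For instance, a launcher invoked from Scala code along these lines only works
if the script keeps its executable bit (the class name and path are
placeholders):

    import scala.sys.process._

    // spawn the helper script as a child process and wait for its exit code
    val exitCode = Seq("./spark-class", "some.main.Class", "arg1").!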


On 2013-9-17, at 8:35, Mike <sp...@good-with-numbers.com> wrote:

> Shane Huang wrote:
>> - low-level or internally used utility scripts, i.e. 
>> compute-classpath, spark-config, spark-class, spark-executor
> 
> I'd like to see script broken out into shell functions in a common file 
> that gets "."-included in every script, where that makes sense.  
> Specifically, I gather that compute-classpath.sh isn't run except as a 
> subroutine, so no need to promote it as an executable.

Re: Propose to Re-organize the scripts and configurations

Posted by Mike <sp...@good-with-numbers.com>.
Shane Huang wrote:
> - low-level or internally used utility scripts, i.e. 
> compute-classpath, spark-config, spark-class, spark-executor

I'd like to see these scripts broken out into shell functions in a common file 
that gets "."-included in every script, where that makes sense.  
Specifically, I gather that compute-classpath.sh isn't run except as a 
subroutine, so there's no need to promote it as an executable.