Posted to dev@spark.apache.org by Devl Devel <de...@gmail.com> on 2015/02/25 23:13:09 UTC

Some praise and comments on Spark

Hi Spark Developers,

First, apologies if this doesn't belong on this list, but the
comments and praise are relevant to all developers. This is just a small
note about what we really like about Spark; we don't mean to start a
long discussion thread in this forum, just to share our positive
experiences with Spark so far.

To start, as you can tell, we think the Spark project is amazing and
we love it! Having put nearly half a decade's worth of sweat and tears
into production Hadoop/MapReduce clusters and application development, it's
refreshing to see something arguably simpler and more elegant
supersede it.

These are the things we love about Spark and hope these principles continue:

- The one-command build, make-distribution.sh: simple, clean, and ideal for
deployment, devops, and rebuilding on different environments and nodes.
- Not having too much runtime and deploy config: as admins and developers we
are sick of setting properties like io.sort and mapred.job.shuffle.merge.percent,
DFS file locations, temp directories, and so on, again and again,
every time we deploy a job, stand up a new cluster or environment, or even
change company.
- A fully built-in stack: one project covering SQL, DataFrames, MLlib, etc.,
so there is no need to bolt on extra projects such as Hive, Hue, or HBase.
This keeps everything in one place.
- Single (global) user-based operation: no creation of dedicated hdfs or
mapred Unix users, which makes life much simpler.
- Simple quick-start daemons, just the master and slaves: not having to worry
about the JT, NN, DN, TT, RM, HBase master, and so on, or running netstat and
jps across hundreds of nodes, makes life much easier.
- Proper code versioning, feature releases, and release management.
- Good, well-organised documentation with good examples.
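To make the first and fifth points concrete, the whole build-and-deploy cycle fits in a handful of commands. This is only a sketch: the exact profile flags and script arguments vary by Spark version and Hadoop distribution, and the hostname is a placeholder.

```shell
# Build a deployable distribution tarball in one command
# (the Hadoop/YARN profile flags here are illustrative)
./make-distribution.sh --tgz -Phadoop-2.4 -Pyarn

# Start the standalone cluster: one master, then one worker per node,
# pointed at the master's URL (master-host is a placeholder)
./sbin/start-master.sh
./sbin/start-slave.sh spark://master-host:7077
```

Compare that with the multi-daemon, multi-config-file dance a Hadoop cluster requires.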

In addition to the comments above this is where we hope Spark never ends
up:

- Tonnes of configuration properties and "go faster" type flags. Hadoop and
HBase users will know there is a whole catalogue of properties for regions,
caches, network settings, block sizes, etc. Please don't end up like
https://hadoop.apache.org/docs/r1.0.4/mapred-default.html, for example; it is
painful having to configure all of this, then create a set of properties for
each environment and tie it all into CI and deployment tools.
- More daemons and processes to have to monitor, manipulate, and restart when
they crash.
- A project that penalises the very developers who will ultimately promote
Spark to their managers and budget holders, with expensive training,
certification, books, and accreditation. Ideally this open source project
should stay free: free training = more users = more commercial uptake.

Anyway, those are our thoughts, for what they are worth. Keep up the good
work; we just had to mention it. Again, sorry if this is not the right place
or if there is another forum for this kind of feedback.

Cheers

Re: Some praise and comments on Spark

Posted by Nicholas Chammas <ni...@gmail.com>.
Thanks for sharing the feedback about what works well for you!

It's nice to get that; as we all probably know, people generally reach out
only when they have problems.


Re: Some praise and comments on Spark

Posted by Reynold Xin <rx...@databricks.com>.
Thanks for the email and encouragement, Devl. Responses to the 3 requests:

-tonnes of configuration properties and "go faster" type flags. Hadoop and
HBase users will know there is a whole catalogue of properties for regions,
caches, network settings, block sizes, etc. Please don't end up like
https://hadoop.apache.org/docs/r1.0.4/mapred-default.html; it is painful
having to configure all of this, then create a set of properties for each
environment and tie it all into CI and deployment tools.

As the project grows, introducing more config options is unavoidable; in
particular, we often use config options to gate new modules that are still
experimental before making them the default (e.g. sort-based shuffle).

The philosophy here is to set a very high bar for introducing new config
options, make the default values sensible for most deployments, and,
whenever possible, figure out the right setting automatically. That is hard
in general, but we expect that 99% of users will only ever need to know a
very small number of options (e.g. setting the serializer).
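In practice, a typical deployment's conf/spark-defaults.conf can stay this small. The property names below are real Spark options; the values and hostname are purely illustrative, not recommendations.

```properties
# conf/spark-defaults.conf -- a minimal, illustrative example
spark.master            spark://master-host:7077
spark.serializer        org.apache.spark.serializer.KryoSerializer
spark.executor.memory   4g
```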


-more daemons and processes to have to monitor, manipulate, and restart when
they crash.

At the very least you'd need the cluster manager itself to be a daemon
process, because we can't defy the laws of physics. But I don't think we
want to introduce anything beyond that.


-a project that penalises the very developers who will ultimately promote
Spark to their managers and budget holders, with expensive training,
certification, books, and accreditation. Ideally this open source project
should stay free: free training = more users = more commercial uptake.

I definitely agree with you on making it easier to learn Spark. We are
making a lot of materials freely available, including two free MOOCs on edX:
https://databricks.com/blog/2014/12/02/announcing-two-spark-based-moocs.html

