Posted to dev@spark.apache.org by Imran Rashid <ir...@cloudera.com> on 2015/04/02 18:57:06 UTC

Re: Spark config option 'expression language' feedback request

IMO, Spark's config is kind of a mess right now.  I completely agree with
Reynold that Spark's handling of config ought to be super-simple; it's not
the kind of thing we want to put much effort into within Spark itself.  It
sounds so trivial that everyone wants to redo it, but then all these
additional features start to get thrown in and it starts to get
complicated.  This is one of many reasons our config handling is
inadequate.  It would be better if we could outsource it to other
libraries, or, better yet, let users bring their own.

The biggest problem, in my mind, is that there isn't a definitive,
strongly-typed, modular listing of all the parameters.  This makes it
really hard to put your own thing on top -- you've got to manually go
through all the options and put them into your own config library, and
then make sure it's up-to-date with every new release of Spark.
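
Just to make that concrete, here's the rough shape of what I mean by a
definitive, typed listing -- all names here are made up for illustration,
not a concrete proposal:

// Hypothetical sketch: each option is declared exactly once, with its
// key, type, default, and doc, instead of string keys scattered around.
case class ConfEntry[T](key: String, default: T, doc: String)

object EventLogConf {
  val Enabled  = ConfEntry("spark.eventLog.enabled", false, "whether to log Spark events")
  val Compress = ConfEntry("spark.eventLog.compress", false, "whether to compress event logs")
  val Dir      = ConfEntry("spark.eventLog.dir", "/tmp/spark-events", "base directory for event logs")
}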

Just as a small example of how the options are hard to track down, some of
the options for event logging are listed in SparkContext:
https://github.com/apache/spark/blob/424e987dfebbbaa37f4496d44090d469a931ce76/core/src/main/scala/org/apache/spark/SparkContext.scala#L229

and some others are listed in EventLoggingListener:
https://github.com/apache/spark/blob/424e987dfebbbaa37f4496d44090d469a931ce76/core/src/main/scala/org/apache/spark/scheduler/EventLoggingListener.scala#L60

This also makes it a headache when developing and when trying to keep the
documentation up-to-date.
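
For anyone who doesn't want to click through, the pattern in those two
files looks roughly like this (paraphrased from memory, not the exact
code):

import org.apache.spark.SparkConf

// Two different classes each read their own string keys and defaults.
object ScatteredKeys {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
    // SparkContext reads this one...
    val isEventLogEnabled = conf.getBoolean("spark.eventLog.enabled", false)
    // ...while EventLoggingListener separately reads keys like this one.
    val shouldCompress = conf.getBoolean("spark.eventLog.compress", false)
    println(s"enabled=$isEventLogEnabled, compress=$shouldCompress")
  }
}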

There are a handful of different libraries that might help out with this:
scopt, argot, scallop, sumac.  I'm biased toward sumac (since I wrote it),
but probably any of these would let me do whatever customizations I wanted
on top, without needing to manually keep every option in sync.  That said,
I do think sumac is especially well suited to the way Spark uses
configuration -- the nested structure directly maps to the way we have
things organized currently.  So, e.g., everything related to event logging
would get placed in a class like:

class EventLoggingOpts {
  var enabled = false
  var compress = false
  var testing = false
  var overwrite = false
  var buffer: Bytes = 100.kilobytes
}


Another plus is that you get fail-fast behavior -- if you put in an
unparseable value, the job fails immediately, rather than an hour in when
the value is first accessed.
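
To illustrate the behavior I mean (this is just a sketch of fail-fast
parsing, not sumac's actual API):

object FailFastExample {
  // Parse a boolean config value, failing right away with the offending
  // key and raw value instead of deferring the error until first use.
  def parseBoolean(key: String, raw: String): Boolean =
    try raw.trim.toBoolean
    catch {
      case _: IllegalArgumentException =>
        throw new IllegalArgumentException(s"Bad value for $key: '$raw' is not a boolean")
    }

  def main(args: Array[String]): Unit = {
    val userConf = Map("enabled" -> "true", "compress" -> "flase") // typo on purpose
    val enabled  = parseBoolean("enabled", userConf("enabled"))
    val compress = parseBoolean("compress", userConf("compress")) // fails here, at startup
    println(s"enabled=$enabled, compress=$compress")
  }
}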

In any case, my main point is just that I think we should try to make our
config more compatible with external config tools, rather than trying to
build our own.  And after that, I'd just like to throw Sumac into the ring
as a contender :)
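
To sketch what "more compatible with external config tools" could look
like in practice: whichever library produces the typed values, the
integration point is just flattening them back into SparkConf's plain
string key/value pairs.  Something roughly like this (sketch only):

import org.apache.spark.SparkConf

// Sketch: a typed opts class (mirroring the EventLoggingOpts example
// above) flattened back into SparkConf's string key/value pairs.
class EventLogOpts {
  var enabled = false
  var compress = false
}

object BringYourOwnConfig {
  def toSparkConf(opts: EventLogOpts): SparkConf =
    new SparkConf()
      .set("spark.eventLog.enabled", opts.enabled.toString)
      .set("spark.eventLog.compress", opts.compress.toString)

  def main(args: Array[String]): Unit = {
    val opts = new EventLogOpts
    opts.enabled = true
    println(toSparkConf(opts).toDebugString)
  }
}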


On Fri, Mar 13, 2015 at 1:26 PM, Reynold Xin <rx...@databricks.com> wrote:

> This is an interesting idea.
>
> Are there well-known libraries for doing this? Config is the one place
> where it would be great to have something ridiculously simple, so that it
> is more or less bug free. I'm concerned about the complexity in this patch
> and the subtle bugs it might introduce into config options, for which
> users will have no workaround. Also, I believe it is fairly hard to
> propagate nice error messages when using Scala's parser combinators.
>
>
> On Fri, Mar 13, 2015 at 3:07 AM, Dale Richardson <da...@hotmail.com>
> wrote:
>
> >
> > PR#4937 (https://github.com/apache/spark/pull/4937) is a feature to
> > allow Spark configuration options (whether on the command line, in an
> > environment variable, or in a configuration file) to be specified via a
> > simple expression language.
> >
> >
> > Such a feature has the following end-user benefits:
> > - Allows for flexibility in specifying time intervals or byte
> > quantities in appropriate, easy-to-follow units, e.g. 1 week rather
> > than 604800 seconds
> >
> > - Allows for the scaling of a configuration option in relation to
> > system attributes, e.g.
> >
> > SPARK_WORKER_CORES = numCores - 1
> >
> > SPARK_WORKER_MEMORY = physicalMemoryBytes - 1.5 GB
> >
> > - Gives the ability to scale multiple configuration options together,
> > e.g.:
> >
> > spark.driver.memory = 0.75 * physicalMemoryBytes
> >
> > spark.driver.maxResultSize = spark.driver.memory * 0.8
> >
> >
> > The following functions are currently supported by this PR:
> > NumCores:             Number of cores assigned to the JVM (usually ==
> > Physical machine cores)
> > PhysicalMemoryBytes:  Memory size of hosting machine
> >
> > JVMTotalMemoryBytes:  Current bytes of memory allocated to the JVM
> >
> > JVMMaxMemoryBytes:    Maximum number of bytes of memory available to the
> > JVM
> >
> > JVMFreeMemoryBytes:   maxMemoryBytes - totalMemoryBytes
> >
> >
> > I was wondering if anybody on the mailing list has any further ideas on
> > other functions that could be useful to have when specifying Spark
> > configuration options?
> >
> > Regards,
> > Dale.
> >
>