Posted to dev@spark.apache.org by Reynold Xin <rx...@databricks.com> on 2015/11/11 00:10:55 UTC

A proposal for Spark 2.0

I’m starting a new thread since the other one got intermixed with feature
requests. Please refrain from making feature requests in this thread. Not
that we shouldn’t be adding features, but we can always add features in
1.7, 2.1, 2.2, ...

First - I want to propose a premise for how to think about Spark 2.0 and
major releases in Spark, based on discussion with several members of the
community: a major release should be low overhead and minimally disruptive
to the Spark community. A major release should not be very different from a
minor release and should not be gated based on new features. The main
purpose of a major release is an opportunity to fix things that are broken
in the current API and remove certain deprecated APIs (examples follow).

For this reason, I would *not* propose doing major releases to break
substantial APIs or perform large re-architecting that prevents users from
upgrading. Spark has always had a culture of evolving architecture
incrementally and making changes - and I don't think we want to change this
model. In fact, we’ve released many architectural changes on the 1.X line.

If the community likes the above model, then to me it seems reasonable to
do Spark 2.0 either after Spark 1.6 (in lieu of Spark 1.7) or immediately
after Spark 1.7. That would be 18 or 21 months after Spark 1.0. A cadence of
major releases every 2 years seems doable within the above model.

Under this model, here is a list of example things I would propose doing in
Spark 2.0, separated into APIs and Operation/Deployment:


APIs

1. Remove interfaces, configs, and modules (e.g. Bagel) deprecated in Spark
1.x.

2. Remove Akka from Spark’s API dependency (in streaming), so user
applications can use Akka (SPARK-5293). We have gotten a lot of complaints
about user applications being unable to use Akka due to Spark’s dependency
on Akka.

3. Remove Guava from Spark’s public API (JavaRDD Optional); a sketch of what
a Spark-owned replacement type could look like follows this list.

4. Better class/package structure for low-level developer APIs. In
particular, we have accumulated a number of DeveloperApi classes (mostly
various listener-related classes) over the years. Some packages include only
one or two public classes but a lot of private ones. A better structure would
isolate public classes into a few public packages, and those public packages
should contain as few private classes as possible for the low-level developer
APIs.

5. Consolidate the task metric and accumulator APIs. Although they have some
subtle differences, the two are very similar but currently have completely
different code paths; a rough sketch of an accumulator-backed metric also
follows this list.

6. Possibly making Catalyst, Dataset, and DataFrame more general by moving
them to other package(s). They are already used beyond SQL, e.g. in ML
pipelines, and will be used by streaming also.
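
To make items 3 and 5 a bit more concrete, here are two minimal sketches.
First, for item 3, one possible shape of a small Spark-owned Optional that
Java-facing methods could return instead of com.google.common.base.Optional;
the class name, methods, and placement below are purely illustrative, not a
committed design:

    // Illustrative only: a tiny Spark-owned Optional so that Guava no longer
    // appears in public Java API signatures. All names here are hypothetical.
    abstract class SparkOptional[T] extends Serializable {
      def isPresent: Boolean
      def get: T
      def orElse(default: T): T = if (isPresent) get else default
    }

    object SparkOptional {
      def of[T](value: T): SparkOptional[T] = {
        require(value != null, "value must not be null")
        new SparkOptional[T] { def isPresent = true; def get = value }
      }
      def absent[T](): SparkOptional[T] = new SparkOptional[T] {
        def isPresent = false
        def get = throw new NoSuchElementException("value is absent")
      }
    }

Second, for item 5, a rough sketch of the direction: if each task metric were
backed by the same accumulator machinery users already have, metrics and user
accumulators would travel from executors to the driver over one code path
instead of two. The snippet uses only the existing 1.x accumulator API; the
consolidation itself is the hypothetical part.

    import org.apache.spark.{SparkConf, SparkContext}

    object MetricAsAccumulatorSketch {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(
          new SparkConf().setMaster("local[2]").setAppName("metric-as-accumulator"))

        // A named accumulator: the same primitive a consolidated task metric
        // could be built on.
        val recordsRead = sc.accumulator(0L, "recordsRead")

        sc.parallelize(1 to 1000, 4).foreach { _ =>
          // Today a counter like this is tracked once in TaskMetrics and again in
          // any user accumulator; a single accumulator-backed path would cover both.
          recordsRead += 1L
        }
        println(s"records read = ${recordsRead.value}")
        sc.stop()
      }
    }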


Operation/Deployment

1. Scala 2.11 as the default build. We should still support Scala 2.10, but
it has been end-of-life.

2. Remove Hadoop 1 support.

3. Assembly-free distribution of Spark: don’t require building an enormous
assembly jar in order to run Spark.

Re: A proposal for Spark 2.0

Posted by Koert Kuipers <ko...@tresata.com>.
good point about dropping <2.2 for hadoop. you don't want to deal with
protobuf 2.4 for example


On Wed, Nov 11, 2015 at 4:58 AM, Sean Owen <so...@cloudera.com> wrote:

> On Wed, Nov 11, 2015 at 12:10 AM, Reynold Xin <rx...@databricks.com> wrote:
> > to the Spark community. A major release should not be very different
> from a
> > minor release and should not be gated based on new features. The main
> > purpose of a major release is an opportunity to fix things that are
> broken
> > in the current API and remove certain deprecated APIs (examples follow).
>
> Agree with this stance. Generally, a major release might also be a
> time to replace some big old API or implementation with a new one, but
> I don't see obvious candidates.
>
> I wouldn't mind turning attention to 2.x sooner than later, unless
> there's a fairly good reason to continue adding features in 1.x to a
> 1.7 release. The scope as of 1.6 is already pretty darned big.
>
>
> > 1. Scala 2.11 as the default build. We should still support Scala 2.10,
> but
> > it has been end-of-life.
>
> By the time 2.x rolls around, 2.12 will be the main version, 2.11 will
> be quite stable, and 2.10 will have been EOL for a while. I'd propose
> dropping 2.10. Otherwise it's supported for 2 more years.
>
>
> > 2. Remove Hadoop 1 support.
>
> I'd go further to drop support for <2.2 for sure (2.0 and 2.1 were
> sort of 'alpha' and 'beta' releases) and even <2.6.
>
> I'm sure we'll think of a number of other small things -- shading a
> bunch of stuff? reviewing and updating dependencies in light of
> simpler, more recent dependencies to support from Hadoop etc?
>
> Farming out Tachyon to a module? (I felt like someone proposed this?)
> Pop out any Docker stuff to another repo?
> Continue that same effort for EC2?
> Farming out some of the "external" integrations to another repo (?
> controversial)
>
> See also anything marked version "2+" in JIRA.
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: dev-unsubscribe@spark.apache.org
> For additional commands, e-mail: dev-help@spark.apache.org
>
>

Re: RE: A proposal for Spark 2.0

Posted by Guoqiang Li <wi...@qq.com>.
Yes, I agree with  Nan Zhu. I recommend these projects:
https://github.com/dmlc/ps-lite (Apache License 2)
https://github.com/Microsoft/multiverso (MIT License)


Alexander, you may also be interested in this demo (graph computation on a parameter server):


https://github.com/witgo/zen/tree/ps_graphx/graphx/src/main/scala/com/github/cloudml/zen/graphx


------------------ Original ------------------
From: "Ulanov, Alexander" <al...@hpe.com>
Date: Fri, Nov 13, 2015 01:44 AM
To: "Nan Zhu" <zh...@gmail.com>; "Guoqiang Li" <wi...@qq.com>
Cc: "dev@spark.apache.org" <de...@spark.apache.org>; "Reynold Xin" <rx...@databricks.com>
Subject: RE: A proposal for Spark 2.0

Parameter Server is a new feature and thus does not match the goal of 2.0, which is “to fix things that are broken in the current API and remove certain deprecated APIs”. At the same time, I would be happy to have that feature.

With regards to machine learning, it would be great to move useful features from MLlib to ML and deprecate the former. The current structure of two separate machine learning packages seems somewhat confusing.

With regards to GraphX, it would be great to deprecate the use of RDD in GraphX and switch to DataFrame. This will allow GraphX to evolve with Tungsten.

Best regards, Alexander

From: Nan Zhu [mailto:zhunanmcgill@gmail.com]
Sent: Thursday, November 12, 2015 7:28 AM
To: witgo@qq.com
Cc: dev@spark.apache.org
Subject: Re: A proposal for Spark 2.0

Being specific to Parameter Server, I think the current agreement is that PS shall exist as a third-party library instead of a component of the core code base, isn't it?

Best,

--
Nan Zhu
http://codingcat.me

On Thursday, November 12, 2015 at 9:49 AM, witgo@qq.com wrote:

Who has ideas about machine learning? Spark is missing some features for machine learning, for example the parameter server.

On Nov 12, 2015, at 05:32, Matei Zaharia <ma...@gmail.com> wrote:

I like the idea of popping out Tachyon to an optional component too to reduce the number of dependencies. In the future, it might even be useful to do this for Hadoop, but it requires too many API changes to be worth doing now.

Regarding Scala 2.12, we should definitely support it eventually, but I don't think we need to block 2.0 on that because it can be added later too. Has anyone investigated what it would take to run on there? I imagine we don't need many code changes, just maybe some REPL stuff.

Needless to say, but I'm all for the idea of making "major" releases as undisruptive as possible in the model Reynold proposed. Keeping everyone working with the same set of releases is super important.

Matei

On Nov 11, 2015, at 4:58 AM, Sean Owen <so...@cloudera.com> wrote:

On Wed, Nov 11, 2015 at 12:10 AM, Reynold Xin <rx...@databricks.com> wrote:

to the Spark community. A major release should not be very different from a
minor release and should not be gated based on new features. The main
purpose of a major release is an opportunity to fix things that are broken
in the current API and remove certain deprecated APIs (examples follow).

Agree with this stance. Generally, a major release might also be a
time to replace some big old API or implementation with a new one, but
I don't see obvious candidates.

I wouldn't mind turning attention to 2.x sooner than later, unless
there's a fairly good reason to continue adding features in 1.x to a
1.7 release. The scope as of 1.6 is already pretty darned big.

1. Scala 2.11 as the default build. We should still support Scala 2.10, but
it has been end-of-life.

By the time 2.x rolls around, 2.12 will be the main version, 2.11 will
be quite stable, and 2.10 will have been EOL for a while. I'd propose
dropping 2.10. Otherwise it's supported for 2 more years.

2. Remove Hadoop 1 support.

I'd go further to drop support for <2.2 for sure (2.0 and 2.1 were
sort of 'alpha' and 'beta' releases) and even <2.6.

I'm sure we'll think of a number of other small things -- shading a
bunch of stuff? reviewing and updating dependencies in light of
simpler, more recent dependencies to support from Hadoop etc?

Farming out Tachyon to a module? (I felt like someone proposed this?)
Pop out any Docker stuff to another repo?
Continue that same effort for EC2?
Farming out some of the "external" integrations to another repo (? controversial)

See also anything marked version "2+" in JIRA.
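
The MLlib/ML split Alexander mentions is easy to see in code: the same
algorithm lives in two packages with two different entry points, one
RDD-based and one DataFrame-based. A minimal sketch against the 1.x APIs
(local master and toy data are just for illustration):

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.sql.SQLContext
    import org.apache.spark.mllib.classification.LogisticRegressionWithLBFGS
    import org.apache.spark.mllib.linalg.Vectors
    import org.apache.spark.mllib.regression.LabeledPoint
    import org.apache.spark.ml.classification.LogisticRegression

    object TwoMLPackages {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(
          new SparkConf().setMaster("local[2]").setAppName("two-ml-apis"))
        val sqlContext = new SQLContext(sc)
        import sqlContext.implicits._

        val points = sc.parallelize(Seq(
          LabeledPoint(0.0, Vectors.dense(0.0, 1.0)),
          LabeledPoint(1.0, Vectors.dense(1.0, 0.0)),
          LabeledPoint(0.0, Vectors.dense(0.5, 1.5)),
          LabeledPoint(1.0, Vectors.dense(1.5, 0.5))))

        // Older RDD-based API: org.apache.spark.mllib
        val mllibModel = new LogisticRegressionWithLBFGS().setNumClasses(2).run(points)

        // Newer DataFrame/Pipeline-based API: org.apache.spark.ml
        val mlModel = new LogisticRegression().setMaxIter(10).fit(points.toDF())

        println(s"mllib intercept = ${mllibModel.intercept}, ml intercept = ${mlModel.intercept}")
        sc.stop()
      }
    }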
 
  
 
 
  
---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@spark.apache.org
For additional commands, e-mail: dev-help@spark.apache.org

Re: A proposal for Spark 2.0

Posted by Steve Loughran <st...@hortonworks.com>.
Producing new x.0 releases of open source projects is a recurrent problem: too radical a change means the old version keeps getting updated anyway (Python 3), and an incompatible version stalls take-up (for example, Log4j 2 dropping support for log4j.properties files).

Similarly, any radical new feature tends to push release times out longer than you think (Hadoop 2).

I think the lessons I'd draw from those and others are: keep an x.0 version as compatible as possible so that everyone can move, and ship fast. You want to be able to retire the 1.x line.

And how to ship fast? Keep those features down.

For anyone planning anything radical, a branch with a clear plan/schedule to be merged in is probably the best strategy. I actually think the Firefox process is the best here, and that it should have been adopted more in Hadoop; ongoing work is going on in branches for some things (erasure coding, IPv6), but there's still pressure to define the release schedule on feature completeness.

https://wiki.mozilla.org/Release_Management/Release_Process

See also JDD's article on evolution vs. revolution in OSS; it's 15 years old but still valid. At the time, the Jakarta project was the equivalent of the ASF Hadoop/big data stack, and indeed its traces run through the code and the build & test process if you know what to look for.

http://incubator.apache.org/learn/rules-for-revolutionaries.html



-Steve

Re: A proposal for Spark 2.0

Posted by Mridul Muralidharan <mr...@gmail.com>.
There was a proposal to make schedulers pluggable in the context of adding one
that leverages Apache Tez: IIRC it was abandoned, but the JIRA might be a good
starting point.

Regards
Mridul
On Dec 3, 2015 2:59 PM, "Rad Gruchalski" <ra...@gruchalski.com> wrote:

> There was a talk in this thread about removing the fine-grained Mesos
> scheduler. I think it would be a loss to lose it completely; however, I
> understand that it might be a burden to keep it under development for Mesos
> only.
> Having been thinking about it for a while, it would be great if the
> schedulers were pluggable. If Spark 2 could offer a way of registering a
> scheduling mechanism then the Mesos fine-grained scheduler could be moved
> to a separate project and, possibly, maintained by a separate community.
> This would also enable people to add more schedulers in the future -
> Kubernetes comes into mind but also Docker Swarm would become an option.
> This would allow growing the ecosystem a bit.
>
> I’d be very interested in working on such a feature.
>
> Kind regards,
> Radek Gruchalski
> radek@gruchalski.com <ra...@gruchalski.com>
> de.linkedin.com/in/radgruchalski/
>
>
> *Confidentiality:*This communication is intended for the above-named
> person and may be confidential and/or legally privileged.
> If it has come to you in error you must take no action based on it, nor
> must you copy or show it to anyone; please delete/destroy and inform the
> sender immediately.
>
> On Thursday, 3 December 2015 at 21:28, Koert Kuipers wrote:
>
> spark 1.x has been supporting scala 2.11 for 3 or 4 releases now. seems to
> me you already provide a clear upgrade path: get on scala 2.11 before
> upgrading to spark 2.x
>
> from scala team when scala 2.10.6 came out:
> We strongly encourage you to upgrade to the latest stable version of Scala
> 2.11.x, as the 2.10.x series is no longer actively maintained.
>
>
>
>
>
> On Thu, Dec 3, 2015 at 1:03 PM, Mark Hamstra <ma...@clearstorydata.com>
> wrote:
>
> Reynold's post from Nov. 25:
>
> I don't think we should drop support for Scala 2.10, or make it harder in
> terms of operations for people to upgrade.
>
> If there are further objections, I'm going to bump remove the 1.7 version
> and retarget things to 2.0 on JIRA.
>
>
> On Thu, Dec 3, 2015 at 12:47 AM, Sean Owen <so...@cloudera.com> wrote:
>
> Reynold, did you (or someone else) delete version 1.7.0 in JIRA? I
> think that's premature. If there's a 1.7.0 then we've lost info about
> what it would contain. It's trivial at any later point to merge the
> versions. And, since things change and there's not a pressing need to
> decide one way or the other, it seems fine to at least collect this
> info like we have things like "1.4.3" that may never be released. I'd
> like to add it back?
>
> On Thu, Nov 26, 2015 at 9:45 AM, Sean Owen <so...@cloudera.com> wrote:
> > Maintaining both a 1.7 and 2.0 is too much work for the project, which
> > is over-stretched now. This means that after 1.6 it's just small
> > maintenance releases in 1.x and no substantial features or evolution.
> > This means that the "in progress" APIs in 1.x that will stay that way,
> > unless one updates to 2.x. It's not unreasonable, but means the update
> > to the 2.x line isn't going to be that optional for users.
> >
> > Scala 2.10 is already EOL right? Supporting it in 2.x means supporting
> > it for a couple years, note. 2.10 is still used today, but that's the
> > point of the current stable 1.x release in general: if you want to
> > stick to current dependencies, stick to the current release. Although
> > I think that's the right way to think about support across major
> > versions in general, I can see that 2.x is more of a required update
> > for those following the project's fixes and releases. Hence may indeed
> > be important to just keep supporting 2.10.
> >
> > I can't see supporting 2.12 at the same time (right?). Is that a
> > concern? it will be long since GA by the time 2.x is first released.
> >
> > There's another fairly coherent worldview where development continues
> > in 1.7 and focuses on finishing the loose ends and lots of bug fixing.
> > 2.0 is delayed somewhat into next year, and by that time supporting
> > 2.11+2.12 and Java 8 looks more feasible and more in tune with
> > currently deployed versions.
> >
> > I can't say I have a strong view but I personally hadn't imagined 2.x
> > would start now.
> >
> >
> > On Thu, Nov 26, 2015 at 7:00 AM, Reynold Xin <rx...@databricks.com>
> wrote:
> >> I don't think we should drop support for Scala 2.10, or make it harder
> in
> >> terms of operations for people to upgrade.
> >>
> >> If there are further objections, I'm going to bump remove the 1.7
> version
> >> and retarget things to 2.0 on JIRA.
> >>
> >>
> >> On Wed, Nov 25, 2015 at 12:54 AM, Sandy Ryza <sa...@cloudera.com>
> >> wrote:
> >>>
> >>> I see.  My concern is / was that cluster operators will be reluctant to
> >>> upgrade to 2.0, meaning that developers using those clusters need to
> stay on
> >>> 1.x, and, if they want to move to DataFrames, essentially need to port
> their
> >>> app twice.
> >>>
> >>> I misunderstood and thought part of the proposal was to drop support
> for
> >>> 2.10 though.  If your broad point is that there aren't changes in 2.0
> that
> >>> will make it less palatable to cluster administrators than releases in
> the
> >>> 1.x line, then yes, 2.0 as the next release sounds fine to me.
> >>>
> >>> -Sandy
> >>>
> >>>
> >>> On Tue, Nov 24, 2015 at 11:55 AM, Matei Zaharia <
> matei.zaharia@gmail.com>
> >>> wrote:
> >>>>
> >>>> What are the other breaking changes in 2.0 though? Note that we're not
> >>>> removing Scala 2.10, we're just making the default build be against
> Scala
> >>>> 2.11 instead of 2.10. There seem to be very few changes that people
> would
> >>>> worry about. If people are going to update their apps, I think it's
> better
> >>>> to make the other small changes in 2.0 at the same time than to
> update once
> >>>> for Dataset and another time for 2.0.
> >>>>
> >>>> BTW just refer to Reynold's original post for the other proposed API
> >>>> changes.
> >>>>
> >>>> Matei
> >>>>
> >>>> On Nov 24, 2015, at 12:27 PM, Sandy Ryza <sa...@cloudera.com>
> wrote:
> >>>>
> >>>> I think that Kostas' logic still holds.  The majority of Spark users,
> and
> >>>> likely an even vaster majority of people running vaster jobs, are
> still on
> >>>> RDDs and on the cusp of upgrading to DataFrames.  Users will probably
> want
> >>>> to upgrade to the stable version of the Dataset / DataFrame API so
> they
> >>>> don't need to do so twice.  Requiring that they absorb all the other
> ways
> >>>> that Spark breaks compatibility in the move to 2.0 makes it much more
> >>>> difficult for them to make this transition.
> >>>>
> >>>> Using the same set of APIs also means that it will be easier to
> backport
> >>>> critical fixes to the 1.x line.
> >>>>
> >>>> It's not clear to me that avoiding breakage of an experimental API in
> the
> >>>> 1.x line outweighs these issues.
> >>>>
> >>>> -Sandy
> >>>>
> >>>> On Mon, Nov 23, 2015 at 10:51 PM, Reynold Xin <rx...@databricks.com>
> >>>> wrote:
> >>>>>
> >>>>> I actually think the next one (after 1.6) should be Spark 2.0. The
> >>>>> reason is that I already know we have to break some part of the
> >>>>> DataFrame/Dataset API as part of the Dataset design. (e.g.
> DataFrame.map
> >>>>> should return Dataset rather than RDD). In that case, I'd rather
> break this
> >>>>> sooner (in one release) than later (in two releases). so the damage
> is
> >>>>> smaller.
> >>>>>
> >>>>> I don't think whether we call Dataset/DataFrame experimental or not
> >>>>> matters too much for 2.0. We can still call Dataset experimental in
> 2.0 and
> >>>>> then mark them as stable in 2.1. Despite being "experimental", there
> has
> >>>>> been no breaking changes to DataFrame from 1.3 to 1.6.
> >>>>>
> >>>>>
> >>>>>
> >>>>> On Wed, Nov 18, 2015 at 3:43 PM, Mark Hamstra <
> mark@clearstorydata.com>
> >>>>> wrote:
> >>>>>>
> >>>>>> Ah, got it; by "stabilize" you meant changing the API, not just bug
> >>>>>> fixing.  We're on the same page now.
> >>>>>>
> >>>>>> On Wed, Nov 18, 2015 at 3:39 PM, Kostas Sakellis <
> kostas@cloudera.com>
> >>>>>> wrote:
> >>>>>>>
> >>>>>>> A 1.6.x release will only fix bugs - we typically don't change
> APIs in
> >>>>>>> z releases. The Dataset API is experimental and so we might be
> changing the
> >>>>>>> APIs before we declare it stable. This is why I think it is
> important to
> >>>>>>> first stabilize the Dataset API with a Spark 1.7 release before
> moving to
> >>>>>>> Spark 2.0. This will benefit users that would like to use the new
> Dataset
> >>>>>>> APIs but can't move to Spark 2.0 because of the backwards
> incompatible
> >>>>>>> changes, like removal of deprecated APIs, Scala 2.11 etc.
> >>>>>>>
> >>>>>>> Kostas
> >>>>>>>
> >>>>>>>
> >>>>>>> On Fri, Nov 13, 2015 at 12:26 PM, Mark Hamstra
> >>>>>>> <ma...@clearstorydata.com> wrote:
> >>>>>>>>
> >>>>>>>> Why does stabilization of those two features require a 1.7 release
> >>>>>>>> instead of 1.6.1?
> >>>>>>>>
> >>>>>>>> On Fri, Nov 13, 2015 at 11:40 AM, Kostas Sakellis
> >>>>>>>> <ko...@cloudera.com> wrote:
> >>>>>>>>>
> >>>>>>>>> We have veered off the topic of Spark 2.0 a little bit here -
> yes we
> >>>>>>>>> can talk about RDD vs. DS/DF more but lets refocus on Spark 2.0.
> I'd like to
> >>>>>>>>> propose we have one more 1.x release after Spark 1.6. This will
> allow us to
> >>>>>>>>> stabilize a few of the new features that were added in 1.6:
> >>>>>>>>>
> >>>>>>>>> 1) the experimental Datasets API
> >>>>>>>>> 2) the new unified memory manager.
> >>>>>>>>>
> >>>>>>>>> I understand our goal for Spark 2.0 is to offer an easy
> transition
> >>>>>>>>> but there will be users that won't be able to seamlessly upgrade
> given what
> >>>>>>>>> we have discussed as in scope for 2.0. For these users, having a
> 1.x release
> >>>>>>>>> with these new features/APIs stabilized will be very beneficial.
> This might
> >>>>>>>>> make Spark 1.7 a lighter release but that is not necessarily a
> bad thing.
> >>>>>>>>>
> >>>>>>>>> Any thoughts on this timeline?
> >>>>>>>>>
> >>>>>>>>> Kostas Sakellis
> >>>>>>>>>
> >>>>>>>>>
> >>>>>>>>>
> >>>>>>>>> On Thu, Nov 12, 2015 at 8:39 PM, Cheng, Hao <hao.cheng@intel.com
> >
> >>>>>>>>> wrote:
> >>>>>>>>>>
> >>>>>>>>>> Agree, more features/apis/optimization need to be added in
> DF/DS.
> >>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>> I mean, we need to think about what kind of RDD APIs we have to
> >>>>>>>>>> provide to developer, maybe the fundamental API is enough,
> like, the
> >>>>>>>>>> ShuffledRDD etc..  But PairRDDFunctions probably not in this
> category, as we
> >>>>>>>>>> can do the same thing easily with DF/DS, even better
> performance.
> >>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>> From: Mark Hamstra [mailto:mark@clearstorydata.com]
> >>>>>>>>>> Sent: Friday, November 13, 2015 11:23 AM
> >>>>>>>>>> To: Stephen Boesch
> >>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>> Cc: dev@spark.apache.org
> >>>>>>>>>> Subject: Re: A proposal for Spark 2.0
> >>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>> Hmmm... to me, that seems like precisely the kind of thing that
> >>>>>>>>>> argues for retaining the RDD API but not as the first thing
> presented to new
> >>>>>>>>>> Spark developers: "Here's how to use groupBy with
> DataFrames.... Until the
> >>>>>>>>>> optimizer is more fully developed, that won't always get you
> the best
> >>>>>>>>>> performance that can be obtained.  In these particular
> circumstances, ...,
> >>>>>>>>>> you may want to use the low-level RDD API while setting
> >>>>>>>>>> preservesPartitioning to true.  Like this...."
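
A minimal sketch of the pattern described above, using only the public 1.x RDD API (the data and partition counts are made up for illustration): pay the shuffle once with partitionBy, keep the partitioner through mapPartitions with preservesPartitioning = true, and the final reduceByKey over the same partitioner then runs without a second shuffle.

    import org.apache.spark.{HashPartitioner, SparkConf, SparkContext}

    object PreservePartitioningSketch {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(
          new SparkConf().setMaster("local[2]").setAppName("preserve-partitioning"))
        val part = new HashPartitioner(8)

        // Shuffle once to co-locate keys.
        val byKey = sc.parallelize(1 to 10000).map(i => (i % 100, i)).partitionBy(part).cache()

        // mapPartitions would normally drop the partitioner; this flag promises that
        // the keys were not changed, so the HashPartitioner is retained on the result.
        val scaled = byKey.mapPartitions(
          iter => iter.map { case (k, v) => (k, v * 2) },
          preservesPartitioning = true)

        // Same partitioner on both sides => no second shuffle for the aggregation.
        val sums = scaled.reduceByKey(part, _ + _)
        println(sums.count())
        sc.stop()
      }
    }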
> >>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>> On Thu, Nov 12, 2015 at 7:05 PM, Stephen Boesch <
> javadba@gmail.com>
> >>>>>>>>>> wrote:
> >>>>>>>>>>
> >>>>>>>>>> My understanding is that  the RDD's presently have more support
> for
> >>>>>>>>>> complete control of partitioning which is a key consideration
> at scale.
> >>>>>>>>>> While partitioning control is still piecemeal in  DF/DS  it
> would seem
> >>>>>>>>>> premature to make RDD's a second-tier approach to spark dev.
> >>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>> An example is the use of groupBy when we know that the source
> >>>>>>>>>> relation (/RDD) is already partitioned on the grouping
> expressions.  AFAIK
> >>>>>>>>>> the spark sql still does not allow that knowledge to be applied
> to the
> >>>>>>>>>> optimizer - so a full shuffle will be performed. However in the
> native RDD
> >>>>>>>>>> we can use preservesPartitioning=true.
> >>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>> 2015-11-12 17:42 GMT-08:00 Mark Hamstra <
> mark@clearstorydata.com>:
> >>>>>>>>>>
> >>>>>>>>>> The place of the RDD API in 2.0 is also something I've been
> >>>>>>>>>> wondering about.  I think it may be going too far to deprecate
> it, but
> >>>>>>>>>> changing emphasis is something that we might consider.  The RDD
> API came
> >>>>>>>>>> well before DataFrames and DataSets, so programming guides,
> introductory
> >>>>>>>>>> how-to articles and the like have, to this point, also tended
> to emphasize
> >>>>>>>>>> RDDs -- or at least to deal with them early.  What I'm thinking
> is that with
> >>>>>>>>>> 2.0 maybe we should overhaul all the documentation to
> de-emphasize and
> >>>>>>>>>> reposition RDDs.  In this scheme, DataFrames and DataSets would
> be
> >>>>>>>>>> introduced and fully addressed before RDDs.  They would be
> presented as the
> >>>>>>>>>> normal/default/standard way to do things in Spark.  RDDs, in
> contrast, would
> >>>>>>>>>> be presented later as a kind of lower-level,
> closer-to-the-metal API that
> >>>>>>>>>> can be used in atypical, more specialized contexts where
> DataFrames or
> >>>>>>>>>> DataSets don't fully fit.
> >>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>> On Thu, Nov 12, 2015 at 5:17 PM, Cheng, Hao <
> hao.cheng@intel.com>
> >>>>>>>>>> wrote:
> >>>>>>>>>>
> >>>>>>>>>> I am not sure what the best practice for this specific problem,
> but
> >>>>>>>>>> it’s really worth to think about it in 2.0, as it is a painful
> issue for
> >>>>>>>>>> lots of users.
> >>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>> By the way, is it also an opportunity to deprecate the RDD API
> (or
> >>>>>>>>>> internal API only?)? As lots of its functionality overlapping
> with DataFrame
> >>>>>>>>>> or DataSet.
> >>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>> Hao
> >>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>> From: Kostas Sakellis [mailto:kostas@cloudera.com]
> >>>>>>>>>> Sent: Friday, November 13, 2015 5:27 AM
> >>>>>>>>>> To: Nicholas Chammas
> >>>>>>>>>> Cc: Ulanov, Alexander; Nan Zhu; witgo@qq.com;
> dev@spark.apache.org;
> >>>>>>>>>> Reynold Xin
> >>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>> Subject: Re: A proposal for Spark 2.0
> >>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>> I know we want to keep breaking changes to a minimum but I'm
> hoping
> >>>>>>>>>> that with Spark 2.0 we can also look at better classpath
> isolation with user
> >>>>>>>>>> programs. I propose we build on
> spark.{driver|executor}.userClassPathFirst,
> >>>>>>>>>> setting it true by default, and not allow any spark transitive
> dependencies
> >>>>>>>>>> to leak into user code. For backwards compatibility we can have
> a whitelist
> >>>>>>>>>> if we want but I'd be good if we start requiring user apps to
> explicitly
> >>>>>>>>>> pull in all their dependencies. From what I can tell, Hadoop 3
> is also
> >>>>>>>>>> moving in this direction.
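
For reference, these are the existing opt-in settings being referred to; the hypothetical part of the proposal is only flipping them on by default (shown REPL-style):

    import org.apache.spark.SparkConf

    // With these set, classes from the application jar win over Spark's transitive
    // dependencies on both the driver and the executors.
    val conf = new SparkConf()
      .set("spark.driver.userClassPathFirst", "true")
      .set("spark.executor.userClassPathFirst", "true")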
> >>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>> Kostas
> >>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>> On Thu, Nov 12, 2015 at 9:56 AM, Nicholas Chammas
> >>>>>>>>>> <ni...@gmail.com> wrote:
> >>>>>>>>>>
> >>>>>>>>>> With regards to Machine learning, it would be great to move
> useful
> >>>>>>>>>> features from MLlib to ML and deprecate the former. Current
> structure of two
> >>>>>>>>>> separate machine learning packages seems to be somewhat
> confusing.
> >>>>>>>>>>
> >>>>>>>>>> With regards to GraphX, it would be great to deprecate the use
> of
> >>>>>>>>>> RDD in GraphX and switch to Dataframe. This will allow GraphX
> evolve with
> >>>>>>>>>> Tungsten.
> >>>>>>>>>>
> >>>>>>>>>> On that note of deprecating stuff, it might be good to deprecate
> >>>>>>>>>> some things in 2.0 without removing or replacing them
> immediately. That way
> >>>>>>>>>> 2.0 doesn’t have to wait for everything that we want to
> deprecate to be
> >>>>>>>>>> replaced all at once.
> >>>>>>>>>>
> >>>>>>>>>> Nick
> >>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>> On Thu, Nov 12, 2015 at 12:45 PM Ulanov, Alexander
> >>>>>>>>>> <al...@hpe.com> wrote:
> >>>>>>>>>>
> >>>>>>>>>> Parameter Server is a new feature and thus does not match the
> goal
> >>>>>>>>>> of 2.0 is “to fix things that are broken in the current API and
> remove
> >>>>>>>>>> certain deprecated APIs”. At the same time I would be happy to
> have that
> >>>>>>>>>> feature.
> >>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>> With regards to Machine learning, it would be great to move
> useful
> >>>>>>>>>> features from MLlib to ML and deprecate the former. Current
> structure of two
> >>>>>>>>>> separate machine learning packages seems to be somewhat
> confusing.
> >>>>>>>>>>
> >>>>>>>>>> With regards to GraphX, it would be great to deprecate the use
> of
> >>>>>>>>>> RDD in GraphX and switch to Dataframe. This will allow GraphX
> evolve with
> >>>>>>>>>> Tungsten.
> >>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>> Best regards, Alexander
> >>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>> From: Nan Zhu [mailto:zhunanmcgill@gmail.com]
> >>>>>>>>>> Sent: Thursday, November 12, 2015 7:28 AM
> >>>>>>>>>> To: witgo@qq.com
> >>>>>>>>>> Cc: dev@spark.apache.org
> >>>>>>>>>> Subject: Re: A proposal for Spark 2.0
> >>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>> Being specific to Parameter Server, I think the current
> agreement
> >>>>>>>>>> is that PS shall exist as a third-party library instead of a
> component of
> >>>>>>>>>> the core code base, isn’t?
> >>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>> Best,
> >>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>> --
> >>>>>>>>>>
> >>>>>>>>>> Nan Zhu
> >>>>>>>>>>
> >>>>>>>>>> http://codingcat.me
> >>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>> On Thursday, November 12, 2015 at 9:49 AM, witgo@qq.com wrote:
> >>>>>>>>>>
> >>>>>>>>>> Who has the idea of machine learning? Spark missing some
> features
> >>>>>>>>>> for machine learning, For example, the parameter server.
> >>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>> 在 2015年11月12日,05:32,Matei Zaharia <ma...@gmail.com> 写道:
> >>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>> I like the idea of popping out Tachyon to an optional component
> too
> >>>>>>>>>> to reduce the number of dependencies. In the future, it might
> even be useful
> >>>>>>>>>> to do this for Hadoop, but it requires too many API changes to
> be worth
> >>>>>>>>>> doing now.
> >>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>> Regarding Scala 2.12, we should definitely support it
> eventually,
> >>>>>>>>>> but I don't think we need to block 2.0 on that because it can
> be added later
> >>>>>>>>>> too. Has anyone investigated what it would take to run on
> there? I imagine
> >>>>>>>>>> we don't need many code changes, just maybe some REPL stuff.
> >>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>> Needless to say, but I'm all for the idea of making "major"
> >>>>>>>>>> releases as undisruptive as possible in the model Reynold
> proposed. Keeping
> >>>>>>>>>> everyone working with the same set of releases is super
> important.
> >>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>> Matei
> >>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>> On Nov 11, 2015, at 4:58 AM, Sean Owen <so...@cloudera.com>
> wrote:
> >>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>> On Wed, Nov 11, 2015 at 12:10 AM, Reynold Xin <
> rxin@databricks.com>
> >>>>>>>>>> wrote:
> >>>>>>>>>>
> >>>>>>>>>> to the Spark community. A major release should not be very
> >>>>>>>>>> different from a
> >>>>>>>>>>
> >>>>>>>>>> minor release and should not be gated based on new features. The
> >>>>>>>>>> main
> >>>>>>>>>>
> >>>>>>>>>> purpose of a major release is an opportunity to fix things that
> are
> >>>>>>>>>> broken
> >>>>>>>>>>
> >>>>>>>>>> in the current API and remove certain deprecated APIs (examples
> >>>>>>>>>> follow).
> >>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>> Agree with this stance. Generally, a major release might also
> be a
> >>>>>>>>>>
> >>>>>>>>>> time to replace some big old API or implementation with a new
> one,
> >>>>>>>>>> but
> >>>>>>>>>>
> >>>>>>>>>> I don't see obvious candidates.
> >>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>> I wouldn't mind turning attention to 2.x sooner than later,
> unless
> >>>>>>>>>>
> >>>>>>>>>> there's a fairly good reason to continue adding features in 1.x
> to
> >>>>>>>>>> a
> >>>>>>>>>>
> >>>>>>>>>> 1.7 release. The scope as of 1.6 is already pretty darned big.
> >>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>> 1. Scala 2.11 as the default build. We should still support
> Scala
> >>>>>>>>>> 2.10, but
> >>>>>>>>>>
> >>>>>>>>>> it has been end-of-life.
> >>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>> By the time 2.x rolls around, 2.12 will be the main version,
> 2.11
> >>>>>>>>>> will
> >>>>>>>>>>
> >>>>>>>>>> be quite stable, and 2.10 will have been EOL for a while. I'd
> >>>>>>>>>> propose
> >>>>>>>>>>
> >>>>>>>>>> dropping 2.10. Otherwise it's supported for 2 more years.
> >>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>> 2. Remove Hadoop 1 support.
> >>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>> I'd go further to drop support for <2.2 for sure (2.0 and 2.1
> were
> >>>>>>>>>>
> >>>>>>>>>> sort of 'alpha' and 'beta' releases) and even <2.6.
> >>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>> I'm sure we'll think of a number of other small things --
> shading a
> >>>>>>>>>>
> >>>>>>>>>> bunch of stuff? reviewing and updating dependencies in light of
> >>>>>>>>>>
> >>>>>>>>>> simpler, more recent dependencies to support from Hadoop etc?
> >>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>> Farming out Tachyon to a module? (I felt like someone proposed
> >>>>>>>>>> this?)
> >>>>>>>>>>
> >>>>>>>>>> Pop out any Docker stuff to another repo?
> >>>>>>>>>>
> >>>>>>>>>> Continue that same effort for EC2?
> >>>>>>>>>>
> >>>>>>>>>> Farming out some of the "external" integrations to another repo
> (?
> >>>>>>>>>>
> >>>>>>>>>> controversial)
> >>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>> See also anything marked version "2+" in JIRA.
> >>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: dev-unsubscribe@spark.apache.org
> For additional commands, e-mail: dev-help@spark.apache.org
>
>
>
>
>

Re: A proposal for Spark 2.0

Posted by Rad Gruchalski <ra...@gruchalski.com>.
There was a talk in this thread about removing the fine-grained Mesos scheduler. I think it would be a loss to lose it completely; however, I understand that it might be a burden to keep it under development for Mesos only.
Having been thinking about it for a while, it would be great if the schedulers were pluggable. If Spark 2 could offer a way of registering a scheduling mechanism then the Mesos fine-grained scheduler could be moved to a separate project and, possibly, maintained by a separate community.
This would also enable people to add more schedulers in the future - Kubernetes comes into mind but also Docker Swarm would become an option. This would allow growing the ecosystem a bit.

I’d be very interested in working on such a feature.
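
To make the idea slightly more concrete, here is a minimal sketch of what a registration point might look like. Every name below is hypothetical, and TaskScheduler/SchedulerBackend are placeholders standing in for Spark's internal scheduler interfaces:

    // Placeholder stand-ins for Spark's internal scheduler interfaces.
    trait TaskScheduler
    trait SchedulerBackend

    // A hypothetical plugin contract: Spark core would pick the plugin whose canCreate
    // matches the master URL instead of hard-coding each cluster manager.
    trait ClusterManagerPlugin {
      def canCreate(masterUrl: String): Boolean
      def createComponents(masterUrl: String): (TaskScheduler, SchedulerBackend)
    }

    // A fine-grained Mesos scheduler (or Kubernetes, Docker Swarm, ...) could then live
    // in its own project and ship as a separate jar that provides such a plugin.
    class FineGrainedMesosManager extends ClusterManagerPlugin {
      def canCreate(masterUrl: String): Boolean = masterUrl.startsWith("mesos://")
      def createComponents(masterUrl: String): (TaskScheduler, SchedulerBackend) =
        (new TaskScheduler {}, new SchedulerBackend {})  // real implementations would go here
    }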










Kind regards,
Radek Gruchalski
radek@gruchalski.com
de.linkedin.com/in/radgruchalski/

Confidentiality:
This communication is intended for the above-named person and may be confidential and/or legally privileged.
If it has come to you in error you must take no action based on it, nor must you copy or show it to anyone; please delete/destroy and inform the sender immediately.



On Thursday, 3 December 2015 at 21:28, Koert Kuipers wrote:

> spark 1.x has been supporting scala 2.11 for 3 or 4 releases now. seems to me you already provide a clear upgrade path: get on scala 2.11 before upgrading to spark 2.x
>  
> from scala team when scala 2.10.6 came out:
> We strongly encourage you to upgrade to the latest stable version of Scala 2.11.x, as the 2.10.x series is no longer actively maintained.
> > > >>>>>>>>>>
> > > >>>>>>>>>>
> > > >>>>>>>>>>
> > > >>>>>>>>>>
> > > >>>>>>>>>>
> > > >>>>>>>>>> 2. Remove Hadoop 1 support.
> > > >>>>>>>>>>
> > > >>>>>>>>>>
> > > >>>>>>>>>>
> > > >>>>>>>>>> I'd go further to drop support for <2.2 for sure (2.0 and 2.1 were
> > > >>>>>>>>>>
> > > >>>>>>>>>> sort of 'alpha' and 'beta' releases) and even <2.6.
> > > >>>>>>>>>>
> > > >>>>>>>>>>
> > > >>>>>>>>>>
> > > >>>>>>>>>> I'm sure we'll think of a number of other small things -- shading a
> > > >>>>>>>>>>
> > > >>>>>>>>>> bunch of stuff? reviewing and updating dependencies in light of
> > > >>>>>>>>>>
> > > >>>>>>>>>> simpler, more recent dependencies to support from Hadoop etc?
> > > >>>>>>>>>>
> > > >>>>>>>>>>
> > > >>>>>>>>>>
> > > >>>>>>>>>> Farming out Tachyon to a module? (I felt like someone proposed
> > > >>>>>>>>>> this?)
> > > >>>>>>>>>>
> > > >>>>>>>>>> Pop out any Docker stuff to another repo?
> > > >>>>>>>>>>
> > > >>>>>>>>>> Continue that same effort for EC2?
> > > >>>>>>>>>>
> > > >>>>>>>>>> Farming out some of the "external" integrations to another repo (?
> > > >>>>>>>>>>
> > > >>>>>>>>>> controversial)
> > > >>>>>>>>>>
> > > >>>>>>>>>>
> > > >>>>>>>>>>
> > > >>>>>>>>>> See also anything marked version "2+" in JIRA.
> > > >>>>>>>>>>
> > > >>>>>>>>>>
> > > >>>>>>>>>>
> > > >>>>>>>>>>
> > > >>>>>>>>>> ---------------------------------------------------------------------
> > > >>>>>>>>>>
> > > >>>>>>>>>> To unsubscribe, e-mail: dev-unsubscribe@spark.apache.org (mailto:dev-unsubscribe@spark.apache.org)
> > > >>>>>>>>>>
> > > >>>>>>>>>> For additional commands, e-mail: dev-help@spark.apache.org (mailto:dev-help@spark.apache.org)
> > > >>>>>>>>>>
> > > >>>>>>>>>>
> > > >>>>>>>>>>
> > > >>>>>>>>>>
> > > >>>>>>>>>>
> > > >>>>>>>>>>
> > > >>>>>>>>>> ---------------------------------------------------------------------
> > > >>>>>>>>>>
> > > >>>>>>>>>> To unsubscribe, e-mail: dev-unsubscribe@spark.apache.org (mailto:dev-unsubscribe@spark.apache.org)
> > > >>>>>>>>>>
> > > >>>>>>>>>> For additional commands, e-mail: dev-help@spark.apache.org (mailto:dev-help@spark.apache.org)
> > > >>>>>>>>>>
> > > >>>>>>>>>>
> > > >>>>>>>>>>
> > > >>>>>>>>>>
> > > >>>>>>>>>>
> > > >>>>>>>>>>
> > > >>>>>>>>>>
> > > >>>>>>>>>>
> > > >>>>>>>>>>
> > > >>>>>>>>>>
> > > >>>>>>>>>> ---------------------------------------------------------------------
> > > >>>>>>>>>>
> > > >>>>>>>>>> To unsubscribe, e-mail: dev-unsubscribe@spark.apache.org (mailto:dev-unsubscribe@spark.apache.org)
> > > >>>>>>>>>>
> > > >>>>>>>>>> For additional commands, e-mail: dev-help@spark.apache.org (mailto:dev-help@spark.apache.org)
> > > >>>>>>>>>>
> > > >>>>>>>>>>
> > > >>>>>>>>>>
> > > >>>>>>>>>>
> > > >>>>>>>>>>
> > > >>>>>>>>>>
> > > >>>>>>>>>>
> > > >>>>>>>>>>
> > > >>>>>>>>>>
> > > >>>>>>>>>>
> > > >>>>>>>>>
> > > >>>>>>>>>
> > > >>>>>>>>
> > > >>>>>>>
> > > >>>>>>
> > > >>>>>
> > > >>>>
> > > >>>>
> > > >>>
> > > >>
> > >  
> > > ---------------------------------------------------------------------
> > > To unsubscribe, e-mail: dev-unsubscribe@spark.apache.org (mailto:dev-unsubscribe@spark.apache.org)
> > > For additional commands, e-mail: dev-help@spark.apache.org (mailto:dev-help@spark.apache.org)
> > >  
> >  
>  


Re: A proposal for Spark 2.0

Posted by Koert Kuipers <ko...@tresata.com>.
Spark 1.x has been supporting Scala 2.11 for 3 or 4 releases now. It seems to
me you already provide a clear upgrade path: get on Scala 2.11 before
upgrading to Spark 2.x.

From the Scala team when Scala 2.10.6 came out:
We strongly encourage you to upgrade to the latest stable version of Scala
2.11.x, as the 2.10.x series is no longer actively maintained.
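
As a concrete illustration of that upgrade path, a minimal build.sbt for an
existing 1.x application might look like the sketch below; the project name
and exact version numbers are placeholders, not recommendations:

    // Build the application against Scala 2.11 ahead of any Spark 2.x upgrade.
    name := "my-spark-app"

    scalaVersion := "2.11.7"
    // Optionally keep a 2.10 build around while migrating.
    crossScalaVersions := Seq("2.10.6", "2.11.7")

    // %% picks the _2.10 / _2.11 artifact automatically, so the same definition
    // keeps working when the Spark dependency is later bumped to a 2.x version.
    libraryDependencies +=
      "org.apache.spark" %% "spark-core" % "1.5.2" % "provided"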





On Thu, Dec 3, 2015 at 1:03 PM, Mark Hamstra <ma...@clearstorydata.com>
wrote:

> Reynold's post from Nov. 25:
>
> I don't think we should drop support for Scala 2.10, or make it harder in
>> terms of operations for people to upgrade.
>>
>> If there are further objections, I'm going to bump remove the 1.7 version
>> and retarget things to 2.0 on JIRA.
>>
>
> On Thu, Dec 3, 2015 at 12:47 AM, Sean Owen <so...@cloudera.com> wrote:
>
>> Reynold, did you (or someone else) delete version 1.7.0 in JIRA? I
>> think that's premature. If there's a 1.7.0 then we've lost info about
>> what it would contain. It's trivial at any later point to merge the
>> versions. And, since things change and there's not a pressing need to
>> decide one way or the other, it seems fine to at least collect this
>> info like we have things like "1.4.3" that may never be released. I'd
>> like to add it back?
>>
>> On Thu, Nov 26, 2015 at 9:45 AM, Sean Owen <so...@cloudera.com> wrote:
>> > Maintaining both a 1.7 and 2.0 is too much work for the project, which
>> > is over-stretched now. This means that after 1.6 it's just small
>> > maintenance releases in 1.x and no substantial features or evolution.
>> > This means that the "in progress" APIs in 1.x that will stay that way,
>> > unless one updates to 2.x. It's not unreasonable, but means the update
>> > to the 2.x line isn't going to be that optional for users.
>> >
>> > Scala 2.10 is already EOL right? Supporting it in 2.x means supporting
>> > it for a couple years, note. 2.10 is still used today, but that's the
>> > point of the current stable 1.x release in general: if you want to
>> > stick to current dependencies, stick to the current release. Although
>> > I think that's the right way to think about support across major
>> > versions in general, I can see that 2.x is more of a required update
>> > for those following the project's fixes and releases. Hence may indeed
>> > be important to just keep supporting 2.10.
>> >
>> > I can't see supporting 2.12 at the same time (right?). Is that a
>> > concern? it will be long since GA by the time 2.x is first released.
>> >
>> > There's another fairly coherent worldview where development continues
>> > in 1.7 and focuses on finishing the loose ends and lots of bug fixing.
>> > 2.0 is delayed somewhat into next year, and by that time supporting
>> > 2.11+2.12 and Java 8 looks more feasible and more in tune with
>> > currently deployed versions.
>> >
>> > I can't say I have a strong view but I personally hadn't imagined 2.x
>> > would start now.
>> >
>> >
>> > On Thu, Nov 26, 2015 at 7:00 AM, Reynold Xin <rx...@databricks.com>
>> wrote:
>> >> I don't think we should drop support for Scala 2.10, or make it harder
>> in
>> >> terms of operations for people to upgrade.
>> >>
>> >> If there are further objections, I'm going to bump remove the 1.7
>> version
>> >> and retarget things to 2.0 on JIRA.
>> >>
>> >>
>> >> On Wed, Nov 25, 2015 at 12:54 AM, Sandy Ryza <sa...@cloudera.com>
>> >> wrote:
>> >>>
>> >>> I see.  My concern is / was that cluster operators will be reluctant
>> to
>> >>> upgrade to 2.0, meaning that developers using those clusters need to
>> stay on
>> >>> 1.x, and, if they want to move to DataFrames, essentially need to
>> port their
>> >>> app twice.
>> >>>
>> >>> I misunderstood and thought part of the proposal was to drop support
>> for
>> >>> 2.10 though.  If your broad point is that there aren't changes in 2.0
>> that
>> >>> will make it less palatable to cluster administrators than releases
>> in the
>> >>> 1.x line, then yes, 2.0 as the next release sounds fine to me.
>> >>>
>> >>> -Sandy
>> >>>
>> >>>
>> >>> On Tue, Nov 24, 2015 at 11:55 AM, Matei Zaharia <
>> matei.zaharia@gmail.com>
>> >>> wrote:
>> >>>>
>> >>>> What are the other breaking changes in 2.0 though? Note that we're
>> not
>> >>>> removing Scala 2.10, we're just making the default build be against
>> Scala
>> >>>> 2.11 instead of 2.10. There seem to be very few changes that people
>> would
>> >>>> worry about. If people are going to update their apps, I think it's
>> better
>> >>>> to make the other small changes in 2.0 at the same time than to
>> update once
>> >>>> for Dataset and another time for 2.0.
>> >>>>
>> >>>> BTW just refer to Reynold's original post for the other proposed API
>> >>>> changes.
>> >>>>
>> >>>> Matei
>> >>>>
>> >>>> On Nov 24, 2015, at 12:27 PM, Sandy Ryza <sa...@cloudera.com>
>> wrote:
>> >>>>
>> >>>> I think that Kostas' logic still holds.  The majority of Spark
>> users, and
>> >>>> likely an even vaster majority of people running vaster jobs, are
>> still on
>> >>>> RDDs and on the cusp of upgrading to DataFrames.  Users will
>> probably want
>> >>>> to upgrade to the stable version of the Dataset / DataFrame API so
>> they
>> >>>> don't need to do so twice.  Requiring that they absorb all the other
>> ways
>> >>>> that Spark breaks compatibility in the move to 2.0 makes it much more
>> >>>> difficult for them to make this transition.
>> >>>>
>> >>>> Using the same set of APIs also means that it will be easier to
>> backport
>> >>>> critical fixes to the 1.x line.
>> >>>>
>> >>>> It's not clear to me that avoiding breakage of an experimental API
>> in the
>> >>>> 1.x line outweighs these issues.
>> >>>>
>> >>>> -Sandy
>> >>>>
>> >>>> On Mon, Nov 23, 2015 at 10:51 PM, Reynold Xin <rx...@databricks.com>
>> >>>> wrote:
>> >>>>>
>> >>>>> I actually think the next one (after 1.6) should be Spark 2.0. The
>> >>>>> reason is that I already know we have to break some part of the
>> >>>>> DataFrame/Dataset API as part of the Dataset design. (e.g.
>> DataFrame.map
>> >>>>> should return Dataset rather than RDD). In that case, I'd rather
>> break this
>> >>>>> sooner (in one release) than later (in two releases). so the damage
>> is
>> >>>>> smaller.
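
To make the breaking change described above concrete, here is a sketch against
the 1.x API only (the 2.0 shape is still a proposal, so it is described in a
comment rather than coded):

    import org.apache.spark.rdd.RDD
    import org.apache.spark.sql.DataFrame

    // Spark 1.x: mapping over a DataFrame leaves the optimized plan behind and
    // hands back a plain RDD.
    def firstColumn(df: DataFrame): RDD[String] =
      df.map(_.getString(0))

    // Under the proposed Dataset design, the same df.map call would return a
    // Dataset[String], so the declared return type above would no longer
    // compile unchanged - that is the source-level break being discussed.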
>> >>>>>
>> >>>>> I don't think whether we call Dataset/DataFrame experimental or not
>> >>>>> matters too much for 2.0. We can still call Dataset experimental in
>> 2.0 and
>> >>>>> then mark them as stable in 2.1. Despite being "experimental",
>> there has
>> >>>>> been no breaking changes to DataFrame from 1.3 to 1.6.
>> >>>>>
>> >>>>>
>> >>>>>
>> >>>>> On Wed, Nov 18, 2015 at 3:43 PM, Mark Hamstra <
>> mark@clearstorydata.com>
>> >>>>> wrote:
>> >>>>>>
>> >>>>>> Ah, got it; by "stabilize" you meant changing the API, not just bug
>> >>>>>> fixing.  We're on the same page now.
>> >>>>>>
>> >>>>>> On Wed, Nov 18, 2015 at 3:39 PM, Kostas Sakellis <
>> kostas@cloudera.com>
>> >>>>>> wrote:
>> >>>>>>>
>> >>>>>>> A 1.6.x release will only fix bugs - we typically don't change
>> APIs in
>> >>>>>>> z releases. The Dataset API is experimental and so we might be
>> changing the
>> >>>>>>> APIs before we declare it stable. This is why I think it is
>> important to
>> >>>>>>> first stabilize the Dataset API with a Spark 1.7 release before
>> moving to
>> >>>>>>> Spark 2.0. This will benefit users that would like to use the new
>> Dataset
>> >>>>>>> APIs but can't move to Spark 2.0 because of the backwards
>> incompatible
>> >>>>>>> changes, like removal of deprecated APIs, Scala 2.11 etc.
>> >>>>>>>
>> >>>>>>> Kostas
>> >>>>>>>
>> >>>>>>>
>> >>>>>>> On Fri, Nov 13, 2015 at 12:26 PM, Mark Hamstra
>> >>>>>>> <ma...@clearstorydata.com> wrote:
>> >>>>>>>>
>> >>>>>>>> Why does stabilization of those two features require a 1.7
>> release
>> >>>>>>>> instead of 1.6.1?
>> >>>>>>>>
>> >>>>>>>> On Fri, Nov 13, 2015 at 11:40 AM, Kostas Sakellis
>> >>>>>>>> <ko...@cloudera.com> wrote:
>> >>>>>>>>>
>> >>>>>>>>> We have veered off the topic of Spark 2.0 a little bit here -
>> yes we
>> >>>>>>>>> can talk about RDD vs. DS/DF more but lets refocus on Spark
>> 2.0. I'd like to
>> >>>>>>>>> propose we have one more 1.x release after Spark 1.6. This will
>> allow us to
>> >>>>>>>>> stabilize a few of the new features that were added in 1.6:
>> >>>>>>>>>
>> >>>>>>>>> 1) the experimental Datasets API
>> >>>>>>>>> 2) the new unified memory manager.
>> >>>>>>>>>
>> >>>>>>>>> I understand our goal for Spark 2.0 is to offer an easy
>> transition
>> >>>>>>>>> but there will be users that won't be able to seamlessly
>> upgrade given what
>> >>>>>>>>> we have discussed as in scope for 2.0. For these users, having
>> a 1.x release
>> >>>>>>>>> with these new features/APIs stabilized will be very
>> beneficial. This might
>> >>>>>>>>> make Spark 1.7 a lighter release but that is not necessarily a
>> bad thing.
>> >>>>>>>>>
>> >>>>>>>>> Any thoughts on this timeline?
>> >>>>>>>>>
>> >>>>>>>>> Kostas Sakellis
>> >>>>>>>>>
>> >>>>>>>>>
>> >>>>>>>>>
>> >>>>>>>>> On Thu, Nov 12, 2015 at 8:39 PM, Cheng, Hao <
>> hao.cheng@intel.com>
>> >>>>>>>>> wrote:
>> >>>>>>>>>>
>> >>>>>>>>>> Agree, more features/apis/optimization need to be added in
>> DF/DS.
>> >>>>>>>>>>
>> >>>>>>>>>>
>> >>>>>>>>>>
>> >>>>>>>>>> I mean, we need to think about what kind of RDD APIs we have to
>> >>>>>>>>>> provide to developer, maybe the fundamental API is enough,
>> like, the
>> >>>>>>>>>> ShuffledRDD etc..  But PairRDDFunctions probably not in this
>> category, as we
>> >>>>>>>>>> can do the same thing easily with DF/DS, even better
>> performance.
>> >>>>>>>>>>
>> >>>>>>>>>>
>> >>>>>>>>>>
>> >>>>>>>>>> From: Mark Hamstra [mailto:mark@clearstorydata.com]
>> >>>>>>>>>> Sent: Friday, November 13, 2015 11:23 AM
>> >>>>>>>>>> To: Stephen Boesch
>> >>>>>>>>>>
>> >>>>>>>>>>
>> >>>>>>>>>> Cc: dev@spark.apache.org
>> >>>>>>>>>> Subject: Re: A proposal for Spark 2.0
>> >>>>>>>>>>
>> >>>>>>>>>>
>> >>>>>>>>>>
>> >>>>>>>>>> Hmmm... to me, that seems like precisely the kind of thing that
>> >>>>>>>>>> argues for retaining the RDD API but not as the first thing
>> presented to new
>> >>>>>>>>>> Spark developers: "Here's how to use groupBy with
>> DataFrames.... Until the
>> >>>>>>>>>> optimizer is more fully developed, that won't always get you
>> the best
>> >>>>>>>>>> performance that can be obtained.  In these particular
>> circumstances, ...,
>> >>>>>>>>>> you may want to use the low-level RDD API while setting
>> >>>>>>>>>> preservesPartitioning to true.  Like this...."
>> >>>>>>>>>>
>> >>>>>>>>>>
>> >>>>>>>>>>
>> >>>>>>>>>> On Thu, Nov 12, 2015 at 7:05 PM, Stephen Boesch <
>> javadba@gmail.com>
>> >>>>>>>>>> wrote:
>> >>>>>>>>>>
>> >>>>>>>>>> My understanding is that  the RDD's presently have more
>> support for
>> >>>>>>>>>> complete control of partitioning which is a key consideration
>> at scale.
>> >>>>>>>>>> While partitioning control is still piecemeal in  DF/DS  it
>> would seem
>> >>>>>>>>>> premature to make RDD's a second-tier approach to spark dev.
>> >>>>>>>>>>
>> >>>>>>>>>>
>> >>>>>>>>>>
>> >>>>>>>>>> An example is the use of groupBy when we know that the source
>> >>>>>>>>>> relation (/RDD) is already partitioned on the grouping
>> expressions.  AFAIK
>> >>>>>>>>>> the spark sql still does not allow that knowledge to be
>> applied to the
>> >>>>>>>>>> optimizer - so a full shuffle will be performed. However in
>> the native RDD
>> >>>>>>>>>> we can use preservesPartitioning=true.
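
A minimal sketch of the RDD-side pattern described above, assuming an existing
SparkContext `sc`; the sample data and partition count are made up:

    import org.apache.spark.HashPartitioner
    import org.apache.spark.rdd.RDD

    // Co-locate the keys once, up front.
    val pairs: RDD[(String, Int)] = sc
      .parallelize(Seq(("a", 1), ("b", 2), ("a", 3)))
      .partitionBy(new HashPartitioner(8))

    // preservesPartitioning = true promises Spark that the keys were not
    // changed, so the HashPartitioner is retained on the result...
    val cleaned = pairs.mapPartitions(
      iter => iter.filter(_._2 >= 0),
      preservesPartitioning = true)

    // ...and this aggregation reuses that partitioner instead of running
    // another full shuffle.
    val counts = cleaned.reduceByKey(_ + _)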
>> >>>>>>>>>>
>> >>>>>>>>>>
>> >>>>>>>>>>
>> >>>>>>>>>> 2015-11-12 17:42 GMT-08:00 Mark Hamstra <
>> mark@clearstorydata.com>:
>> >>>>>>>>>>
>> >>>>>>>>>> The place of the RDD API in 2.0 is also something I've been
>> >>>>>>>>>> wondering about.  I think it may be going too far to deprecate
>> it, but
>> >>>>>>>>>> changing emphasis is something that we might consider.  The
>> RDD API came
>> >>>>>>>>>> well before DataFrames and DataSets, so programming guides,
>> introductory
>> >>>>>>>>>> how-to articles and the like have, to this point, also tended
>> to emphasize
>> >>>>>>>>>> RDDs -- or at least to deal with them early.  What I'm
>> thinking is that with
>> >>>>>>>>>> 2.0 maybe we should overhaul all the documentation to
>> de-emphasize and
>> >>>>>>>>>> reposition RDDs.  In this scheme, DataFrames and DataSets
>> would be
>> >>>>>>>>>> introduced and fully addressed before RDDs.  They would be
>> presented as the
>> >>>>>>>>>> normal/default/standard way to do things in Spark.  RDDs, in
>> contrast, would
>> >>>>>>>>>> be presented later as a kind of lower-level,
>> closer-to-the-metal API that
>> >>>>>>>>>> can be used in atypical, more specialized contexts where
>> DataFrames or
>> >>>>>>>>>> DataSets don't fully fit.
>> >>>>>>>>>>
>> >>>>>>>>>>
>> >>>>>>>>>>
>> >>>>>>>>>> On Thu, Nov 12, 2015 at 5:17 PM, Cheng, Hao <
>> hao.cheng@intel.com>
>> >>>>>>>>>> wrote:
>> >>>>>>>>>>
>> >>>>>>>>>> I am not sure what the best practice for this specific
>> problem, but
>> >>>>>>>>>> it’s really worth to think about it in 2.0, as it is a painful
>> issue for
>> >>>>>>>>>> lots of users.
>> >>>>>>>>>>
>> >>>>>>>>>>
>> >>>>>>>>>>
>> >>>>>>>>>> By the way, is it also an opportunity to deprecate the RDD API
>> (or
>> >>>>>>>>>> internal API only?)? As lots of its functionality overlapping
>> with DataFrame
>> >>>>>>>>>> or DataSet.
>> >>>>>>>>>>
>> >>>>>>>>>>
>> >>>>>>>>>>
>> >>>>>>>>>> Hao
>> >>>>>>>>>>
>> >>>>>>>>>>
>> >>>>>>>>>>
>> >>>>>>>>>> From: Kostas Sakellis [mailto:kostas@cloudera.com]
>> >>>>>>>>>> Sent: Friday, November 13, 2015 5:27 AM
>> >>>>>>>>>> To: Nicholas Chammas
>> >>>>>>>>>> Cc: Ulanov, Alexander; Nan Zhu; witgo@qq.com;
>> dev@spark.apache.org;
>> >>>>>>>>>> Reynold Xin
>> >>>>>>>>>>
>> >>>>>>>>>>
>> >>>>>>>>>> Subject: Re: A proposal for Spark 2.0
>> >>>>>>>>>>
>> >>>>>>>>>>
>> >>>>>>>>>>
>> >>>>>>>>>> I know we want to keep breaking changes to a minimum but I'm
>> hoping
>> >>>>>>>>>> that with Spark 2.0 we can also look at better classpath
>> isolation with user
>> >>>>>>>>>> programs. I propose we build on
>> spark.{driver|executor}.userClassPathFirst,
>> >>>>>>>>>> setting it true by default, and not allow any spark transitive
>> dependencies
>> >>>>>>>>>> to leak into user code. For backwards compatibility we can
>> have a whitelist
>> >>>>>>>>>> if we want but it'd be good if we start requiring user apps to
>> explicitly
>> >>>>>>>>>> pull in all their dependencies. From what I can tell, Hadoop 3
>> is also
>> >>>>>>>>>> moving in this direction.
>> >>>>>>>>>>
>> >>>>>>>>>>
>> >>>>>>>>>>
>> >>>>>>>>>> Kostas
>> >>>>>>>>>>
>> >>>>>>>>>>
>> >>>>>>>>>>
>> >>>>>>>>>> On Thu, Nov 12, 2015 at 9:56 AM, Nicholas Chammas
>> >>>>>>>>>> <ni...@gmail.com> wrote:
>> >>>>>>>>>>
>> >>>>>>>>>> With regards to Machine learning, it would be great to move
>> useful
>> >>>>>>>>>> features from MLlib to ML and deprecate the former. Current
>> structure of two
>> >>>>>>>>>> separate machine learning packages seems to be somewhat
>> confusing.
>> >>>>>>>>>>
>> >>>>>>>>>> With regards to GraphX, it would be great to deprecate the use
>> of
>> >>>>>>>>>> RDD in GraphX and switch to Dataframe. This will allow GraphX to
>> evolve with
>> >>>>>>>>>> Tungsten.
>> >>>>>>>>>>
>> >>>>>>>>>> On that note of deprecating stuff, it might be good to
>> deprecate
>> >>>>>>>>>> some things in 2.0 without removing or replacing them
>> immediately. That way
>> >>>>>>>>>> 2.0 doesn’t have to wait for everything that we want to
>> deprecate to be
>> >>>>>>>>>> replaced all at once.
>> >>>>>>>>>>
>> >>>>>>>>>> Nick
>> >>>>>>>>>>
>> >>>>>>>>>>
>> >>>>>>>>>>
>> >>>>>>>>>> On Thu, Nov 12, 2015 at 12:45 PM Ulanov, Alexander
>> >>>>>>>>>> <al...@hpe.com> wrote:
>> >>>>>>>>>>
>> >>>>>>>>>> Parameter Server is a new feature and thus does not match the
>> goal
>> >>>>>>>>>> of 2.0, which is “to fix things that are broken in the current API
>> and remove
>> >>>>>>>>>> certain deprecated APIs”. At the same time I would be happy to
>> have that
>> >>>>>>>>>> feature.
>> >>>>>>>>>>
>> >>>>>>>>>>
>> >>>>>>>>>>
>> >>>>>>>>>> With regards to Machine learning, it would be great to move
>> useful
>> >>>>>>>>>> features from MLlib to ML and deprecate the former. Current
>> structure of two
>> >>>>>>>>>> separate machine learning packages seems to be somewhat
>> confusing.
>> >>>>>>>>>>
>> >>>>>>>>>> With regards to GraphX, it would be great to deprecate the use
>> of
>> >>>>>>>>>> RDD in GraphX and switch to Dataframe. This will allow GraphX to
>> evolve with
>> >>>>>>>>>> Tungsten.
>> >>>>>>>>>>
>> >>>>>>>>>>
>> >>>>>>>>>>
>> >>>>>>>>>> Best regards, Alexander
>> >>>>>>>>>>
>> >>>>>>>>>>
>> >>>>>>>>>>
>> >>>>>>>>>> From: Nan Zhu [mailto:zhunanmcgill@gmail.com]
>> >>>>>>>>>> Sent: Thursday, November 12, 2015 7:28 AM
>> >>>>>>>>>> To: witgo@qq.com
>> >>>>>>>>>> Cc: dev@spark.apache.org
>> >>>>>>>>>> Subject: Re: A proposal for Spark 2.0
>> >>>>>>>>>>
>> >>>>>>>>>>
>> >>>>>>>>>>
>> >>>>>>>>>> Being specific to Parameter Server, I think the current
>> agreement
>> >>>>>>>>>> is that PS shall exist as a third-party library instead of a
>> component of
>> >>>>>>>>>> the core code base, isn’t?
>> >>>>>>>>>>
>> >>>>>>>>>>
>> >>>>>>>>>>
>> >>>>>>>>>> Best,
>> >>>>>>>>>>
>> >>>>>>>>>>
>> >>>>>>>>>>
>> >>>>>>>>>> --
>> >>>>>>>>>>
>> >>>>>>>>>> Nan Zhu
>> >>>>>>>>>>
>> >>>>>>>>>> http://codingcat.me
>> >>>>>>>>>>
>> >>>>>>>>>>
>> >>>>>>>>>>
>> >>>>>>>>>> On Thursday, November 12, 2015 at 9:49 AM, witgo@qq.com wrote:
>> >>>>>>>>>>
>> >>>>>>>>>> Who has the idea of machine learning? Spark missing some
>> features
>> >>>>>>>>>> for machine learning, For example, the parameter server.
>> >>>>>>>>>>
>> >>>>>>>>>>
>> >>>>>>>>>>
>> >>>>>>>>>>
>> >>>>>>>>>>
>> >>>>>>>>>> 在 2015年11月12日,05:32,Matei Zaharia <ma...@gmail.com>
>> 写道:
>> >>>>>>>>>>
>> >>>>>>>>>>
>> >>>>>>>>>>
>> >>>>>>>>>> I like the idea of popping out Tachyon to an optional
>> component too
>> >>>>>>>>>> to reduce the number of dependencies. In the future, it might
>> even be useful
>> >>>>>>>>>> to do this for Hadoop, but it requires too many API changes to
>> be worth
>> >>>>>>>>>> doing now.
>> >>>>>>>>>>
>> >>>>>>>>>>
>> >>>>>>>>>>
>> >>>>>>>>>> Regarding Scala 2.12, we should definitely support it
>> eventually,
>> >>>>>>>>>> but I don't think we need to block 2.0 on that because it can
>> be added later
>> >>>>>>>>>> too. Has anyone investigated what it would take to run on
>> there? I imagine
>> >>>>>>>>>> we don't need many code changes, just maybe some REPL stuff.
>> >>>>>>>>>>
>> >>>>>>>>>>
>> >>>>>>>>>>
>> >>>>>>>>>> Needless to say, but I'm all for the idea of making "major"
>> >>>>>>>>>> releases as undisruptive as possible in the model Reynold
>> proposed. Keeping
>> >>>>>>>>>> everyone working with the same set of releases is super
>> important.
>> >>>>>>>>>>
>> >>>>>>>>>>
>> >>>>>>>>>>
>> >>>>>>>>>> Matei
>> >>>>>>>>>>
>> >>>>>>>>>>
>> >>>>>>>>>>
>> >>>>>>>>>> On Nov 11, 2015, at 4:58 AM, Sean Owen <so...@cloudera.com>
>> wrote:
>> >>>>>>>>>>
>> >>>>>>>>>>
>> >>>>>>>>>>
>> >>>>>>>>>> On Wed, Nov 11, 2015 at 12:10 AM, Reynold Xin <
>> rxin@databricks.com>
>> >>>>>>>>>> wrote:
>> >>>>>>>>>>
>> >>>>>>>>>> to the Spark community. A major release should not be very
>> >>>>>>>>>> different from a
>> >>>>>>>>>>
>> >>>>>>>>>> minor release and should not be gated based on new features.
>> The
>> >>>>>>>>>> main
>> >>>>>>>>>>
>> >>>>>>>>>> purpose of a major release is an opportunity to fix things
>> that are
>> >>>>>>>>>> broken
>> >>>>>>>>>>
>> >>>>>>>>>> in the current API and remove certain deprecated APIs (examples
>> >>>>>>>>>> follow).
>> >>>>>>>>>>
>> >>>>>>>>>>
>> >>>>>>>>>>
>> >>>>>>>>>> Agree with this stance. Generally, a major release might also
>> be a
>> >>>>>>>>>>
>> >>>>>>>>>> time to replace some big old API or implementation with a new
>> one,
>> >>>>>>>>>> but
>> >>>>>>>>>>
>> >>>>>>>>>> I don't see obvious candidates.
>> >>>>>>>>>>
>> >>>>>>>>>>
>> >>>>>>>>>>
>> >>>>>>>>>> I wouldn't mind turning attention to 2.x sooner than later,
>> unless
>> >>>>>>>>>>
>> >>>>>>>>>> there's a fairly good reason to continue adding features in
>> 1.x to
>> >>>>>>>>>> a
>> >>>>>>>>>>
>> >>>>>>>>>> 1.7 release. The scope as of 1.6 is already pretty darned big.
>> >>>>>>>>>>
>> >>>>>>>>>>
>> >>>>>>>>>>
>> >>>>>>>>>>
>> >>>>>>>>>>
>> >>>>>>>>>> 1. Scala 2.11 as the default build. We should still support
>> Scala
>> >>>>>>>>>> 2.10, but
>> >>>>>>>>>>
>> >>>>>>>>>> it has been end-of-life.
>> >>>>>>>>>>
>> >>>>>>>>>>
>> >>>>>>>>>>
>> >>>>>>>>>> By the time 2.x rolls around, 2.12 will be the main version,
>> 2.11
>> >>>>>>>>>> will
>> >>>>>>>>>>
>> >>>>>>>>>> be quite stable, and 2.10 will have been EOL for a while. I'd
>> >>>>>>>>>> propose
>> >>>>>>>>>>
>> >>>>>>>>>> dropping 2.10. Otherwise it's supported for 2 more years.
>> >>>>>>>>>>
>> >>>>>>>>>>
>> >>>>>>>>>>
>> >>>>>>>>>>
>> >>>>>>>>>>
>> >>>>>>>>>> 2. Remove Hadoop 1 support.
>> >>>>>>>>>>
>> >>>>>>>>>>
>> >>>>>>>>>>
>> >>>>>>>>>> I'd go further to drop support for <2.2 for sure (2.0 and 2.1
>> were
>> >>>>>>>>>>
>> >>>>>>>>>> sort of 'alpha' and 'beta' releases) and even <2.6.
>> >>>>>>>>>>
>> >>>>>>>>>>
>> >>>>>>>>>>
>> >>>>>>>>>> I'm sure we'll think of a number of other small things --
>> shading a
>> >>>>>>>>>>
>> >>>>>>>>>> bunch of stuff? reviewing and updating dependencies in light of
>> >>>>>>>>>>
>> >>>>>>>>>> simpler, more recent dependencies to support from Hadoop etc?
>> >>>>>>>>>>
>> >>>>>>>>>>
>> >>>>>>>>>>
>> >>>>>>>>>> Farming out Tachyon to a module? (I felt like someone proposed
>> >>>>>>>>>> this?)
>> >>>>>>>>>>
>> >>>>>>>>>> Pop out any Docker stuff to another repo?
>> >>>>>>>>>>
>> >>>>>>>>>> Continue that same effort for EC2?
>> >>>>>>>>>>
>> >>>>>>>>>> Farming out some of the "external" integrations to another
>> repo (?
>> >>>>>>>>>>
>> >>>>>>>>>> controversial)
>> >>>>>>>>>>
>> >>>>>>>>>>
>> >>>>>>>>>>
>> >>>>>>>>>> See also anything marked version "2+" in JIRA.
>> >>>>>>>>>>
>> >>>>>>>>>>
>> >>>>>>>>>>
>> >>>>>>>>>>
>> >>>>>>>>>>
>> ---------------------------------------------------------------------
>> >>>>>>>>>>
>> >>>>>>>>>> To unsubscribe, e-mail: dev-unsubscribe@spark.apache.org
>> >>>>>>>>>>
>> >>>>>>>>>> For additional commands, e-mail: dev-help@spark.apache.org
>> >>>>>>>>>>
>> >>>>>>>>>>
>> >>>>>>>>>>
>> >>>>>>>>>>
>> >>>>>>>>>>
>> >>>>>>>>>>
>> >>>>>>>>>>
>> ---------------------------------------------------------------------
>> >>>>>>>>>>
>> >>>>>>>>>> To unsubscribe, e-mail: dev-unsubscribe@spark.apache.org
>> >>>>>>>>>>
>> >>>>>>>>>> For additional commands, e-mail: dev-help@spark.apache.org
>> >>>>>>>>>>
>> >>>>>>>>>>
>> >>>>>>>>>>
>> >>>>>>>>>>
>> >>>>>>>>>>
>> >>>>>>>>>>
>> >>>>>>>>>>
>> >>>>>>>>>>
>> >>>>>>>>>>
>> >>>>>>>>>>
>> >>>>>>>>>>
>> ---------------------------------------------------------------------
>> >>>>>>>>>>
>> >>>>>>>>>> To unsubscribe, e-mail: dev-unsubscribe@spark.apache.org
>> >>>>>>>>>>
>> >>>>>>>>>> For additional commands, e-mail: dev-help@spark.apache.org
>> >>>>>>>>>>
>> >>>>>>>>>>
>> >>>>>>>>>>
>> >>>>>>>>>>
>> >>>>>>>>>>
>> >>>>>>>>>>
>> >>>>>>>>>>
>> >>>>>>>>>>
>> >>>>>>>>>>
>> >>>>>>>>>>
>> >>>>>>>>>
>> >>>>>>>>>
>> >>>>>>>>
>> >>>>>>>
>> >>>>>>
>> >>>>>
>> >>>>
>> >>>>
>> >>>
>> >>
>>
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: dev-unsubscribe@spark.apache.org
>> For additional commands, e-mail: dev-help@spark.apache.org
>>
>>
>

Re: A proposal for Spark 2.0

Posted by kostas papageorgopoylos <p0...@gmail.com>.
Hi Kostas

With regards to your *second* point: I believe that requiring user
apps to explicitly declare their dependencies is the clearest API
approach when it comes to classpath and classloading.

However, what about the following API: *SparkContext.addJar(String
pathToJar)*? *Is this going to change or be affected in some way?*
Currently I use Spark 1.5.2 in a Java application, and I have built a
utility class that finds the correct path of a dependency jar
(myPathOfTheJarDependency = something like SparkUtils.getJarFullPathFromClass
(EsSparkSQL.class, "^elasticsearch-hadoop-2.2.0-beta1.*\\.jar$");), which
is not beautiful but I can live with it.

Then I call *javaSparkContext.addJar(myPathOfTheJarDependency)* after I
have initialized the javaSparkContext. In that way I do not require my
Spark cluster to have any classpath configuration for my application, and I
explicitly define the dependencies at runtime each time I initiate a
SparkContext. I would be happy, and I believe many other users would be too,
if I could continue having the same or a similar approach to dependencies.
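
A rough Scala sketch of that pattern follows; `jarOf` stands in for the
SparkUtils helper mentioned above, the fully qualified elasticsearch-hadoop
class name is assumed, and the app name is a placeholder:

    import org.apache.spark.{SparkConf, SparkContext}

    // Locate the jar a given class was loaded from (approximation of the
    // jar-finding utility described in the message above).
    def jarOf(clazz: Class[_]): String =
      clazz.getProtectionDomain.getCodeSource.getLocation.toURI.getPath

    val sc = new SparkContext(new SparkConf().setAppName("runtime-deps-example"))

    // Any class known to live inside the dependency jar will do.
    val esJar = jarOf(Class.forName("org.elasticsearch.spark.sql.EsSparkSQL"))

    // addJar ships the jar to the executors at runtime, so the cluster itself
    // needs no pre-configured classpath entry for this dependency.
    sc.addJar(esJar)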


Regards

2015-12-08 23:40 GMT+02:00 Kostas Sakellis <ko...@cloudera.com>:

> I'd also like to make it a requirement that Spark 2.0 have a stable
> dataframe and dataset API - we should not leave these APIs experimental in
> the 2.0 release. We already know of at least one breaking change we need to
> make to dataframes, now's the time to make any other changes we need to
> stabilize these APIs. Anything we can do to make us feel more comfortable
> about the dataset and dataframe APIs before the 2.0 release?
>
> I've also been thinking that in Spark 2.0, we might want to consider
> strict classpath isolation for user applications. Hadoop 3 is moving in
> this direction. We could, for instance, run all user applications in their
> own classloader that only inherits very specific classes from Spark (ie.
> public APIs). This will require user apps to explicitly declare their
> dependencies as there won't be any accidental class leaking anymore. We do
> something like this for *userClasspathFirst option but it is not as strict
> as what I described. This is a breaking change but I think it will help
> with eliminating weird classpath incompatibility issues between user
> applications and Spark system dependencies.
>
> Thoughts?
>
> Kostas
>
>
> On Fri, Dec 4, 2015 at 3:28 AM, Sean Owen <so...@cloudera.com> wrote:
>
>> To be clear-er, I don't think it's clear yet whether a 1.7 release
>> should exist or not. I could see both making sense. It's also not
>> really necessary to decide now, well before a 1.6 is even out in the
>> field. Deleting the version lost information, and I would not have
>> done that given my reply. Reynold maybe I can take this up with you
>> offline.
>>
>> On Thu, Dec 3, 2015 at 6:03 PM, Mark Hamstra <ma...@clearstorydata.com>
>> wrote:
>> > Reynold's post from Nov. 25:
>> >
>> >> I don't think we should drop support for Scala 2.10, or make it harder
>> in
>> >> terms of operations for people to upgrade.
>> >>
>> >> If there are further objections, I'm going to bump remove the 1.7
>> version
>> >> and retarget things to 2.0 on JIRA.
>> >
>> >
>> > On Thu, Dec 3, 2015 at 12:47 AM, Sean Owen <so...@cloudera.com> wrote:
>> >>
>> >> Reynold, did you (or someone else) delete version 1.7.0 in JIRA? I
>> >> think that's premature. If there's a 1.7.0 then we've lost info about
>> >> what it would contain. It's trivial at any later point to merge the
>> >> versions. And, since things change and there's not a pressing need to
>> >> decide one way or the other, it seems fine to at least collect this
>> >> info like we have things like "1.4.3" that may never be released. I'd
>> >> like to add it back?
>> >>
>> >> On Thu, Nov 26, 2015 at 9:45 AM, Sean Owen <so...@cloudera.com> wrote:
>> >> > Maintaining both a 1.7 and 2.0 is too much work for the project,
>> which
>> >> > is over-stretched now. This means that after 1.6 it's just small
>> >> > maintenance releases in 1.x and no substantial features or evolution.
>> >> > This means that the "in progress" APIs in 1.x that will stay that
>> way,
>> >> > unless one updates to 2.x. It's not unreasonable, but means the
>> update
>> >> > to the 2.x line isn't going to be that optional for users.
>> >> >
>> >> > Scala 2.10 is already EOL right? Supporting it in 2.x means
>> supporting
>> >> > it for a couple years, note. 2.10 is still used today, but that's the
>> >> > point of the current stable 1.x release in general: if you want to
>> >> > stick to current dependencies, stick to the current release. Although
>> >> > I think that's the right way to think about support across major
>> >> > versions in general, I can see that 2.x is more of a required update
>> >> > for those following the project's fixes and releases. Hence may
>> indeed
>> >> > be important to just keep supporting 2.10.
>> >> >
>> >> > I can't see supporting 2.12 at the same time (right?). Is that a
>> >> > concern? it will be long since GA by the time 2.x is first released.
>> >> >
>> >> > There's another fairly coherent worldview where development continues
>> >> > in 1.7 and focuses on finishing the loose ends and lots of bug
>> fixing.
>> >> > 2.0 is delayed somewhat into next year, and by that time supporting
>> >> > 2.11+2.12 and Java 8 looks more feasible and more in tune with
>> >> > currently deployed versions.
>> >> >
>> >> > I can't say I have a strong view but I personally hadn't imagined 2.x
>> >> > would start now.
>> >> >
>> >> >
>> >> > On Thu, Nov 26, 2015 at 7:00 AM, Reynold Xin <rx...@databricks.com>
>> >> > wrote:
>> >> >> I don't think we should drop support for Scala 2.10, or make it
>> harder
>> >> >> in
>> >> >> terms of operations for people to upgrade.
>> >> >>
>> >> >> If there are further objections, I'm going to bump remove the 1.7
>> >> >> version
>> >> >> and retarget things to 2.0 on JIRA.
>>
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: dev-unsubscribe@spark.apache.org
>> For additional commands, e-mail: dev-help@spark.apache.org
>>
>>
>

Re: A proposal for Spark 2.0

Posted by Kostas Sakellis <ko...@cloudera.com>.
I'd also like to make it a requirement that Spark 2.0 have stable
DataFrame and Dataset APIs - we should not leave these APIs experimental in
the 2.0 release. We already know of at least one breaking change we need to
make to DataFrames, so now is the time to make any other changes we need to
stabilize these APIs. Is there anything we can do to make us feel more
comfortable about the Dataset and DataFrame APIs before the 2.0 release?

I've also been thinking that in Spark 2.0, we might want to consider strict
classpath isolation for user applications. Hadoop 3 is moving in this
direction. We could, for instance, run all user applications in their own
classloader that only inherits very specific classes from Spark (i.e. public
APIs). This will require user apps to explicitly declare their dependencies,
as there won't be any accidental class leaking anymore. We do something
like this today with the *userClassPathFirst options, but it is not as strict
as what I described. This is a breaking change, but I think it will help with
eliminating weird classpath incompatibility issues between user
applications and Spark system dependencies.
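
For reference, a sketch of today's partial isolation knobs that a stricter
scheme would build on; these are normally passed via spark-submit --conf, and
are shown as SparkConf entries here only for illustration (the app name is a
placeholder):

    import org.apache.spark.SparkConf

    val conf = new SparkConf()
      .setAppName("isolation-example")
      // Prefer classes from the user's jars over Spark's copies on the executors.
      .set("spark.executor.userClassPathFirst", "true")
      // The driver-side flag normally has to be supplied at submit time, before
      // the driver JVM starts, so setting it programmatically only matters for
      // cluster-deployed drivers.
      .set("spark.driver.userClassPathFirst", "true")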

Thoughts?

Kostas


On Fri, Dec 4, 2015 at 3:28 AM, Sean Owen <so...@cloudera.com> wrote:

> To be clear-er, I don't think it's clear yet whether a 1.7 release
> should exist or not. I could see both making sense. It's also not
> really necessary to decide now, well before a 1.6 is even out in the
> field. Deleting the version lost information, and I would not have
> done that given my reply. Reynold maybe I can take this up with you
> offline.
>
> On Thu, Dec 3, 2015 at 6:03 PM, Mark Hamstra <ma...@clearstorydata.com>
> wrote:
> > Reynold's post from Nov. 25:
> >
> >> I don't think we should drop support for Scala 2.10, or make it harder
> in
> >> terms of operations for people to upgrade.
> >>
> >> If there are further objections, I'm going to bump remove the 1.7
> version
> >> and retarget things to 2.0 on JIRA.
> >
> >
> > On Thu, Dec 3, 2015 at 12:47 AM, Sean Owen <so...@cloudera.com> wrote:
> >>
> >> Reynold, did you (or someone else) delete version 1.7.0 in JIRA? I
> >> think that's premature. If there's a 1.7.0 then we've lost info about
> >> what it would contain. It's trivial at any later point to merge the
> >> versions. And, since things change and there's not a pressing need to
> >> decide one way or the other, it seems fine to at least collect this
> >> info like we have things like "1.4.3" that may never be released. I'd
> >> like to add it back?
> >>
> >> On Thu, Nov 26, 2015 at 9:45 AM, Sean Owen <so...@cloudera.com> wrote:
> >> > Maintaining both a 1.7 and 2.0 is too much work for the project, which
> >> > is over-stretched now. This means that after 1.6 it's just small
> >> > maintenance releases in 1.x and no substantial features or evolution.
> >> > This means that the "in progress" APIs in 1.x that will stay that way,
> >> > unless one updates to 2.x. It's not unreasonable, but means the update
> >> > to the 2.x line isn't going to be that optional for users.
> >> >
> >> > Scala 2.10 is already EOL right? Supporting it in 2.x means supporting
> >> > it for a couple years, note. 2.10 is still used today, but that's the
> >> > point of the current stable 1.x release in general: if you want to
> >> > stick to current dependencies, stick to the current release. Although
> >> > I think that's the right way to think about support across major
> >> > versions in general, I can see that 2.x is more of a required update
> >> > for those following the project's fixes and releases. Hence may indeed
> >> > be important to just keep supporting 2.10.
> >> >
> >> > I can't see supporting 2.12 at the same time (right?). Is that a
> >> > concern? it will be long since GA by the time 2.x is first released.
> >> >
> >> > There's another fairly coherent worldview where development continues
> >> > in 1.7 and focuses on finishing the loose ends and lots of bug fixing.
> >> > 2.0 is delayed somewhat into next year, and by that time supporting
> >> > 2.11+2.12 and Java 8 looks more feasible and more in tune with
> >> > currently deployed versions.
> >> >
> >> > I can't say I have a strong view but I personally hadn't imagined 2.x
> >> > would start now.
> >> >
> >> >
> >> > On Thu, Nov 26, 2015 at 7:00 AM, Reynold Xin <rx...@databricks.com>
> >> > wrote:
> >> >> I don't think we should drop support for Scala 2.10, or make it
> harder
> >> >> in
> >> >> terms of operations for people to upgrade.
> >> >>
> >> >> If there are further objections, I'm going to bump remove the 1.7
> >> >> version
> >> >> and retarget things to 2.0 on JIRA.
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: dev-unsubscribe@spark.apache.org
> For additional commands, e-mail: dev-help@spark.apache.org
>
>

Re: A proposal for Spark 2.0

Posted by Sean Owen <so...@cloudera.com>.
To be clear-er, I don't think it's clear yet whether a 1.7 release
should exist or not. I could see both making sense. It's also not
really necessary to decide now, well before a 1.6 is even out in the
field. Deleting the version lost information, and I would not have
done that given my reply. Reynold maybe I can take this up with you
offline.

On Thu, Dec 3, 2015 at 6:03 PM, Mark Hamstra <ma...@clearstorydata.com> wrote:
> Reynold's post from Nov. 25:
>
>> I don't think we should drop support for Scala 2.10, or make it harder in
>> terms of operations for people to upgrade.
>>
>> If there are further objections, I'm going to bump remove the 1.7 version
>> and retarget things to 2.0 on JIRA.
>
>
> On Thu, Dec 3, 2015 at 12:47 AM, Sean Owen <so...@cloudera.com> wrote:
>>
>> Reynold, did you (or someone else) delete version 1.7.0 in JIRA? I
>> think that's premature. If there's a 1.7.0 then we've lost info about
>> what it would contain. It's trivial at any later point to merge the
>> versions. And, since things change and there's not a pressing need to
>> decide one way or the other, it seems fine to at least collect this
>> info like we have things like "1.4.3" that may never be released. I'd
>> like to add it back?
>>
>> On Thu, Nov 26, 2015 at 9:45 AM, Sean Owen <so...@cloudera.com> wrote:
>> > Maintaining both a 1.7 and 2.0 is too much work for the project, which
>> > is over-stretched now. This means that after 1.6 it's just small
>> > maintenance releases in 1.x and no substantial features or evolution.
>> > This means that the "in progress" APIs in 1.x that will stay that way,
>> > unless one updates to 2.x. It's not unreasonable, but means the update
>> > to the 2.x line isn't going to be that optional for users.
>> >
>> > Scala 2.10 is already EOL right? Supporting it in 2.x means supporting
>> > it for a couple years, note. 2.10 is still used today, but that's the
>> > point of the current stable 1.x release in general: if you want to
>> > stick to current dependencies, stick to the current release. Although
>> > I think that's the right way to think about support across major
>> > versions in general, I can see that 2.x is more of a required update
>> > for those following the project's fixes and releases. Hence may indeed
>> > be important to just keep supporting 2.10.
>> >
>> > I can't see supporting 2.12 at the same time (right?). Is that a
>> > concern? it will be long since GA by the time 2.x is first released.
>> >
>> > There's another fairly coherent worldview where development continues
>> > in 1.7 and focuses on finishing the loose ends and lots of bug fixing.
>> > 2.0 is delayed somewhat into next year, and by that time supporting
>> > 2.11+2.12 and Java 8 looks more feasible and more in tune with
>> > currently deployed versions.
>> >
>> > I can't say I have a strong view but I personally hadn't imagined 2.x
>> > would start now.
>> >
>> >
>> > On Thu, Nov 26, 2015 at 7:00 AM, Reynold Xin <rx...@databricks.com>
>> > wrote:
>> >> I don't think we should drop support for Scala 2.10, or make it harder
>> >> in
>> >> terms of operations for people to upgrade.
>> >>
>> >> If there are further objections, I'm going to bump remove the 1.7
>> >> version
>> >> and retarget things to 2.0 on JIRA.

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@spark.apache.org
For additional commands, e-mail: dev-help@spark.apache.org


Re: A proposal for Spark 2.0

Posted by Mark Hamstra <ma...@clearstorydata.com>.
Reynold's post from Nov. 25:

I don't think we should drop support for Scala 2.10, or make it harder in
> terms of operations for people to upgrade.
>
> If there are further objections, I'm going to bump remove the 1.7 version
> and retarget things to 2.0 on JIRA.
>

On Thu, Dec 3, 2015 at 12:47 AM, Sean Owen <so...@cloudera.com> wrote:

> Reynold, did you (or someone else) delete version 1.7.0 in JIRA? I
> think that's premature. If there's a 1.7.0 then we've lost info about
> what it would contain. It's trivial at any later point to merge the
> versions. And, since things change and there's not a pressing need to
> decide one way or the other, it seems fine to at least collect this
> info like we have things like "1.4.3" that may never be released. I'd
> like to add it back?
>
> On Thu, Nov 26, 2015 at 9:45 AM, Sean Owen <so...@cloudera.com> wrote:
> > Maintaining both a 1.7 and 2.0 is too much work for the project, which
> > is over-stretched now. This means that after 1.6 it's just small
> > maintenance releases in 1.x and no substantial features or evolution.
> > This means that the "in progress" APIs in 1.x that will stay that way,
> > unless one updates to 2.x. It's not unreasonable, but means the update
> > to the 2.x line isn't going to be that optional for users.
> >
> > Scala 2.10 is already EOL right? Supporting it in 2.x means supporting
> > it for a couple years, note. 2.10 is still used today, but that's the
> > point of the current stable 1.x release in general: if you want to
> > stick to current dependencies, stick to the current release. Although
> > I think that's the right way to think about support across major
> > versions in general, I can see that 2.x is more of a required update
> > for those following the project's fixes and releases. Hence may indeed
> > be important to just keep supporting 2.10.
> >
> > I can't see supporting 2.12 at the same time (right?). Is that a
> > concern? it will be long since GA by the time 2.x is first released.
> >
> > There's another fairly coherent worldview where development continues
> > in 1.7 and focuses on finishing the loose ends and lots of bug fixing.
> > 2.0 is delayed somewhat into next year, and by that time supporting
> > 2.11+2.12 and Java 8 looks more feasible and more in tune with
> > currently deployed versions.
> >
> > I can't say I have a strong view but I personally hadn't imagined 2.x
> > would start now.
> >
> >
> > On Thu, Nov 26, 2015 at 7:00 AM, Reynold Xin <rx...@databricks.com>
> wrote:
> >> I don't think we should drop support for Scala 2.10, or make it harder
> in
> >> terms of operations for people to upgrade.
> >>
> >> If there are further objections, I'm going to bump remove the 1.7
> version
> >> and retarget things to 2.0 on JIRA.
> >>
> >>
> >> On Wed, Nov 25, 2015 at 12:54 AM, Sandy Ryza <sa...@cloudera.com>
> >> wrote:
> >>>
> >>> I see.  My concern is / was that cluster operators will be reluctant to
> >>> upgrade to 2.0, meaning that developers using those clusters need to
> stay on
> >>> 1.x, and, if they want to move to DataFrames, essentially need to port
> their
> >>> app twice.
> >>>
> >>> I misunderstood and thought part of the proposal was to drop support
> for
> >>> 2.10 though.  If your broad point is that there aren't changes in 2.0
> that
> >>> will make it less palatable to cluster administrators than releases in
> the
> >>> 1.x line, then yes, 2.0 as the next release sounds fine to me.
> >>>
> >>> -Sandy
> >>>
> >>>
> >>> On Tue, Nov 24, 2015 at 11:55 AM, Matei Zaharia <
> matei.zaharia@gmail.com>
> >>> wrote:
> >>>>
> >>>> What are the other breaking changes in 2.0 though? Note that we're not
> >>>> removing Scala 2.10, we're just making the default build be against
> Scala
> >>>> 2.11 instead of 2.10. There seem to be very few changes that people
> would
> >>>> worry about. If people are going to update their apps, I think it's
> better
> >>>> to make the other small changes in 2.0 at the same time than to
> update once
> >>>> for Dataset and another time for 2.0.
> >>>>
> >>>> BTW just refer to Reynold's original post for the other proposed API
> >>>> changes.
> >>>>
> >>>> Matei
> >>>>
> >>>> On Nov 24, 2015, at 12:27 PM, Sandy Ryza <sa...@cloudera.com>
> wrote:
> >>>>
> >>>> I think that Kostas' logic still holds.  The majority of Spark users,
> and
> >>>> likely an even vaster majority of people running vaster jobs, are
> still on
> >>>> RDDs and on the cusp of upgrading to DataFrames.  Users will probably
> want
> >>>> to upgrade to the stable version of the Dataset / DataFrame API so
> they
> >>>> don't need to do so twice.  Requiring that they absorb all the other
> ways
> >>>> that Spark breaks compatibility in the move to 2.0 makes it much more
> >>>> difficult for them to make this transition.
> >>>>
> >>>> Using the same set of APIs also means that it will be easier to
> backport
> >>>> critical fixes to the 1.x line.
> >>>>
> >>>> It's not clear to me that avoiding breakage of an experimental API in
> the
> >>>> 1.x line outweighs these issues.
> >>>>
> >>>> -Sandy
> >>>>
> >>>> On Mon, Nov 23, 2015 at 10:51 PM, Reynold Xin <rx...@databricks.com>
> >>>> wrote:
> >>>>>
> >>>>> I actually think the next one (after 1.6) should be Spark 2.0. The
> >>>>> reason is that I already know we have to break some part of the
> >>>>> DataFrame/Dataset API as part of the Dataset design. (e.g.
> DataFrame.map
> >>>>> should return Dataset rather than RDD). In that case, I'd rather
> break this
> >>>>> sooner (in one release) than later (in two releases). so the damage
> is
> >>>>> smaller.
> >>>>>
> >>>>> I don't think whether we call Dataset/DataFrame experimental or not
> >>>>> matters too much for 2.0. We can still call Dataset experimental in
> 2.0 and
> >>>>> then mark them as stable in 2.1. Despite being "experimental", there
> has
> >>>>> been no breaking changes to DataFrame from 1.3 to 1.6.
> >>>>>
> >>>>>
> >>>>>
> >>>>> On Wed, Nov 18, 2015 at 3:43 PM, Mark Hamstra <
> mark@clearstorydata.com>
> >>>>> wrote:
> >>>>>>
> >>>>>> Ah, got it; by "stabilize" you meant changing the API, not just bug
> >>>>>> fixing.  We're on the same page now.
> >>>>>>
> >>>>>> On Wed, Nov 18, 2015 at 3:39 PM, Kostas Sakellis <
> kostas@cloudera.com>
> >>>>>> wrote:
> >>>>>>>
> >>>>>>> A 1.6.x release will only fix bugs - we typically don't change
> APIs in
> >>>>>>> z releases. The Dataset API is experimental and so we might be
> changing the
> >>>>>>> APIs before we declare it stable. This is why I think it is
> important to
> >>>>>>> first stabilize the Dataset API with a Spark 1.7 release before
> moving to
> >>>>>>> Spark 2.0. This will benefit users that would like to use the new
> Dataset
> >>>>>>> APIs but can't move to Spark 2.0 because of the backwards
> incompatible
> >>>>>>> changes, like removal of deprecated APIs, Scala 2.11 etc.
> >>>>>>>
> >>>>>>> Kostas
> >>>>>>>
> >>>>>>>
> >>>>>>> On Fri, Nov 13, 2015 at 12:26 PM, Mark Hamstra
> >>>>>>> <ma...@clearstorydata.com> wrote:
> >>>>>>>>
> >>>>>>>> Why does stabilization of those two features require a 1.7 release
> >>>>>>>> instead of 1.6.1?
> >>>>>>>>
> >>>>>>>> On Fri, Nov 13, 2015 at 11:40 AM, Kostas Sakellis
> >>>>>>>> <ko...@cloudera.com> wrote:
> >>>>>>>>>
> >>>>>>>>> We have veered off the topic of Spark 2.0 a little bit here -
> yes we
> >>>>>>>>> can talk about RDD vs. DS/DF more but lets refocus on Spark 2.0.
> I'd like to
> >>>>>>>>> propose we have one more 1.x release after Spark 1.6. This will
> allow us to
> >>>>>>>>> stabilize a few of the new features that were added in 1.6:
> >>>>>>>>>
> >>>>>>>>> 1) the experimental Datasets API
> >>>>>>>>> 2) the new unified memory manager.
> >>>>>>>>>
> >>>>>>>>> I understand our goal for Spark 2.0 is to offer an easy
> transition
> >>>>>>>>> but there will be users that won't be able to seamlessly upgrade
> given what
> >>>>>>>>> we have discussed as in scope for 2.0. For these users, having a
> 1.x release
> >>>>>>>>> with these new features/APIs stabilized will be very beneficial.
> This might
> >>>>>>>>> make Spark 1.7 a lighter release but that is not necessarily a
> bad thing.
> >>>>>>>>>
> >>>>>>>>> Any thoughts on this timeline?
> >>>>>>>>>
> >>>>>>>>> Kostas Sakellis
> >>>>>>>>>
> >>>>>>>>>
> >>>>>>>>>
> >>>>>>>>> On Thu, Nov 12, 2015 at 8:39 PM, Cheng, Hao <hao.cheng@intel.com
> >
> >>>>>>>>> wrote:
> >>>>>>>>>>
> >>>>>>>>>> Agree, more features/apis/optimization need to be added in
> DF/DS.
> >>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>> I mean, we need to think about what kind of RDD APIs we have to
> >>>>>>>>>> provide to developer, maybe the fundamental API is enough,
> like, the
> >>>>>>>>>> ShuffledRDD etc..  But PairRDDFunctions probably not in this
> category, as we
> >>>>>>>>>> can do the same thing easily with DF/DS, even better
> performance.
> >>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>> From: Mark Hamstra [mailto:mark@clearstorydata.com]
> >>>>>>>>>> Sent: Friday, November 13, 2015 11:23 AM
> >>>>>>>>>> To: Stephen Boesch
> >>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>> Cc: dev@spark.apache.org
> >>>>>>>>>> Subject: Re: A proposal for Spark 2.0
> >>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>> Hmmm... to me, that seems like precisely the kind of thing that
> >>>>>>>>>> argues for retaining the RDD API but not as the first thing
> presented to new
> >>>>>>>>>> Spark developers: "Here's how to use groupBy with
> DataFrames.... Until the
> >>>>>>>>>> optimizer is more fully developed, that won't always get you
> the best
> >>>>>>>>>> performance that can be obtained.  In these particular
> circumstances, ...,
> >>>>>>>>>> you may want to use the low-level RDD API while setting
> >>>>>>>>>> preservesPartitioning to true.  Like this...."
> >>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>> On Thu, Nov 12, 2015 at 7:05 PM, Stephen Boesch <
> javadba@gmail.com>
> >>>>>>>>>> wrote:
> >>>>>>>>>>
> >>>>>>>>>> My understanding is that  the RDD's presently have more support
> for
> >>>>>>>>>> complete control of partitioning which is a key consideration
> at scale.
> >>>>>>>>>> While partitioning control is still piecemeal in  DF/DS  it
> would seem
> >>>>>>>>>> premature to make RDD's a second-tier approach to spark dev.
> >>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>> An example is the use of groupBy when we know that the source
> >>>>>>>>>> relation (/RDD) is already partitioned on the grouping
> expressions.  AFAIK
> >>>>>>>>>> the spark sql still does not allow that knowledge to be applied
> to the
> >>>>>>>>>> optimizer - so a full shuffle will be performed. However in the
> native RDD
> >>>>>>>>>> we can use preservesPartitioning=true.
> >>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>> 2015-11-12 17:42 GMT-08:00 Mark Hamstra <
> mark@clearstorydata.com>:
> >>>>>>>>>>
> >>>>>>>>>> The place of the RDD API in 2.0 is also something I've been
> >>>>>>>>>> wondering about.  I think it may be going too far to deprecate
> it, but
> >>>>>>>>>> changing emphasis is something that we might consider.  The RDD
> API came
> >>>>>>>>>> well before DataFrames and DataSets, so programming guides,
> introductory
> >>>>>>>>>> how-to articles and the like have, to this point, also tended
> to emphasize
> >>>>>>>>>> RDDs -- or at least to deal with them early.  What I'm thinking
> is that with
> >>>>>>>>>> 2.0 maybe we should overhaul all the documentation to
> de-emphasize and
> >>>>>>>>>> reposition RDDs.  In this scheme, DataFrames and DataSets would
> be
> >>>>>>>>>> introduced and fully addressed before RDDs.  They would be
> presented as the
> >>>>>>>>>> normal/default/standard way to do things in Spark.  RDDs, in
> contrast, would
> >>>>>>>>>> be presented later as a kind of lower-level,
> closer-to-the-metal API that
> >>>>>>>>>> can be used in atypical, more specialized contexts where
> DataFrames or
> >>>>>>>>>> DataSets don't fully fit.
> >>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>> On Thu, Nov 12, 2015 at 5:17 PM, Cheng, Hao <hao.cheng@intel.com> wrote:
> >>>>>>>>>>
> >>>>>>>>>> I am not sure what the best practice for this specific problem, but
> >>>>>>>>>> it’s really worth to think about it in 2.0, as it is a painful issue for
> >>>>>>>>>> lots of users.
> >>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>> By the way, is it also an opportunity to deprecate the RDD API (or
> >>>>>>>>>> internal API only?)? As lots of its functionality overlapping with DataFrame
> >>>>>>>>>> or DataSet.
> >>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>> Hao
> >>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>> From: Kostas Sakellis [mailto:kostas@cloudera.com]
> >>>>>>>>>> Sent: Friday, November 13, 2015 5:27 AM
> >>>>>>>>>> To: Nicholas Chammas
> >>>>>>>>>> Cc: Ulanov, Alexander; Nan Zhu; witgo@qq.com; dev@spark.apache.org;
> >>>>>>>>>> Reynold Xin
> >>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>> Subject: Re: A proposal for Spark 2.0
> >>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>> I know we want to keep breaking changes to a minimum but I'm hoping
> >>>>>>>>>> that with Spark 2.0 we can also look at better classpath isolation with user
> >>>>>>>>>> programs. I propose we build on spark.{driver|executor}.userClassPathFirst,
> >>>>>>>>>> setting it true by default, and not allow any spark transitive dependencies
> >>>>>>>>>> to leak into user code. For backwards compatibility we can have a whitelist
> >>>>>>>>>> if we want but I'd be good if we start requiring user apps to explicitly
> >>>>>>>>>> pull in all their dependencies. From what I can tell, Hadoop 3 is also
> >>>>>>>>>> moving in this direction.
> >>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>> Kostas
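
For reference, a small sketch of the two existing flags this proposal would build on; they already exist in 1.x and default to false, and the idea above is essentially to flip that default in 2.0 (app name and local master are placeholders):

    import org.apache.spark.{SparkConf, SparkContext}

    object UserClassPathFirstSketch {
      def main(args: Array[String]): Unit = {
        // Both configs exist today (marked experimental) and default to false;
        // the proposal is to make "true" the default rather than an opt-in.
        val conf = new SparkConf()
          .setAppName("classpath-isolation-sketch")
          .setMaster("local[*]")
          .set("spark.driver.userClassPathFirst", "true")
          .set("spark.executor.userClassPathFirst", "true")
        val sc = new SparkContext(conf)
        sc.stop()
      }
    }
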
> >>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>> On Thu, Nov 12, 2015 at 9:56 AM, Nicholas Chammas
> >>>>>>>>>> <ni...@gmail.com> wrote:
> >>>>>>>>>>
> >>>>>>>>>> With regards to Machine learning, it would be great to move useful
> >>>>>>>>>> features from MLlib to ML and deprecate the former. Current structure of two
> >>>>>>>>>> separate machine learning packages seems to be somewhat confusing.
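
A tiny sketch of the duplication in question -- the same algorithm name currently resolves in both the RDD-based spark.mllib package and the DataFrame-based spark.ml package (KMeans is just one example of the overlap):

    // Both imports resolve against a 1.x build, which is the overlap being described.
    import org.apache.spark.ml.clustering.{KMeans => MlKMeans}        // DataFrame-based "spark.ml"
    import org.apache.spark.mllib.clustering.{KMeans => MllibKMeans}  // RDD-based "spark.mllib"

    object TwoMlPackagesSketch {
      def main(args: Array[String]): Unit = {
        println(classOf[MlKMeans].getName)     // org.apache.spark.ml.clustering.KMeans
        println(classOf[MllibKMeans].getName)  // org.apache.spark.mllib.clustering.KMeans
      }
    }
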
> >>>>>>>>>>
> >>>>>>>>>> With regards to GraphX, it would be great to deprecate the use of
> >>>>>>>>>> RDD in GraphX and switch to Dataframe. This will allow GraphX evolve with
> >>>>>>>>>> Tungsten.
> >>>>>>>>>>
> >>>>>>>>>> On that note of deprecating stuff, it might be good to deprecate
> >>>>>>>>>> some things in 2.0 without removing or replacing them
> immediately. That way
> >>>>>>>>>> 2.0 doesn’t have to wait for everything that we want to
> deprecate to be
> >>>>>>>>>> replaced all at once.
> >>>>>>>>>>
> >>>>>>>>>> Nick
> >>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>> On Thu, Nov 12, 2015 at 12:45 PM Ulanov, Alexander
> >>>>>>>>>> <al...@hpe.com> wrote:
> >>>>>>>>>>
> >>>>>>>>>> Parameter Server is a new feature and thus does not match the
> goal
> >>>>>>>>>> of 2.0 is “to fix things that are broken in the current API and
> remove
> >>>>>>>>>> certain deprecated APIs”. At the same time I would be happy to
> have that
> >>>>>>>>>> feature.
> >>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>> With regards to Machine learning, it would be great to move
> useful
> >>>>>>>>>> features from MLlib to ML and deprecate the former. Current
> structure of two
> >>>>>>>>>> separate machine learning packages seems to be somewhat
> confusing.
> >>>>>>>>>>
> >>>>>>>>>> With regards to GraphX, it would be great to deprecate the use
> of
> >>>>>>>>>> RDD in GraphX and switch to Dataframe. This will allow GraphX
> evolve with
> >>>>>>>>>> Tungsten.
> >>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>> Best regards, Alexander
> >>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>> From: Nan Zhu [mailto:zhunanmcgill@gmail.com]
> >>>>>>>>>> Sent: Thursday, November 12, 2015 7:28 AM
> >>>>>>>>>> To: witgo@qq.com
> >>>>>>>>>> Cc: dev@spark.apache.org
> >>>>>>>>>> Subject: Re: A proposal for Spark 2.0
> >>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>> Being specific to Parameter Server, I think the current
> agreement
> >>>>>>>>>> is that PS shall exist as a third-party library instead of a
> component of
> >>>>>>>>>> the core code base, isn’t?
> >>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>> Best,
> >>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>> --
> >>>>>>>>>>
> >>>>>>>>>> Nan Zhu
> >>>>>>>>>>
> >>>>>>>>>> http://codingcat.me
> >>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>> On Thursday, November 12, 2015 at 9:49 AM, witgo@qq.com wrote:
> >>>>>>>>>>
> >>>>>>>>>> Who has the idea of machine learning? Spark missing some
> features
> >>>>>>>>>> for machine learning, For example, the parameter server.
> >>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>> On Nov 12, 2015, at 05:32, Matei Zaharia <ma...@gmail.com> wrote:
> >>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>> I like the idea of popping out Tachyon to an optional component
> too
> >>>>>>>>>> to reduce the number of dependencies. In the future, it might
> even be useful
> >>>>>>>>>> to do this for Hadoop, but it requires too many API changes to
> be worth
> >>>>>>>>>> doing now.
> >>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>> Regarding Scala 2.12, we should definitely support it
> eventually,
> >>>>>>>>>> but I don't think we need to block 2.0 on that because it can
> be added later
> >>>>>>>>>> too. Has anyone investigated what it would take to run on
> there? I imagine
> >>>>>>>>>> we don't need many code changes, just maybe some REPL stuff.
> >>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>> Needless to say, but I'm all for the idea of making "major"
> >>>>>>>>>> releases as undisruptive as possible in the model Reynold
> proposed. Keeping
> >>>>>>>>>> everyone working with the same set of releases is super
> important.
> >>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>> Matei
> >>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>> On Nov 11, 2015, at 4:58 AM, Sean Owen <so...@cloudera.com>
> wrote:
> >>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>> On Wed, Nov 11, 2015 at 12:10 AM, Reynold Xin <
> rxin@databricks.com>
> >>>>>>>>>> wrote:
> >>>>>>>>>>
> >>>>>>>>>> to the Spark community. A major release should not be very
> >>>>>>>>>> different from a
> >>>>>>>>>>
> >>>>>>>>>> minor release and should not be gated based on new features. The
> >>>>>>>>>> main
> >>>>>>>>>>
> >>>>>>>>>> purpose of a major release is an opportunity to fix things that
> are
> >>>>>>>>>> broken
> >>>>>>>>>>
> >>>>>>>>>> in the current API and remove certain deprecated APIs (examples
> >>>>>>>>>> follow).
> >>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>> Agree with this stance. Generally, a major release might also
> be a
> >>>>>>>>>>
> >>>>>>>>>> time to replace some big old API or implementation with a new
> one,
> >>>>>>>>>> but
> >>>>>>>>>>
> >>>>>>>>>> I don't see obvious candidates.
> >>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>> I wouldn't mind turning attention to 2.x sooner than later,
> unless
> >>>>>>>>>>
> >>>>>>>>>> there's a fairly good reason to continue adding features in 1.x
> to
> >>>>>>>>>> a
> >>>>>>>>>>
> >>>>>>>>>> 1.7 release. The scope as of 1.6 is already pretty darned big.
> >>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>> 1. Scala 2.11 as the default build. We should still support
> Scala
> >>>>>>>>>> 2.10, but
> >>>>>>>>>>
> >>>>>>>>>> it has been end-of-life.
> >>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>> By the time 2.x rolls around, 2.12 will be the main version,
> 2.11
> >>>>>>>>>> will
> >>>>>>>>>>
> >>>>>>>>>> be quite stable, and 2.10 will have been EOL for a while. I'd
> >>>>>>>>>> propose
> >>>>>>>>>>
> >>>>>>>>>> dropping 2.10. Otherwise it's supported for 2 more years.
> >>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>> 2. Remove Hadoop 1 support.
> >>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>> I'd go further to drop support for <2.2 for sure (2.0 and 2.1
> were
> >>>>>>>>>>
> >>>>>>>>>> sort of 'alpha' and 'beta' releases) and even <2.6.
> >>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>> I'm sure we'll think of a number of other small things --
> shading a
> >>>>>>>>>>
> >>>>>>>>>> bunch of stuff? reviewing and updating dependencies in light of
> >>>>>>>>>>
> >>>>>>>>>> simpler, more recent dependencies to support from Hadoop etc?
> >>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>> Farming out Tachyon to a module? (I felt like someone proposed
> >>>>>>>>>> this?)
> >>>>>>>>>>
> >>>>>>>>>> Pop out any Docker stuff to another repo?
> >>>>>>>>>>
> >>>>>>>>>> Continue that same effort for EC2?
> >>>>>>>>>>
> >>>>>>>>>> Farming out some of the "external" integrations to another repo
> (?
> >>>>>>>>>>
> >>>>>>>>>> controversial)
> >>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>> See also anything marked version "2+" in JIRA.
> >>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>>
> ---------------------------------------------------------------------
> >>>>>>>>>>
> >>>>>>>>>> To unsubscribe, e-mail: dev-unsubscribe@spark.apache.org
> >>>>>>>>>>
> >>>>>>>>>> For additional commands, e-mail: dev-help@spark.apache.org
> >>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>>
> ---------------------------------------------------------------------
> >>>>>>>>>>
> >>>>>>>>>> To unsubscribe, e-mail: dev-unsubscribe@spark.apache.org
> >>>>>>>>>>
> >>>>>>>>>> For additional commands, e-mail: dev-help@spark.apache.org
> >>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>>
> ---------------------------------------------------------------------
> >>>>>>>>>>
> >>>>>>>>>> To unsubscribe, e-mail: dev-unsubscribe@spark.apache.org
> >>>>>>>>>>
> >>>>>>>>>> For additional commands, e-mail: dev-help@spark.apache.org
> >>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>
> >>>>>>>>>
> >>>>>>>>
> >>>>>>>
> >>>>>>
> >>>>>
> >>>>
> >>>>
> >>>
> >>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: dev-unsubscribe@spark.apache.org
> For additional commands, e-mail: dev-help@spark.apache.org
>
>

Re: A proposal for Spark 2.0

Posted by Sean Owen <so...@cloudera.com>.
Reynold, did you (or someone else) delete version 1.7.0 in JIRA? I
think that's premature. If there's a 1.7.0 then we've lost info about
what it would contain. It's trivial at any later point to merge the
versions. And, since things change and there's not a pressing need to
decide one way or the other, it seems fine to at least collect this
info like we have things like "1.4.3" that may never be released. I'd
like to add it back?

On Thu, Nov 26, 2015 at 9:45 AM, Sean Owen <so...@cloudera.com> wrote:
> Maintaining both a 1.7 and 2.0 is too much work for the project, which
> is over-stretched now. This means that after 1.6 it's just small
> maintenance releases in 1.x and no substantial features or evolution.
> This means that the "in progress" APIs in 1.x will stay that way,
> unless one updates to 2.x. It's not unreasonable, but means the update
> to the 2.x line isn't going to be that optional for users.
>
> Scala 2.10 is already EOL right? Supporting it in 2.x means supporting
> it for a couple years, note. 2.10 is still used today, but that's the
> point of the current stable 1.x release in general: if you want to
> stick to current dependencies, stick to the current release. Although
> I think that's the right way to think about support across major
> versions in general, I can see that 2.x is more of a required update
> for those following the project's fixes and releases. Hence may indeed
> be important to just keep supporting 2.10.
>
> I can't see supporting 2.12 at the same time (right?). Is that a
> concern? it will be long since GA by the time 2.x is first released.
>
> There's another fairly coherent worldview where development continues
> in 1.7 and focuses on finishing the loose ends and lots of bug fixing.
> 2.0 is delayed somewhat into next year, and by that time supporting
> 2.11+2.12 and Java 8 looks more feasible and more in tune with
> currently deployed versions.
>
> I can't say I have a strong view but I personally hadn't imagined 2.x
> would start now.
>
>
> On Thu, Nov 26, 2015 at 7:00 AM, Reynold Xin <rx...@databricks.com> wrote:
>> I don't think we should drop support for Scala 2.10, or make it harder in
>> terms of operations for people to upgrade.
>>
>> If there are further objections, I'm going to bump remove the 1.7 version
>> and retarget things to 2.0 on JIRA.
>>
>>
>> On Wed, Nov 25, 2015 at 12:54 AM, Sandy Ryza <sa...@cloudera.com>
>> wrote:
>>>
>>> I see.  My concern is / was that cluster operators will be reluctant to
>>> upgrade to 2.0, meaning that developers using those clusters need to stay on
>>> 1.x, and, if they want to move to DataFrames, essentially need to port their
>>> app twice.
>>>
>>> I misunderstood and thought part of the proposal was to drop support for
>>> 2.10 though.  If your broad point is that there aren't changes in 2.0 that
>>> will make it less palatable to cluster administrators than releases in the
>>> 1.x line, then yes, 2.0 as the next release sounds fine to me.
>>>
>>> -Sandy
>>>
>>>
>>> On Tue, Nov 24, 2015 at 11:55 AM, Matei Zaharia <ma...@gmail.com>
>>> wrote:
>>>>
>>>> What are the other breaking changes in 2.0 though? Note that we're not
>>>> removing Scala 2.10, we're just making the default build be against Scala
>>>> 2.11 instead of 2.10. There seem to be very few changes that people would
>>>> worry about. If people are going to update their apps, I think it's better
>>>> to make the other small changes in 2.0 at the same time than to update once
>>>> for Dataset and another time for 2.0.
>>>>
>>>> BTW just refer to Reynold's original post for the other proposed API
>>>> changes.
>>>>
>>>> Matei
>>>>
>>>> On Nov 24, 2015, at 12:27 PM, Sandy Ryza <sa...@cloudera.com> wrote:
>>>>
>>>> I think that Kostas' logic still holds.  The majority of Spark users, and
>>>> likely an even vaster majority of people running vaster jobs, are still on
>>>> RDDs and on the cusp of upgrading to DataFrames.  Users will probably want
>>>> to upgrade to the stable version of the Dataset / DataFrame API so they
>>>> don't need to do so twice.  Requiring that they absorb all the other ways
>>>> that Spark breaks compatibility in the move to 2.0 makes it much more
>>>> difficult for them to make this transition.
>>>>
>>>> Using the same set of APIs also means that it will be easier to backport
>>>> critical fixes to the 1.x line.
>>>>
>>>> It's not clear to me that avoiding breakage of an experimental API in the
>>>> 1.x line outweighs these issues.
>>>>
>>>> -Sandy
>>>>
>>>> On Mon, Nov 23, 2015 at 10:51 PM, Reynold Xin <rx...@databricks.com>
>>>> wrote:
>>>>>
>>>>> I actually think the next one (after 1.6) should be Spark 2.0. The
>>>>> reason is that I already know we have to break some part of the
>>>>> DataFrame/Dataset API as part of the Dataset design. (e.g. DataFrame.map
>>>>> should return Dataset rather than RDD). In that case, I'd rather break this
>>>>> sooner (in one release) than later (in two releases). so the damage is
>>>>> smaller.
>>>>>
>>>>> I don't think whether we call Dataset/DataFrame experimental or not
>>>>> matters too much for 2.0. We can still call Dataset experimental in 2.0 and
>>>>> then mark them as stable in 2.1. Despite being "experimental", there has
>>>>> been no breaking changes to DataFrame from 1.3 to 1.6.
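
A small, untested sketch of the DataFrame.map point above, written against the 1.x API -- today the call drops out of the optimized plan and returns an RDD, whereas the 2.0 proposal would keep the result as a Dataset (toy data and object name are invented):

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.sql.SQLContext

    object MapReturnTypeSketch {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(new SparkConf().setAppName("map-sketch").setMaster("local[*]"))
        val sqlContext = new SQLContext(sc)
        import sqlContext.implicits._

        val df = sc.parallelize(Seq(("a", 1), ("bb", 2))).toDF("k", "v")

        // 1.x: the result is an RDD[Int], so Catalyst/Tungsten stop applying here.
        // Under the 2.0 proposal, map would stay in the Dataset world instead.
        val lengths = df.map(row => row.getString(0).length)

        println(lengths.collect().toSeq)
        sc.stop()
      }
    }
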
>>>>>
>>>>>
>>>>>
>>>>> On Wed, Nov 18, 2015 at 3:43 PM, Mark Hamstra <ma...@clearstorydata.com>
>>>>> wrote:
>>>>>>
>>>>>> Ah, got it; by "stabilize" you meant changing the API, not just bug
>>>>>> fixing.  We're on the same page now.
>>>>>>
>>>>>> On Wed, Nov 18, 2015 at 3:39 PM, Kostas Sakellis <ko...@cloudera.com>
>>>>>> wrote:
>>>>>>>
>>>>>>> A 1.6.x release will only fix bugs - we typically don't change APIs in
>>>>>>> z releases. The Dataset API is experimental and so we might be changing the
>>>>>>> APIs before we declare it stable. This is why I think it is important to
>>>>>>> first stabilize the Dataset API with a Spark 1.7 release before moving to
>>>>>>> Spark 2.0. This will benefit users that would like to use the new Dataset
>>>>>>> APIs but can't move to Spark 2.0 because of the backwards incompatible
>>>>>>> changes, like removal of deprecated APIs, Scala 2.11 etc.
>>>>>>>
>>>>>>> Kostas
>>>>>>>
>>>>>>>
>>>>>>> On Fri, Nov 13, 2015 at 12:26 PM, Mark Hamstra
>>>>>>> <ma...@clearstorydata.com> wrote:
>>>>>>>>
>>>>>>>> Why does stabilization of those two features require a 1.7 release
>>>>>>>> instead of 1.6.1?
>>>>>>>>
>>>>>>>> On Fri, Nov 13, 2015 at 11:40 AM, Kostas Sakellis
>>>>>>>> <ko...@cloudera.com> wrote:
>>>>>>>>>
>>>>>>>>> We have veered off the topic of Spark 2.0 a little bit here - yes we
>>>>>>>>> can talk about RDD vs. DS/DF more but lets refocus on Spark 2.0. I'd like to
>>>>>>>>> propose we have one more 1.x release after Spark 1.6. This will allow us to
>>>>>>>>> stabilize a few of the new features that were added in 1.6:
>>>>>>>>>
>>>>>>>>> 1) the experimental Datasets API
>>>>>>>>> 2) the new unified memory manager.
>>>>>>>>>
>>>>>>>>> I understand our goal for Spark 2.0 is to offer an easy transition
>>>>>>>>> but there will be users that won't be able to seamlessly upgrade given what
>>>>>>>>> we have discussed as in scope for 2.0. For these users, having a 1.x release
>>>>>>>>> with these new features/APIs stabilized will be very beneficial. This might
>>>>>>>>> make Spark 1.7 a lighter release but that is not necessarily a bad thing.
>>>>>>>>>
>>>>>>>>> Any thoughts on this timeline?
>>>>>>>>>
>>>>>>>>> Kostas Sakellis
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On Thu, Nov 12, 2015 at 8:39 PM, Cheng, Hao <ha...@intel.com>
>>>>>>>>> wrote:
>>>>>>>>>>
>>>>>>>>>> Agree, more features/apis/optimization need to be added in DF/DS.
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> I mean, we need to think about what kind of RDD APIs we have to
>>>>>>>>>> provide to developer, maybe the fundamental API is enough, like, the
>>>>>>>>>> ShuffledRDD etc..  But PairRDDFunctions probably not in this category, as we
>>>>>>>>>> can do the same thing easily with DF/DS, even better performance.
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> From: Mark Hamstra [mailto:mark@clearstorydata.com]
>>>>>>>>>> Sent: Friday, November 13, 2015 11:23 AM
>>>>>>>>>> To: Stephen Boesch
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> Cc: dev@spark.apache.org
>>>>>>>>>> Subject: Re: A proposal for Spark 2.0
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> Hmmm... to me, that seems like precisely the kind of thing that
>>>>>>>>>> argues for retaining the RDD API but not as the first thing presented to new
>>>>>>>>>> Spark developers: "Here's how to use groupBy with DataFrames.... Until the
>>>>>>>>>> optimizer is more fully developed, that won't always get you the best
>>>>>>>>>> performance that can be obtained.  In these particular circumstances, ...,
>>>>>>>>>> you may want to use the low-level RDD API while setting
>>>>>>>>>> preservesPartitioning to true.  Like this...."
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> On Thu, Nov 12, 2015 at 7:05 PM, Stephen Boesch <ja...@gmail.com>
>>>>>>>>>> wrote:
>>>>>>>>>>
>>>>>>>>>> My understanding is that  the RDD's presently have more support for
>>>>>>>>>> complete control of partitioning which is a key consideration at scale.
>>>>>>>>>> While partitioning control is still piecemeal in  DF/DS  it would seem
>>>>>>>>>> premature to make RDD's a second-tier approach to spark dev.
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> An example is the use of groupBy when we know that the source
>>>>>>>>>> relation (/RDD) is already partitioned on the grouping expressions.  AFAIK
>>>>>>>>>> the spark sql still does not allow that knowledge to be applied to the
>>>>>>>>>> optimizer - so a full shuffle will be performed. However in the native RDD
>>>>>>>>>> we can use preservesPartitioning=true.
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> 2015-11-12 17:42 GMT-08:00 Mark Hamstra <ma...@clearstorydata.com>:
>>>>>>>>>>
>>>>>>>>>> The place of the RDD API in 2.0 is also something I've been
>>>>>>>>>> wondering about.  I think it may be going too far to deprecate it, but
>>>>>>>>>> changing emphasis is something that we might consider.  The RDD API came
>>>>>>>>>> well before DataFrames and DataSets, so programming guides, introductory
>>>>>>>>>> how-to articles and the like have, to this point, also tended to emphasize
>>>>>>>>>> RDDs -- or at least to deal with them early.  What I'm thinking is that with
>>>>>>>>>> 2.0 maybe we should overhaul all the documentation to de-emphasize and
>>>>>>>>>> reposition RDDs.  In this scheme, DataFrames and DataSets would be
>>>>>>>>>> introduced and fully addressed before RDDs.  They would be presented as the
>>>>>>>>>> normal/default/standard way to do things in Spark.  RDDs, in contrast, would
>>>>>>>>>> be presented later as a kind of lower-level, closer-to-the-metal API that
>>>>>>>>>> can be used in atypical, more specialized contexts where DataFrames or
>>>>>>>>>> DataSets don't fully fit.
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> On Thu, Nov 12, 2015 at 5:17 PM, Cheng, Hao <ha...@intel.com>
>>>>>>>>>> wrote:
>>>>>>>>>>
>>>>>>>>>> I am not sure what the best practice for this specific problem, but
>>>>>>>>>> it’s really worth to think about it in 2.0, as it is a painful issue for
>>>>>>>>>> lots of users.
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> By the way, is it also an opportunity to deprecate the RDD API (or
>>>>>>>>>> internal API only?)? As lots of its functionality overlapping with DataFrame
>>>>>>>>>> or DataSet.
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> Hao
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> From: Kostas Sakellis [mailto:kostas@cloudera.com]
>>>>>>>>>> Sent: Friday, November 13, 2015 5:27 AM
>>>>>>>>>> To: Nicholas Chammas
>>>>>>>>>> Cc: Ulanov, Alexander; Nan Zhu; witgo@qq.com; dev@spark.apache.org;
>>>>>>>>>> Reynold Xin
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> Subject: Re: A proposal for Spark 2.0
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> I know we want to keep breaking changes to a minimum but I'm hoping
>>>>>>>>>> that with Spark 2.0 we can also look at better classpath isolation with user
>>>>>>>>>> programs. I propose we build on spark.{driver|executor}.userClassPathFirst,
>>>>>>>>>> setting it true by default, and not allow any spark transitive dependencies
>>>>>>>>>> to leak into user code. For backwards compatibility we can have a whitelist
>>>>>>>>>> if we want but I'd be good if we start requiring user apps to explicitly
>>>>>>>>>> pull in all their dependencies. From what I can tell, Hadoop 3 is also
>>>>>>>>>> moving in this direction.
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> Kostas
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> On Thu, Nov 12, 2015 at 9:56 AM, Nicholas Chammas
>>>>>>>>>> <ni...@gmail.com> wrote:
>>>>>>>>>>
>>>>>>>>>> With regards to Machine learning, it would be great to move useful
>>>>>>>>>> features from MLlib to ML and deprecate the former. Current structure of two
>>>>>>>>>> separate machine learning packages seems to be somewhat confusing.
>>>>>>>>>>
>>>>>>>>>> With regards to GraphX, it would be great to deprecate the use of
>>>>>>>>>> RDD in GraphX and switch to Dataframe. This will allow GraphX evolve with
>>>>>>>>>> Tungsten.
>>>>>>>>>>
>>>>>>>>>> On that note of deprecating stuff, it might be good to deprecate
>>>>>>>>>> some things in 2.0 without removing or replacing them immediately. That way
>>>>>>>>>> 2.0 doesn’t have to wait for everything that we want to deprecate to be
>>>>>>>>>> replaced all at once.
>>>>>>>>>>
>>>>>>>>>> Nick
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> On Thu, Nov 12, 2015 at 12:45 PM Ulanov, Alexander
>>>>>>>>>> <al...@hpe.com> wrote:
>>>>>>>>>>
>>>>>>>>>> Parameter Server is a new feature and thus does not match the goal
>>>>>>>>>> of 2.0 is “to fix things that are broken in the current API and remove
>>>>>>>>>> certain deprecated APIs”. At the same time I would be happy to have that
>>>>>>>>>> feature.
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> With regards to Machine learning, it would be great to move useful
>>>>>>>>>> features from MLlib to ML and deprecate the former. Current structure of two
>>>>>>>>>> separate machine learning packages seems to be somewhat confusing.
>>>>>>>>>>
>>>>>>>>>> With regards to GraphX, it would be great to deprecate the use of
>>>>>>>>>> RDD in GraphX and switch to Dataframe. This will allow GraphX evolve with
>>>>>>>>>> Tungsten.
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> Best regards, Alexander
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> From: Nan Zhu [mailto:zhunanmcgill@gmail.com]
>>>>>>>>>> Sent: Thursday, November 12, 2015 7:28 AM
>>>>>>>>>> To: witgo@qq.com
>>>>>>>>>> Cc: dev@spark.apache.org
>>>>>>>>>> Subject: Re: A proposal for Spark 2.0
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> Being specific to Parameter Server, I think the current agreement
>>>>>>>>>> is that PS shall exist as a third-party library instead of a component of
>>>>>>>>>> the core code base, isn’t?
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> Best,
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> --
>>>>>>>>>>
>>>>>>>>>> Nan Zhu
>>>>>>>>>>
>>>>>>>>>> http://codingcat.me
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> On Thursday, November 12, 2015 at 9:49 AM, witgo@qq.com wrote:
>>>>>>>>>>
>>>>>>>>>> Who has the idea of machine learning? Spark missing some features
>>>>>>>>>> for machine learning, For example, the parameter server.
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> On Nov 12, 2015, at 05:32, Matei Zaharia <ma...@gmail.com> wrote:
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> I like the idea of popping out Tachyon to an optional component too
>>>>>>>>>> to reduce the number of dependencies. In the future, it might even be useful
>>>>>>>>>> to do this for Hadoop, but it requires too many API changes to be worth
>>>>>>>>>> doing now.
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> Regarding Scala 2.12, we should definitely support it eventually,
>>>>>>>>>> but I don't think we need to block 2.0 on that because it can be added later
>>>>>>>>>> too. Has anyone investigated what it would take to run on there? I imagine
>>>>>>>>>> we don't need many code changes, just maybe some REPL stuff.
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> Needless to say, but I'm all for the idea of making "major"
>>>>>>>>>> releases as undisruptive as possible in the model Reynold proposed. Keeping
>>>>>>>>>> everyone working with the same set of releases is super important.
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> Matei
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> On Nov 11, 2015, at 4:58 AM, Sean Owen <so...@cloudera.com> wrote:
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> On Wed, Nov 11, 2015 at 12:10 AM, Reynold Xin <rx...@databricks.com>
>>>>>>>>>> wrote:
>>>>>>>>>>
>>>>>>>>>> to the Spark community. A major release should not be very
>>>>>>>>>> different from a
>>>>>>>>>>
>>>>>>>>>> minor release and should not be gated based on new features. The
>>>>>>>>>> main
>>>>>>>>>>
>>>>>>>>>> purpose of a major release is an opportunity to fix things that are
>>>>>>>>>> broken
>>>>>>>>>>
>>>>>>>>>> in the current API and remove certain deprecated APIs (examples
>>>>>>>>>> follow).
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> Agree with this stance. Generally, a major release might also be a
>>>>>>>>>>
>>>>>>>>>> time to replace some big old API or implementation with a new one,
>>>>>>>>>> but
>>>>>>>>>>
>>>>>>>>>> I don't see obvious candidates.
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> I wouldn't mind turning attention to 2.x sooner than later, unless
>>>>>>>>>>
>>>>>>>>>> there's a fairly good reason to continue adding features in 1.x to
>>>>>>>>>> a
>>>>>>>>>>
>>>>>>>>>> 1.7 release. The scope as of 1.6 is already pretty darned big.
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> 1. Scala 2.11 as the default build. We should still support Scala
>>>>>>>>>> 2.10, but
>>>>>>>>>>
>>>>>>>>>> it has been end-of-life.
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> By the time 2.x rolls around, 2.12 will be the main version, 2.11
>>>>>>>>>> will
>>>>>>>>>>
>>>>>>>>>> be quite stable, and 2.10 will have been EOL for a while. I'd
>>>>>>>>>> propose
>>>>>>>>>>
>>>>>>>>>> dropping 2.10. Otherwise it's supported for 2 more years.
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> 2. Remove Hadoop 1 support.
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> I'd go further to drop support for <2.2 for sure (2.0 and 2.1 were
>>>>>>>>>>
>>>>>>>>>> sort of 'alpha' and 'beta' releases) and even <2.6.
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> I'm sure we'll think of a number of other small things -- shading a
>>>>>>>>>>
>>>>>>>>>> bunch of stuff? reviewing and updating dependencies in light of
>>>>>>>>>>
>>>>>>>>>> simpler, more recent dependencies to support from Hadoop etc?
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> Farming out Tachyon to a module? (I felt like someone proposed
>>>>>>>>>> this?)
>>>>>>>>>>
>>>>>>>>>> Pop out any Docker stuff to another repo?
>>>>>>>>>>
>>>>>>>>>> Continue that same effort for EC2?
>>>>>>>>>>
>>>>>>>>>> Farming out some of the "external" integrations to another repo (?
>>>>>>>>>>
>>>>>>>>>> controversial)
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> See also anything marked version "2+" in JIRA.
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> ---------------------------------------------------------------------
>>>>>>>>>>
>>>>>>>>>> To unsubscribe, e-mail: dev-unsubscribe@spark.apache.org
>>>>>>>>>>
>>>>>>>>>> For additional commands, e-mail: dev-help@spark.apache.org
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> ---------------------------------------------------------------------
>>>>>>>>>>
>>>>>>>>>> To unsubscribe, e-mail: dev-unsubscribe@spark.apache.org
>>>>>>>>>>
>>>>>>>>>> For additional commands, e-mail: dev-help@spark.apache.org
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> ---------------------------------------------------------------------
>>>>>>>>>>
>>>>>>>>>> To unsubscribe, e-mail: dev-unsubscribe@spark.apache.org
>>>>>>>>>>
>>>>>>>>>> For additional commands, e-mail: dev-help@spark.apache.org
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>
>>>>>
>>>>
>>>>
>>>
>>

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@spark.apache.org
For additional commands, e-mail: dev-help@spark.apache.org


Re: A proposal for Spark 2.0

Posted by Sean Owen <so...@cloudera.com>.
Maintaining both a 1.7 and 2.0 is too much work for the project, which
is over-stretched now. This means that after 1.6 it's just small
maintenance releases in 1.x and no substantial features or evolution.
This means that the "in progress" APIs in 1.x will stay that way,
unless one updates to 2.x. It's not unreasonable, but means the update
to the 2.x line isn't going to be that optional for users.

Scala 2.10 is already EOL right? Supporting it in 2.x means supporting
it for a couple years, note. 2.10 is still used today, but that's the
point of the current stable 1.x release in general: if you want to
stick to current dependencies, stick to the current release. Although
I think that's the right way to think about support across major
versions in general, I can see that 2.x is more of a required update
for those following the project's fixes and releases. Hence may indeed
be important to just keep supporting 2.10.

I can't see supporting 2.12 at the same time (right?). Is that a
concern? it will be long since GA by the time 2.x is first released.

There's another fairly coherent worldview where development continues
in 1.7 and focuses on finishing the loose ends and lots of bug fixing.
2.0 is delayed somewhat into next year, and by that time supporting
2.11+2.12 and Java 8 looks more feasible and more in tune with
currently deployed versions.

I can't say I have a strong view but I personally hadn't imagined 2.x
would start now.


On Thu, Nov 26, 2015 at 7:00 AM, Reynold Xin <rx...@databricks.com> wrote:
> I don't think we should drop support for Scala 2.10, or make it harder in
> terms of operations for people to upgrade.
>
> If there are further objections, I'm going to bump remove the 1.7 version
> and retarget things to 2.0 on JIRA.
>
>
> On Wed, Nov 25, 2015 at 12:54 AM, Sandy Ryza <sa...@cloudera.com>
> wrote:
>>
>> I see.  My concern is / was that cluster operators will be reluctant to
>> upgrade to 2.0, meaning that developers using those clusters need to stay on
>> 1.x, and, if they want to move to DataFrames, essentially need to port their
>> app twice.
>>
>> I misunderstood and thought part of the proposal was to drop support for
>> 2.10 though.  If your broad point is that there aren't changes in 2.0 that
>> will make it less palatable to cluster administrators than releases in the
>> 1.x line, then yes, 2.0 as the next release sounds fine to me.
>>
>> -Sandy
>>
>>
>> On Tue, Nov 24, 2015 at 11:55 AM, Matei Zaharia <ma...@gmail.com>
>> wrote:
>>>
>>> What are the other breaking changes in 2.0 though? Note that we're not
>>> removing Scala 2.10, we're just making the default build be against Scala
>>> 2.11 instead of 2.10. There seem to be very few changes that people would
>>> worry about. If people are going to update their apps, I think it's better
>>> to make the other small changes in 2.0 at the same time than to update once
>>> for Dataset and another time for 2.0.
>>>
>>> BTW just refer to Reynold's original post for the other proposed API
>>> changes.
>>>
>>> Matei
>>>
>>> On Nov 24, 2015, at 12:27 PM, Sandy Ryza <sa...@cloudera.com> wrote:
>>>
>>> I think that Kostas' logic still holds.  The majority of Spark users, and
>>> likely an even vaster majority of people running vaster jobs, are still on
>>> RDDs and on the cusp of upgrading to DataFrames.  Users will probably want
>>> to upgrade to the stable version of the Dataset / DataFrame API so they
>>> don't need to do so twice.  Requiring that they absorb all the other ways
>>> that Spark breaks compatibility in the move to 2.0 makes it much more
>>> difficult for them to make this transition.
>>>
>>> Using the same set of APIs also means that it will be easier to backport
>>> critical fixes to the 1.x line.
>>>
>>> It's not clear to me that avoiding breakage of an experimental API in the
>>> 1.x line outweighs these issues.
>>>
>>> -Sandy
>>>
>>> On Mon, Nov 23, 2015 at 10:51 PM, Reynold Xin <rx...@databricks.com>
>>> wrote:
>>>>
>>>> I actually think the next one (after 1.6) should be Spark 2.0. The
>>>> reason is that I already know we have to break some part of the
>>>> DataFrame/Dataset API as part of the Dataset design. (e.g. DataFrame.map
>>>> should return Dataset rather than RDD). In that case, I'd rather break this
>>>> sooner (in one release) than later (in two releases). so the damage is
>>>> smaller.
>>>>
>>>> I don't think whether we call Dataset/DataFrame experimental or not
>>>> matters too much for 2.0. We can still call Dataset experimental in 2.0 and
>>>> then mark them as stable in 2.1. Despite being "experimental", there has
>>>> been no breaking changes to DataFrame from 1.3 to 1.6.
>>>>
>>>>
>>>>
>>>> On Wed, Nov 18, 2015 at 3:43 PM, Mark Hamstra <ma...@clearstorydata.com>
>>>> wrote:
>>>>>
>>>>> Ah, got it; by "stabilize" you meant changing the API, not just bug
>>>>> fixing.  We're on the same page now.
>>>>>
>>>>> On Wed, Nov 18, 2015 at 3:39 PM, Kostas Sakellis <ko...@cloudera.com>
>>>>> wrote:
>>>>>>
>>>>>> A 1.6.x release will only fix bugs - we typically don't change APIs in
>>>>>> z releases. The Dataset API is experimental and so we might be changing the
>>>>>> APIs before we declare it stable. This is why I think it is important to
>>>>>> first stabilize the Dataset API with a Spark 1.7 release before moving to
>>>>>> Spark 2.0. This will benefit users that would like to use the new Dataset
>>>>>> APIs but can't move to Spark 2.0 because of the backwards incompatible
>>>>>> changes, like removal of deprecated APIs, Scala 2.11 etc.
>>>>>>
>>>>>> Kostas
>>>>>>
>>>>>>
>>>>>> On Fri, Nov 13, 2015 at 12:26 PM, Mark Hamstra
>>>>>> <ma...@clearstorydata.com> wrote:
>>>>>>>
>>>>>>> Why does stabilization of those two features require a 1.7 release
>>>>>>> instead of 1.6.1?
>>>>>>>
>>>>>>> On Fri, Nov 13, 2015 at 11:40 AM, Kostas Sakellis
>>>>>>> <ko...@cloudera.com> wrote:
>>>>>>>>
>>>>>>>> We have veered off the topic of Spark 2.0 a little bit here - yes we
>>>>>>>> can talk about RDD vs. DS/DF more but lets refocus on Spark 2.0. I'd like to
>>>>>>>> propose we have one more 1.x release after Spark 1.6. This will allow us to
>>>>>>>> stabilize a few of the new features that were added in 1.6:
>>>>>>>>
>>>>>>>> 1) the experimental Datasets API
>>>>>>>> 2) the new unified memory manager.
>>>>>>>>
>>>>>>>> I understand our goal for Spark 2.0 is to offer an easy transition
>>>>>>>> but there will be users that won't be able to seamlessly upgrade given what
>>>>>>>> we have discussed as in scope for 2.0. For these users, having a 1.x release
>>>>>>>> with these new features/APIs stabilized will be very beneficial. This might
>>>>>>>> make Spark 1.7 a lighter release but that is not necessarily a bad thing.
>>>>>>>>
>>>>>>>> Any thoughts on this timeline?
>>>>>>>>
>>>>>>>> Kostas Sakellis
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> On Thu, Nov 12, 2015 at 8:39 PM, Cheng, Hao <ha...@intel.com>
>>>>>>>> wrote:
>>>>>>>>>
>>>>>>>>> Agree, more features/apis/optimization need to be added in DF/DS.
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> I mean, we need to think about what kind of RDD APIs we have to
>>>>>>>>> provide to developer, maybe the fundamental API is enough, like, the
>>>>>>>>> ShuffledRDD etc..  But PairRDDFunctions probably not in this category, as we
>>>>>>>>> can do the same thing easily with DF/DS, even better performance.
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> From: Mark Hamstra [mailto:mark@clearstorydata.com]
>>>>>>>>> Sent: Friday, November 13, 2015 11:23 AM
>>>>>>>>> To: Stephen Boesch
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> Cc: dev@spark.apache.org
>>>>>>>>> Subject: Re: A proposal for Spark 2.0
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> Hmmm... to me, that seems like precisely the kind of thing that
>>>>>>>>> argues for retaining the RDD API but not as the first thing presented to new
>>>>>>>>> Spark developers: "Here's how to use groupBy with DataFrames.... Until the
>>>>>>>>> optimizer is more fully developed, that won't always get you the best
>>>>>>>>> performance that can be obtained.  In these particular circumstances, ...,
>>>>>>>>> you may want to use the low-level RDD API while setting
>>>>>>>>> preservesPartitioning to true.  Like this...."
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On Thu, Nov 12, 2015 at 7:05 PM, Stephen Boesch <ja...@gmail.com>
>>>>>>>>> wrote:
>>>>>>>>>
>>>>>>>>> My understanding is that  the RDD's presently have more support for
>>>>>>>>> complete control of partitioning which is a key consideration at scale.
>>>>>>>>> While partitioning control is still piecemeal in  DF/DS  it would seem
>>>>>>>>> premature to make RDD's a second-tier approach to spark dev.
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> An example is the use of groupBy when we know that the source
>>>>>>>>> relation (/RDD) is already partitioned on the grouping expressions.  AFAIK
>>>>>>>>> the spark sql still does not allow that knowledge to be applied to the
>>>>>>>>> optimizer - so a full shuffle will be performed. However in the native RDD
>>>>>>>>> we can use preservesPartitioning=true.
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> 2015-11-12 17:42 GMT-08:00 Mark Hamstra <ma...@clearstorydata.com>:
>>>>>>>>>
>>>>>>>>> The place of the RDD API in 2.0 is also something I've been
>>>>>>>>> wondering about.  I think it may be going too far to deprecate it, but
>>>>>>>>> changing emphasis is something that we might consider.  The RDD API came
>>>>>>>>> well before DataFrames and DataSets, so programming guides, introductory
>>>>>>>>> how-to articles and the like have, to this point, also tended to emphasize
>>>>>>>>> RDDs -- or at least to deal with them early.  What I'm thinking is that with
>>>>>>>>> 2.0 maybe we should overhaul all the documentation to de-emphasize and
>>>>>>>>> reposition RDDs.  In this scheme, DataFrames and DataSets would be
>>>>>>>>> introduced and fully addressed before RDDs.  They would be presented as the
>>>>>>>>> normal/default/standard way to do things in Spark.  RDDs, in contrast, would
>>>>>>>>> be presented later as a kind of lower-level, closer-to-the-metal API that
>>>>>>>>> can be used in atypical, more specialized contexts where DataFrames or
>>>>>>>>> DataSets don't fully fit.
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On Thu, Nov 12, 2015 at 5:17 PM, Cheng, Hao <ha...@intel.com>
>>>>>>>>> wrote:
>>>>>>>>>
>>>>>>>>> I am not sure what the best practice for this specific problem, but
>>>>>>>>> it’s really worth to think about it in 2.0, as it is a painful issue for
>>>>>>>>> lots of users.
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> By the way, is it also an opportunity to deprecate the RDD API (or
>>>>>>>>> internal API only?)? As lots of its functionality overlapping with DataFrame
>>>>>>>>> or DataSet.
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> Hao
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> From: Kostas Sakellis [mailto:kostas@cloudera.com]
>>>>>>>>> Sent: Friday, November 13, 2015 5:27 AM
>>>>>>>>> To: Nicholas Chammas
>>>>>>>>> Cc: Ulanov, Alexander; Nan Zhu; witgo@qq.com; dev@spark.apache.org;
>>>>>>>>> Reynold Xin
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> Subject: Re: A proposal for Spark 2.0
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> I know we want to keep breaking changes to a minimum but I'm hoping
>>>>>>>>> that with Spark 2.0 we can also look at better classpath isolation with user
>>>>>>>>> programs. I propose we build on spark.{driver|executor}.userClassPathFirst,
>>>>>>>>> setting it true by default, and not allow any spark transitive dependencies
>>>>>>>>> to leak into user code. For backwards compatibility we can have a whitelist
>>>>>>>>> if we want but I'd be good if we start requiring user apps to explicitly
>>>>>>>>> pull in all their dependencies. From what I can tell, Hadoop 3 is also
>>>>>>>>> moving in this direction.
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> Kostas
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On Thu, Nov 12, 2015 at 9:56 AM, Nicholas Chammas
>>>>>>>>> <ni...@gmail.com> wrote:
>>>>>>>>>
>>>>>>>>> With regards to Machine learning, it would be great to move useful
>>>>>>>>> features from MLlib to ML and deprecate the former. Current structure of two
>>>>>>>>> separate machine learning packages seems to be somewhat confusing.
>>>>>>>>>
>>>>>>>>> With regards to GraphX, it would be great to deprecate the use of
>>>>>>>>> RDD in GraphX and switch to Dataframe. This will allow GraphX evolve with
>>>>>>>>> Tungsten.
>>>>>>>>>
>>>>>>>>> On that note of deprecating stuff, it might be good to deprecate
>>>>>>>>> some things in 2.0 without removing or replacing them immediately. That way
>>>>>>>>> 2.0 doesn’t have to wait for everything that we want to deprecate to be
>>>>>>>>> replaced all at once.
>>>>>>>>>
>>>>>>>>> Nick
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On Thu, Nov 12, 2015 at 12:45 PM Ulanov, Alexander
>>>>>>>>> <al...@hpe.com> wrote:
>>>>>>>>>
>>>>>>>>> Parameter Server is a new feature and thus does not match the goal
>>>>>>>>> of 2.0 is “to fix things that are broken in the current API and remove
>>>>>>>>> certain deprecated APIs”. At the same time I would be happy to have that
>>>>>>>>> feature.
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> With regards to Machine learning, it would be great to move useful
>>>>>>>>> features from MLlib to ML and deprecate the former. Current structure of two
>>>>>>>>> separate machine learning packages seems to be somewhat confusing.
>>>>>>>>>
>>>>>>>>> With regards to GraphX, it would be great to deprecate the use of
>>>>>>>>> RDD in GraphX and switch to Dataframe. This will allow GraphX evolve with
>>>>>>>>> Tungsten.
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> Best regards, Alexander
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> From: Nan Zhu [mailto:zhunanmcgill@gmail.com]
>>>>>>>>> Sent: Thursday, November 12, 2015 7:28 AM
>>>>>>>>> To: witgo@qq.com
>>>>>>>>> Cc: dev@spark.apache.org
>>>>>>>>> Subject: Re: A proposal for Spark 2.0
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> Being specific to Parameter Server, I think the current agreement
>>>>>>>>> is that PS shall exist as a third-party library instead of a component of
>>>>>>>>> the core code base, isn’t?
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> Best,
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> --
>>>>>>>>>
>>>>>>>>> Nan Zhu
>>>>>>>>>
>>>>>>>>> http://codingcat.me
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On Thursday, November 12, 2015 at 9:49 AM, witgo@qq.com wrote:
>>>>>>>>>
>>>>>>>>> Who has the idea of machine learning? Spark missing some features
>>>>>>>>> for machine learning, For example, the parameter server.
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On Nov 12, 2015, at 05:32, Matei Zaharia <ma...@gmail.com> wrote:
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> I like the idea of popping out Tachyon to an optional component too
>>>>>>>>> to reduce the number of dependencies. In the future, it might even be useful
>>>>>>>>> to do this for Hadoop, but it requires too many API changes to be worth
>>>>>>>>> doing now.
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> Regarding Scala 2.12, we should definitely support it eventually,
>>>>>>>>> but I don't think we need to block 2.0 on that because it can be added later
>>>>>>>>> too. Has anyone investigated what it would take to run on there? I imagine
>>>>>>>>> we don't need many code changes, just maybe some REPL stuff.
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> Needless to say, but I'm all for the idea of making "major"
>>>>>>>>> releases as undisruptive as possible in the model Reynold proposed. Keeping
>>>>>>>>> everyone working with the same set of releases is super important.
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> Matei
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On Nov 11, 2015, at 4:58 AM, Sean Owen <so...@cloudera.com> wrote:
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On Wed, Nov 11, 2015 at 12:10 AM, Reynold Xin <rx...@databricks.com>
>>>>>>>>> wrote:
>>>>>>>>>
>>>>>>>>> to the Spark community. A major release should not be very
>>>>>>>>> different from a
>>>>>>>>>
>>>>>>>>> minor release and should not be gated based on new features. The
>>>>>>>>> main
>>>>>>>>>
>>>>>>>>> purpose of a major release is an opportunity to fix things that are
>>>>>>>>> broken
>>>>>>>>>
>>>>>>>>> in the current API and remove certain deprecated APIs (examples
>>>>>>>>> follow).
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> Agree with this stance. Generally, a major release might also be a
>>>>>>>>>
>>>>>>>>> time to replace some big old API or implementation with a new one,
>>>>>>>>> but
>>>>>>>>>
>>>>>>>>> I don't see obvious candidates.
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> I wouldn't mind turning attention to 2.x sooner than later, unless
>>>>>>>>>
>>>>>>>>> there's a fairly good reason to continue adding features in 1.x to
>>>>>>>>> a
>>>>>>>>>
>>>>>>>>> 1.7 release. The scope as of 1.6 is already pretty darned big.
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> 1. Scala 2.11 as the default build. We should still support Scala
>>>>>>>>> 2.10, but
>>>>>>>>>
>>>>>>>>> it has been end-of-life.
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> By the time 2.x rolls around, 2.12 will be the main version, 2.11
>>>>>>>>> will
>>>>>>>>>
>>>>>>>>> be quite stable, and 2.10 will have been EOL for a while. I'd
>>>>>>>>> propose
>>>>>>>>>
>>>>>>>>> dropping 2.10. Otherwise it's supported for 2 more years.
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> 2. Remove Hadoop 1 support.
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> I'd go further to drop support for <2.2 for sure (2.0 and 2.1 were
>>>>>>>>>
>>>>>>>>> sort of 'alpha' and 'beta' releases) and even <2.6.
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> I'm sure we'll think of a number of other small things -- shading a
>>>>>>>>>
>>>>>>>>> bunch of stuff? reviewing and updating dependencies in light of
>>>>>>>>>
>>>>>>>>> simpler, more recent dependencies to support from Hadoop etc?
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> Farming out Tachyon to a module? (I felt like someone proposed
>>>>>>>>> this?)
>>>>>>>>>
>>>>>>>>> Pop out any Docker stuff to another repo?
>>>>>>>>>
>>>>>>>>> Continue that same effort for EC2?
>>>>>>>>>
>>>>>>>>> Farming out some of the "external" integrations to another repo (?
>>>>>>>>>
>>>>>>>>> controversial)
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> See also anything marked version "2+" in JIRA.
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> ---------------------------------------------------------------------
>>>>>>>>>
>>>>>>>>> To unsubscribe, e-mail: dev-unsubscribe@spark.apache.org
>>>>>>>>>
>>>>>>>>> For additional commands, e-mail: dev-help@spark.apache.org
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> ---------------------------------------------------------------------
>>>>>>>>>
>>>>>>>>> To unsubscribe, e-mail: dev-unsubscribe@spark.apache.org
>>>>>>>>>
>>>>>>>>> For additional commands, e-mail: dev-help@spark.apache.org
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> ---------------------------------------------------------------------
>>>>>>>>>
>>>>>>>>> To unsubscribe, e-mail: dev-unsubscribe@spark.apache.org
>>>>>>>>>
>>>>>>>>> For additional commands, e-mail: dev-help@spark.apache.org
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>
>>>>>
>>>>
>>>
>>>
>>
>

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@spark.apache.org
For additional commands, e-mail: dev-help@spark.apache.org


Re: A proposal for Spark 2.0

Posted by Reynold Xin <rx...@databricks.com>.
I don't think we should drop support for Scala 2.10, or make it harder in
terms of operations for people to upgrade.

If there are further objections, I'm going to bump remove the 1.7 version
and retarget things to 2.0 on JIRA.


On Wed, Nov 25, 2015 at 12:54 AM, Sandy Ryza <sa...@cloudera.com>
wrote:

> I see.  My concern is / was that cluster operators will be reluctant to
> upgrade to 2.0, meaning that developers using those clusters need to stay
> on 1.x, and, if they want to move to DataFrames, essentially need to port
> their app twice.
>
> I misunderstood and thought part of the proposal was to drop support for
> 2.10 though.  If your broad point is that there aren't changes in 2.0 that
> will make it less palatable to cluster administrators than releases in the
> 1.x line, then yes, 2.0 as the next release sounds fine to me.
>
> -Sandy
>
>
> On Tue, Nov 24, 2015 at 11:55 AM, Matei Zaharia <ma...@gmail.com>
> wrote:
>
>> What are the other breaking changes in 2.0 though? Note that we're not
>> removing Scala 2.10, we're just making the default build be against Scala
>> 2.11 instead of 2.10. There seem to be very few changes that people would
>> worry about. If people are going to update their apps, I think it's better
>> to make the other small changes in 2.0 at the same time than to update once
>> for Dataset and another time for 2.0.
>>
>> BTW just refer to Reynold's original post for the other proposed API
>> changes.
>>
>> Matei
>>
>> On Nov 24, 2015, at 12:27 PM, Sandy Ryza <sa...@cloudera.com> wrote:
>>
>> I think that Kostas' logic still holds.  The majority of Spark users, and
>> likely an even vaster majority of people running vaster jobs, are still on
>> RDDs and on the cusp of upgrading to DataFrames.  Users will probably want
>> to upgrade to the stable version of the Dataset / DataFrame API so they
>> don't need to do so twice.  Requiring that they absorb all the other ways
>> that Spark breaks compatibility in the move to 2.0 makes it much more
>> difficult for them to make this transition.
>>
>> Using the same set of APIs also means that it will be easier to backport
>> critical fixes to the 1.x line.
>>
>> It's not clear to me that avoiding breakage of an experimental API in the
>> 1.x line outweighs these issues.
>>
>> -Sandy
>>
>> On Mon, Nov 23, 2015 at 10:51 PM, Reynold Xin <rx...@databricks.com>
>> wrote:
>>
>>> I actually think the next one (after 1.6) should be Spark 2.0. The
>>> reason is that I already know we have to break some part of the
>>> DataFrame/Dataset API as part of the Dataset design. (e.g. DataFrame.map
>>> should return Dataset rather than RDD). In that case, I'd rather break this
>>> sooner (in one release) than later (in two releases). so the damage is
>>> smaller.
>>>
>>> I don't think whether we call Dataset/DataFrame experimental or not
>>> matters too much for 2.0. We can still call Dataset experimental in 2.0 and
>>> then mark them as stable in 2.1. Despite being "experimental", there has
>>> been no breaking changes to DataFrame from 1.3 to 1.6.
>>>
>>>
>>>
>>> On Wed, Nov 18, 2015 at 3:43 PM, Mark Hamstra <ma...@clearstorydata.com>
>>> wrote:
>>>
>>>> Ah, got it; by "stabilize" you meant changing the API, not just bug
>>>> fixing.  We're on the same page now.
>>>>
>>>> On Wed, Nov 18, 2015 at 3:39 PM, Kostas Sakellis <ko...@cloudera.com>
>>>> wrote:
>>>>
>>>>> A 1.6.x release will only fix bugs - we typically don't change APIs in
>>>>> z releases. The Dataset API is experimental and so we might be changing the
>>>>> APIs before we declare it stable. This is why I think it is important to
>>>>> first stabilize the Dataset API with a Spark 1.7 release before moving to
>>>>> Spark 2.0. This will benefit users that would like to use the new Dataset
>>>>> APIs but can't move to Spark 2.0 because of the backwards incompatible
>>>>> changes, like removal of deprecated APIs, Scala 2.11 etc.
>>>>>
>>>>> Kostas
>>>>>
>>>>>
>>>>> On Fri, Nov 13, 2015 at 12:26 PM, Mark Hamstra <
>>>>> mark@clearstorydata.com> wrote:
>>>>>
>>>>>> Why does stabilization of those two features require a 1.7 release
>>>>>> instead of 1.6.1?
>>>>>>
>>>>>> On Fri, Nov 13, 2015 at 11:40 AM, Kostas Sakellis <
>>>>>> kostas@cloudera.com> wrote:
>>>>>>
>>>>>>> We have veered off the topic of Spark 2.0 a little bit here - yes we
>>>>>>> can talk about RDD vs. DS/DF more but lets refocus on Spark 2.0. I'd like
>>>>>>> to propose we have one more 1.x release after Spark 1.6. This will allow us
>>>>>>> to stabilize a few of the new features that were added in 1.6:
>>>>>>>
>>>>>>> 1) the experimental Datasets API
>>>>>>> 2) the new unified memory manager.
>>>>>>>
>>>>>>> I understand our goal for Spark 2.0 is to offer an easy transition
>>>>>>> but there will be users that won't be able to seamlessly upgrade given what
>>>>>>> we have discussed as in scope for 2.0. For these users, having a 1.x
>>>>>>> release with these new features/APIs stabilized will be very beneficial.
>>>>>>> This might make Spark 1.7 a lighter release but that is not necessarily a
>>>>>>> bad thing.
>>>>>>>
>>>>>>> Any thoughts on this timeline?
>>>>>>>
>>>>>>> Kostas Sakellis
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> On Thu, Nov 12, 2015 at 8:39 PM, Cheng, Hao <ha...@intel.com>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> Agree, more features/apis/optimization need to be added in DF/DS.
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> I mean, we need to think about what kind of RDD APIs we have to
>>>>>>>> provide to developer, maybe the fundamental API is enough, like, the
>>>>>>>> ShuffledRDD etc..  But PairRDDFunctions probably not in this category, as
>>>>>>>> we can do the same thing easily with DF/DS, even better performance.
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> *From:* Mark Hamstra [mailto:mark@clearstorydata.com]
>>>>>>>> *Sent:* Friday, November 13, 2015 11:23 AM
>>>>>>>> *To:* Stephen Boesch
>>>>>>>>
>>>>>>>> *Cc:* dev@spark.apache.org
>>>>>>>> *Subject:* Re: A proposal for Spark 2.0
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> Hmmm... to me, that seems like precisely the kind of thing that
>>>>>>>> argues for retaining the RDD API but not as the first thing presented to
>>>>>>>> new Spark developers: "Here's how to use groupBy with DataFrames.... Until
>>>>>>>> the optimizer is more fully developed, that won't always get you the best
>>>>>>>> performance that can be obtained.  In these particular circumstances, ...,
>>>>>>>> you may want to use the low-level RDD API while setting
>>>>>>>> preservesPartitioning to true.  Like this...."
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> On Thu, Nov 12, 2015 at 7:05 PM, Stephen Boesch <ja...@gmail.com>
>>>>>>>> wrote:
>>>>>>>>
>>>>>>>> My understanding is that  the RDD's presently have more support for
>>>>>>>> complete control of partitioning which is a key consideration at scale.
>>>>>>>> While partitioning control is still piecemeal in  DF/DS  it would seem
>>>>>>>> premature to make RDD's a second-tier approach to spark dev.
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> An example is the use of groupBy when we know that the source
>>>>>>>> relation (/RDD) is already partitioned on the grouping expressions.  AFAIK
>>>>>>>> the spark sql still does not allow that knowledge to be applied to the
>>>>>>>> optimizer - so a full shuffle will be performed. However in the native RDD
>>>>>>>> we can use preservesPartitioning=true.
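
For concreteness, a minimal RDD-side sketch of the pattern described above
(the data, key names, and partition count are made up): once a pair RDD
carries a partitioner on the grouping key, a transformation flagged with
preservesPartitioning = true keeps that partitioner, and the following
aggregation can reuse it instead of shuffling again.

    import org.apache.spark.{HashPartitioner, SparkConf, SparkContext}

    object PreservePartitioningSketch {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(
          new SparkConf().setAppName("preserve-partitioning").setMaster("local[*]"))

        // Pair RDD keyed by the grouping expression, partitioned up front.
        val byKey = sc.parallelize(Seq(("a", 1), ("b", 2), ("a", 3)))
          .partitionBy(new HashPartitioner(8))

        // mapPartitions would normally discard the partitioner; asserting that
        // keys are unchanged via preservesPartitioning = true keeps it.
        val scaled = byKey.mapPartitions(
          iter => iter.map { case (k, v) => (k, v * 10) },
          preservesPartitioning = true)

        // scaled still carries the HashPartitioner, so this aggregation reuses
        // the existing partitioning rather than shuffling a second time.
        val summed = scaled.reduceByKey(_ + _)
        summed.collect().foreach(println)

        sc.stop()
      }
    }
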
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> 2015-11-12 17:42 GMT-08:00 Mark Hamstra <ma...@clearstorydata.com>:
>>>>>>>>
>>>>>>>> The place of the RDD API in 2.0 is also something I've been
>>>>>>>> wondering about.  I think it may be going too far to deprecate it, but
>>>>>>>> changing emphasis is something that we might consider.  The RDD API came
>>>>>>>> well before DataFrames and DataSets, so programming guides, introductory
>>>>>>>> how-to articles and the like have, to this point, also tended to emphasize
>>>>>>>> RDDs -- or at least to deal with them early.  What I'm thinking is that
>>>>>>>> with 2.0 maybe we should overhaul all the documentation to de-emphasize and
>>>>>>>> reposition RDDs.  In this scheme, DataFrames and DataSets would be
>>>>>>>> introduced and fully addressed before RDDs.  They would be presented as the
>>>>>>>> normal/default/standard way to do things in Spark.  RDDs, in contrast,
>>>>>>>> would be presented later as a kind of lower-level, closer-to-the-metal API
>>>>>>>> that can be used in atypical, more specialized contexts where DataFrames or
>>>>>>>> DataSets don't fully fit.
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> On Thu, Nov 12, 2015 at 5:17 PM, Cheng, Hao <ha...@intel.com>
>>>>>>>> wrote:
>>>>>>>>
>>>>>>>> I am not sure what the best practice for this specific problem, but
>>>>>>>> it’s really worth to think about it in 2.0, as it is a painful issue for
>>>>>>>> lots of users.
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> By the way, is it also an opportunity to deprecate the RDD API (or
>>>>>>>> internal API only?)? As lots of its functionality overlapping with
>>>>>>>> DataFrame or DataSet.
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> Hao
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> *From:* Kostas Sakellis [mailto:kostas@cloudera.com]
>>>>>>>> *Sent:* Friday, November 13, 2015 5:27 AM
>>>>>>>> *To:* Nicholas Chammas
>>>>>>>> *Cc:* Ulanov, Alexander; Nan Zhu; witgo@qq.com;
>>>>>>>> dev@spark.apache.org; Reynold Xin
>>>>>>>>
>>>>>>>>
>>>>>>>> *Subject:* Re: A proposal for Spark 2.0
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> I know we want to keep breaking changes to a minimum but I'm hoping
>>>>>>>> that with Spark 2.0 we can also look at better classpath isolation with
>>>>>>>> user programs. I propose we build on
>>>>>>>> spark.{driver|executor}.userClassPathFirst, setting it true by default, and
>>>>>>>> not allow any spark transitive dependencies to leak into user code. For
>>>>>>>> backwards compatibility we can have a whitelist if we want but I'd be good
>>>>>>>> if we start requiring user apps to explicitly pull in all their
>>>>>>>> dependencies. From what I can tell, Hadoop 3 is also moving in this
>>>>>>>> direction.
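
As a rough sketch of what this looks like with the opt-in knobs that already
exist in 1.x (both keys default to false today; the proposal above is
essentially to flip that default and stop Spark's transitive dependencies
from leaking through):

    import org.apache.spark.{SparkConf, SparkContext}

    object ClasspathIsolationSketch {
      def main(args: Array[String]): Unit = {
        // Opt in to user-classpath-first resolution on both the driver and the
        // executors for this one application.
        val conf = new SparkConf()
          .setAppName("classpath-isolation-sketch")
          .set("spark.driver.userClassPathFirst", "true")
          .set("spark.executor.userClassPathFirst", "true")

        val sc = new SparkContext(conf)
        // ... application code that bundles its own dependency versions ...
        sc.stop()
      }
    }
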
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> Kostas
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> On Thu, Nov 12, 2015 at 9:56 AM, Nicholas Chammas <
>>>>>>>> nicholas.chammas@gmail.com> wrote:
>>>>>>>>
>>>>>>>> With regards to Machine learning, it would be great to move useful
>>>>>>>> features from MLlib to ML and deprecate the former. Current structure of
>>>>>>>> two separate machine learning packages seems to be somewhat confusing.
>>>>>>>>
>>>>>>>> With regards to GraphX, it would be great to deprecate the use of
>>>>>>>> RDD in GraphX and switch to Dataframe. This will allow GraphX evolve with
>>>>>>>> Tungsten.
>>>>>>>>
>>>>>>>> On that note of deprecating stuff, it might be good to deprecate
>>>>>>>> some things in 2.0 without removing or replacing them immediately. That way
>>>>>>>> 2.0 doesn’t have to wait for everything that we want to deprecate to be
>>>>>>>> replaced all at once.
>>>>>>>>
>>>>>>>> Nick
>>>>>>>>
>>>>>>>> ​
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> On Thu, Nov 12, 2015 at 12:45 PM Ulanov, Alexander <
>>>>>>>> alexander.ulanov@hpe.com> wrote:
>>>>>>>>
>>>>>>>> Parameter Server is a new feature and thus does not match the goal
>>>>>>>> of 2.0 is “to fix things that are broken in the current API and remove
>>>>>>>> certain deprecated APIs”. At the same time I would be happy to have that
>>>>>>>> feature.
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> With regards to Machine learning, it would be great to move useful
>>>>>>>> features from MLlib to ML and deprecate the former. Current structure of
>>>>>>>> two separate machine learning packages seems to be somewhat confusing.
>>>>>>>>
>>>>>>>> With regards to GraphX, it would be great to deprecate the use of
>>>>>>>> RDD in GraphX and switch to Dataframe. This will allow GraphX evolve with
>>>>>>>> Tungsten.
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> Best regards, Alexander
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> *From:* Nan Zhu [mailto:zhunanmcgill@gmail.com]
>>>>>>>> *Sent:* Thursday, November 12, 2015 7:28 AM
>>>>>>>> *To:* witgo@qq.com
>>>>>>>> *Cc:* dev@spark.apache.org
>>>>>>>> *Subject:* Re: A proposal for Spark 2.0
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> Being specific to Parameter Server, I think the current agreement
>>>>>>>> is that PS shall exist as a third-party library instead of a component of
>>>>>>>> the core code base, isn’t?
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> Best,
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> --
>>>>>>>>
>>>>>>>> Nan Zhu
>>>>>>>>
>>>>>>>> http://codingcat.me
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> On Thursday, November 12, 2015 at 9:49 AM, witgo@qq.com wrote:
>>>>>>>>
>>>>>>>> Who has the idea of machine learning? Spark missing some features
>>>>>>>> for machine learning, For example, the parameter server.
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> 在 2015年11月12日,05:32,Matei Zaharia <ma...@gmail.com> 写道:
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> I like the idea of popping out Tachyon to an optional component too
>>>>>>>> to reduce the number of dependencies. In the future, it might even be
>>>>>>>> useful to do this for Hadoop, but it requires too many API changes to be
>>>>>>>> worth doing now.
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> Regarding Scala 2.12, we should definitely support it eventually,
>>>>>>>> but I don't think we need to block 2.0 on that because it can be added
>>>>>>>> later too. Has anyone investigated what it would take to run on there? I
>>>>>>>> imagine we don't need many code changes, just maybe some REPL stuff.
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> Needless to say, but I'm all for the idea of making "major"
>>>>>>>> releases as undisruptive as possible in the model Reynold proposed. Keeping
>>>>>>>> everyone working with the same set of releases is super important.
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> Matei
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> On Nov 11, 2015, at 4:58 AM, Sean Owen <so...@cloudera.com> wrote:
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> On Wed, Nov 11, 2015 at 12:10 AM, Reynold Xin <rx...@databricks.com>
>>>>>>>> wrote:
>>>>>>>>
>>>>>>>> to the Spark community. A major release should not be very
>>>>>>>> different from a
>>>>>>>>
>>>>>>>> minor release and should not be gated based on new features. The
>>>>>>>> main
>>>>>>>>
>>>>>>>> purpose of a major release is an opportunity to fix things that are
>>>>>>>> broken
>>>>>>>>
>>>>>>>> in the current API and remove certain deprecated APIs (examples
>>>>>>>> follow).
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> Agree with this stance. Generally, a major release might also be a
>>>>>>>>
>>>>>>>> time to replace some big old API or implementation with a new one,
>>>>>>>> but
>>>>>>>>
>>>>>>>> I don't see obvious candidates.
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> I wouldn't mind turning attention to 2.x sooner than later, unless
>>>>>>>>
>>>>>>>> there's a fairly good reason to continue adding features in 1.x to a
>>>>>>>>
>>>>>>>> 1.7 release. The scope as of 1.6 is already pretty darned big.
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> 1. Scala 2.11 as the default build. We should still support Scala
>>>>>>>> 2.10, but
>>>>>>>>
>>>>>>>> it has been end-of-life.
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> By the time 2.x rolls around, 2.12 will be the main version, 2.11
>>>>>>>> will
>>>>>>>>
>>>>>>>> be quite stable, and 2.10 will have been EOL for a while. I'd
>>>>>>>> propose
>>>>>>>>
>>>>>>>> dropping 2.10. Otherwise it's supported for 2 more years.
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> 2. Remove Hadoop 1 support.
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> I'd go further to drop support for <2.2 for sure (2.0 and 2.1 were
>>>>>>>>
>>>>>>>> sort of 'alpha' and 'beta' releases) and even <2.6.
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> I'm sure we'll think of a number of other small things -- shading a
>>>>>>>>
>>>>>>>> bunch of stuff? reviewing and updating dependencies in light of
>>>>>>>>
>>>>>>>> simpler, more recent dependencies to support from Hadoop etc?
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> Farming out Tachyon to a module? (I felt like someone proposed
>>>>>>>> this?)
>>>>>>>>
>>>>>>>> Pop out any Docker stuff to another repo?
>>>>>>>>
>>>>>>>> Continue that same effort for EC2?
>>>>>>>>
>>>>>>>> Farming out some of the "external" integrations to another repo (?
>>>>>>>>
>>>>>>>> controversial)
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> See also anything marked version "2+" in JIRA.
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> ---------------------------------------------------------------------
>>>>>>>>
>>>>>>>> To unsubscribe, e-mail: dev-unsubscribe@spark.apache.org
>>>>>>>>
>>>>>>>> For additional commands, e-mail: dev-help@spark.apache.org
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> ---------------------------------------------------------------------
>>>>>>>>
>>>>>>>> To unsubscribe, e-mail: dev-unsubscribe@spark.apache.org
>>>>>>>>
>>>>>>>> For additional commands, e-mail: dev-help@spark.apache.org
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> ---------------------------------------------------------------------
>>>>>>>>
>>>>>>>> To unsubscribe, e-mail: dev-unsubscribe@spark.apache.org
>>>>>>>>
>>>>>>>> For additional commands, e-mail: dev-help@spark.apache.org
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>
>>>>>
>>>>
>>>
>>
>>
>

Re: A proposal for Spark 2.0

Posted by Reynold Xin <rx...@databricks.com>.
I don't think there are any plans for Scala 2.12 support yet. We can always
add Scala 2.12 support later.


On Thu, Nov 26, 2015 at 12:59 PM, Koert Kuipers <ko...@tresata.com> wrote:

> I also thought the idea was to drop 2.10. Do we want to cross build for 3
> scala versions?
> On Nov 25, 2015 3:54 AM, "Sandy Ryza" <sa...@cloudera.com> wrote:
>
>> I see.  My concern is / was that cluster operators will be reluctant to
>> upgrade to 2.0, meaning that developers using those clusters need to stay
>> on 1.x, and, if they want to move to DataFrames, essentially need to port
>> their app twice.
>>
>> I misunderstood and thought part of the proposal was to drop support for
>> 2.10 though.  If your broad point is that there aren't changes in 2.0 that
>> will make it less palatable to cluster administrators than releases in the
>> 1.x line, then yes, 2.0 as the next release sounds fine to me.
>>
>> -Sandy
>>
>>
>> On Tue, Nov 24, 2015 at 11:55 AM, Matei Zaharia <ma...@gmail.com>
>> wrote:
>>
>>> What are the other breaking changes in 2.0 though? Note that we're not
>>> removing Scala 2.10, we're just making the default build be against Scala
>>> 2.11 instead of 2.10. There seem to be very few changes that people would
>>> worry about. If people are going to update their apps, I think it's better
>>> to make the other small changes in 2.0 at the same time than to update once
>>> for Dataset and another time for 2.0.
>>>
>>> BTW just refer to Reynold's original post for the other proposed API
>>> changes.
>>>
>>> Matei
>>>
>>> On Nov 24, 2015, at 12:27 PM, Sandy Ryza <sa...@cloudera.com>
>>> wrote:
>>>
>>> I think that Kostas' logic still holds.  The majority of Spark users,
>>> and likely an even vaster majority of people running vaster jobs, are still
>>> on RDDs and on the cusp of upgrading to DataFrames.  Users will probably
>>> want to upgrade to the stable version of the Dataset / DataFrame API so
>>> they don't need to do so twice.  Requiring that they absorb all the other
>>> ways that Spark breaks compatibility in the move to 2.0 makes it much more
>>> difficult for them to make this transition.
>>>
>>> Using the same set of APIs also means that it will be easier to backport
>>> critical fixes to the 1.x line.
>>>
>>> It's not clear to me that avoiding breakage of an experimental API in
>>> the 1.x line outweighs these issues.
>>>
>>> -Sandy
>>>
>>> On Mon, Nov 23, 2015 at 10:51 PM, Reynold Xin <rx...@databricks.com>
>>> wrote:
>>>
>>>> I actually think the next one (after 1.6) should be Spark 2.0. The
>>>> reason is that I already know we have to break some part of the
>>>> DataFrame/Dataset API as part of the Dataset design. (e.g. DataFrame.map
>>>> should return Dataset rather than RDD). In that case, I'd rather break this
>>>> sooner (in one release) than later (in two releases). so the damage is
>>>> smaller.
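
A simplified sketch of the signature change being referred to, using
illustrative traits rather than the real class definitions:

    import scala.reflect.ClassTag
    import org.apache.spark.rdd.RDD
    import org.apache.spark.sql.{Encoder, Row}

    // Roughly the 1.x shape under discussion: map drops out of the DataFrame world.
    trait DataFrameLike {
      def map[R: ClassTag](f: Row => R): RDD[R]
    }

    // Roughly the proposed 2.0 shape: map stays within the typed Dataset world.
    trait DatasetLike[T] {
      def map[U: Encoder](f: T => U): DatasetLike[U]
    }
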
>>>>
>>>> I don't think whether we call Dataset/DataFrame experimental or not
>>>> matters too much for 2.0. We can still call Dataset experimental in 2.0 and
>>>> then mark them as stable in 2.1. Despite being "experimental", there has
>>>> been no breaking changes to DataFrame from 1.3 to 1.6.
>>>>
>>>>
>>>>
>>>> On Wed, Nov 18, 2015 at 3:43 PM, Mark Hamstra <ma...@clearstorydata.com>
>>>> wrote:
>>>>
>>>>> Ah, got it; by "stabilize" you meant changing the API, not just bug
>>>>> fixing.  We're on the same page now.
>>>>>
>>>>> On Wed, Nov 18, 2015 at 3:39 PM, Kostas Sakellis <ko...@cloudera.com>
>>>>> wrote:
>>>>>
>>>>>> A 1.6.x release will only fix bugs - we typically don't change APIs
>>>>>> in z releases. The Dataset API is experimental and so we might be changing
>>>>>> the APIs before we declare it stable. This is why I think it is important
>>>>>> to first stabilize the Dataset API with a Spark 1.7 release before moving
>>>>>> to Spark 2.0. This will benefit users that would like to use the new
>>>>>> Dataset APIs but can't move to Spark 2.0 because of the backwards
>>>>>> incompatible changes, like removal of deprecated APIs, Scala 2.11 etc.
>>>>>>
>>>>>> Kostas
>>>>>>
>>>>>>
>>>>>> On Fri, Nov 13, 2015 at 12:26 PM, Mark Hamstra <
>>>>>> mark@clearstorydata.com> wrote:
>>>>>>
>>>>>>> Why does stabilization of those two features require a 1.7 release
>>>>>>> instead of 1.6.1?
>>>>>>>
>>>>>>> On Fri, Nov 13, 2015 at 11:40 AM, Kostas Sakellis <
>>>>>>> kostas@cloudera.com> wrote:
>>>>>>>
>>>>>>>> We have veered off the topic of Spark 2.0 a little bit here - yes
>>>>>>>> we can talk about RDD vs. DS/DF more but lets refocus on Spark 2.0. I'd
>>>>>>>> like to propose we have one more 1.x release after Spark 1.6. This will
>>>>>>>> allow us to stabilize a few of the new features that were added in 1.6:
>>>>>>>>
>>>>>>>> 1) the experimental Datasets API
>>>>>>>> 2) the new unified memory manager.
>>>>>>>>
>>>>>>>> I understand our goal for Spark 2.0 is to offer an easy transition
>>>>>>>> but there will be users that won't be able to seamlessly upgrade given what
>>>>>>>> we have discussed as in scope for 2.0. For these users, having a 1.x
>>>>>>>> release with these new features/APIs stabilized will be very beneficial.
>>>>>>>> This might make Spark 1.7 a lighter release but that is not necessarily a
>>>>>>>> bad thing.
>>>>>>>>
>>>>>>>> Any thoughts on this timeline?
>>>>>>>>
>>>>>>>> Kostas Sakellis
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> On Thu, Nov 12, 2015 at 8:39 PM, Cheng, Hao <ha...@intel.com>
>>>>>>>> wrote:
>>>>>>>>
>>>>>>>>> Agree, more features/apis/optimization need to be added in DF/DS.
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> I mean, we need to think about what kind of RDD APIs we have to
>>>>>>>>> provide to developer, maybe the fundamental API is enough, like, the
>>>>>>>>> ShuffledRDD etc..  But PairRDDFunctions probably not in this category, as
>>>>>>>>> we can do the same thing easily with DF/DS, even better performance.
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> *From:* Mark Hamstra [mailto:mark@clearstorydata.com]
>>>>>>>>> *Sent:* Friday, November 13, 2015 11:23 AM
>>>>>>>>> *To:* Stephen Boesch
>>>>>>>>>
>>>>>>>>> *Cc:* dev@spark.apache.org
>>>>>>>>> *Subject:* Re: A proposal for Spark 2.0
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> Hmmm... to me, that seems like precisely the kind of thing that
>>>>>>>>> argues for retaining the RDD API but not as the first thing presented to
>>>>>>>>> new Spark developers: "Here's how to use groupBy with DataFrames.... Until
>>>>>>>>> the optimizer is more fully developed, that won't always get you the best
>>>>>>>>> performance that can be obtained.  In these particular circumstances, ...,
>>>>>>>>> you may want to use the low-level RDD API while setting
>>>>>>>>> preservesPartitioning to true.  Like this...."
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On Thu, Nov 12, 2015 at 7:05 PM, Stephen Boesch <ja...@gmail.com>
>>>>>>>>> wrote:
>>>>>>>>>
>>>>>>>>> My understanding is that  the RDD's presently have more support
>>>>>>>>> for complete control of partitioning which is a key consideration at
>>>>>>>>> scale.  While partitioning control is still piecemeal in  DF/DS  it would
>>>>>>>>> seem premature to make RDD's a second-tier approach to spark dev.
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> An example is the use of groupBy when we know that the source
>>>>>>>>> relation (/RDD) is already partitioned on the grouping expressions.  AFAIK
>>>>>>>>> the spark sql still does not allow that knowledge to be applied to the
>>>>>>>>> optimizer - so a full shuffle will be performed. However in the native RDD
>>>>>>>>> we can use preservesPartitioning=true.
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> 2015-11-12 17:42 GMT-08:00 Mark Hamstra <ma...@clearstorydata.com>:
>>>>>>>>>
>>>>>>>>> The place of the RDD API in 2.0 is also something I've been
>>>>>>>>> wondering about.  I think it may be going too far to deprecate it, but
>>>>>>>>> changing emphasis is something that we might consider.  The RDD API came
>>>>>>>>> well before DataFrames and DataSets, so programming guides, introductory
>>>>>>>>> how-to articles and the like have, to this point, also tended to emphasize
>>>>>>>>> RDDs -- or at least to deal with them early.  What I'm thinking is that
>>>>>>>>> with 2.0 maybe we should overhaul all the documentation to de-emphasize and
>>>>>>>>> reposition RDDs.  In this scheme, DataFrames and DataSets would be
>>>>>>>>> introduced and fully addressed before RDDs.  They would be presented as the
>>>>>>>>> normal/default/standard way to do things in Spark.  RDDs, in contrast,
>>>>>>>>> would be presented later as a kind of lower-level, closer-to-the-metal API
>>>>>>>>> that can be used in atypical, more specialized contexts where DataFrames or
>>>>>>>>> DataSets don't fully fit.
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On Thu, Nov 12, 2015 at 5:17 PM, Cheng, Hao <ha...@intel.com>
>>>>>>>>> wrote:
>>>>>>>>>
>>>>>>>>> I am not sure what the best practice for this specific problem,
>>>>>>>>> but it’s really worth to think about it in 2.0, as it is a painful issue
>>>>>>>>> for lots of users.
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> By the way, is it also an opportunity to deprecate the RDD API (or
>>>>>>>>> internal API only?)? As lots of its functionality overlapping with
>>>>>>>>> DataFrame or DataSet.
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> Hao
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> *From:* Kostas Sakellis [mailto:kostas@cloudera.com]
>>>>>>>>> *Sent:* Friday, November 13, 2015 5:27 AM
>>>>>>>>> *To:* Nicholas Chammas
>>>>>>>>> *Cc:* Ulanov, Alexander; Nan Zhu; witgo@qq.com;
>>>>>>>>> dev@spark.apache.org; Reynold Xin
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> *Subject:* Re: A proposal for Spark 2.0
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> I know we want to keep breaking changes to a minimum but I'm
>>>>>>>>> hoping that with Spark 2.0 we can also look at better classpath isolation
>>>>>>>>> with user programs. I propose we build on
>>>>>>>>> spark.{driver|executor}.userClassPathFirst, setting it true by default, and
>>>>>>>>> not allow any spark transitive dependencies to leak into user code. For
>>>>>>>>> backwards compatibility we can have a whitelist if we want but I'd be good
>>>>>>>>> if we start requiring user apps to explicitly pull in all their
>>>>>>>>> dependencies. From what I can tell, Hadoop 3 is also moving in this
>>>>>>>>> direction.
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> Kostas
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On Thu, Nov 12, 2015 at 9:56 AM, Nicholas Chammas <
>>>>>>>>> nicholas.chammas@gmail.com> wrote:
>>>>>>>>>
>>>>>>>>> With regards to Machine learning, it would be great to move useful
>>>>>>>>> features from MLlib to ML and deprecate the former. Current structure of
>>>>>>>>> two separate machine learning packages seems to be somewhat confusing.
>>>>>>>>>
>>>>>>>>> With regards to GraphX, it would be great to deprecate the use of
>>>>>>>>> RDD in GraphX and switch to Dataframe. This will allow GraphX evolve with
>>>>>>>>> Tungsten.
>>>>>>>>>
>>>>>>>>> On that note of deprecating stuff, it might be good to deprecate
>>>>>>>>> some things in 2.0 without removing or replacing them immediately. That way
>>>>>>>>> 2.0 doesn’t have to wait for everything that we want to deprecate to be
>>>>>>>>> replaced all at once.
>>>>>>>>>
>>>>>>>>> Nick
>>>>>>>>>
>>>>>>>>> ​
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On Thu, Nov 12, 2015 at 12:45 PM Ulanov, Alexander <
>>>>>>>>> alexander.ulanov@hpe.com> wrote:
>>>>>>>>>
>>>>>>>>> Parameter Server is a new feature and thus does not match the goal
>>>>>>>>> of 2.0 is “to fix things that are broken in the current API and remove
>>>>>>>>> certain deprecated APIs”. At the same time I would be happy to have that
>>>>>>>>> feature.
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> With regards to Machine learning, it would be great to move useful
>>>>>>>>> features from MLlib to ML and deprecate the former. Current structure of
>>>>>>>>> two separate machine learning packages seems to be somewhat confusing.
>>>>>>>>>
>>>>>>>>> With regards to GraphX, it would be great to deprecate the use of
>>>>>>>>> RDD in GraphX and switch to Dataframe. This will allow GraphX evolve with
>>>>>>>>> Tungsten.
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> Best regards, Alexander
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> *From:* Nan Zhu [mailto:zhunanmcgill@gmail.com]
>>>>>>>>> *Sent:* Thursday, November 12, 2015 7:28 AM
>>>>>>>>> *To:* witgo@qq.com
>>>>>>>>> *Cc:* dev@spark.apache.org
>>>>>>>>> *Subject:* Re: A proposal for Spark 2.0
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> Being specific to Parameter Server, I think the current agreement
>>>>>>>>> is that PS shall exist as a third-party library instead of a component of
>>>>>>>>> the core code base, isn’t?
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> Best,
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> --
>>>>>>>>>
>>>>>>>>> Nan Zhu
>>>>>>>>>
>>>>>>>>> http://codingcat.me
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On Thursday, November 12, 2015 at 9:49 AM, witgo@qq.com wrote:
>>>>>>>>>
>>>>>>>>> Who has the idea of machine learning? Spark missing some features
>>>>>>>>> for machine learning, For example, the parameter server.
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> 在 2015年11月12日,05:32,Matei Zaharia <ma...@gmail.com> 写道:
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> I like the idea of popping out Tachyon to an optional component
>>>>>>>>> too to reduce the number of dependencies. In the future, it might even be
>>>>>>>>> useful to do this for Hadoop, but it requires too many API changes to be
>>>>>>>>> worth doing now.
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> Regarding Scala 2.12, we should definitely support it eventually,
>>>>>>>>> but I don't think we need to block 2.0 on that because it can be added
>>>>>>>>> later too. Has anyone investigated what it would take to run on there? I
>>>>>>>>> imagine we don't need many code changes, just maybe some REPL stuff.
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> Needless to say, but I'm all for the idea of making "major"
>>>>>>>>> releases as undisruptive as possible in the model Reynold proposed. Keeping
>>>>>>>>> everyone working with the same set of releases is super important.
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> Matei
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On Nov 11, 2015, at 4:58 AM, Sean Owen <so...@cloudera.com> wrote:
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On Wed, Nov 11, 2015 at 12:10 AM, Reynold Xin <rx...@databricks.com>
>>>>>>>>> wrote:
>>>>>>>>>
>>>>>>>>> to the Spark community. A major release should not be very
>>>>>>>>> different from a
>>>>>>>>>
>>>>>>>>> minor release and should not be gated based on new features. The
>>>>>>>>> main
>>>>>>>>>
>>>>>>>>> purpose of a major release is an opportunity to fix things that
>>>>>>>>> are broken
>>>>>>>>>
>>>>>>>>> in the current API and remove certain deprecated APIs (examples
>>>>>>>>> follow).
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> Agree with this stance. Generally, a major release might also be a
>>>>>>>>>
>>>>>>>>> time to replace some big old API or implementation with a new one,
>>>>>>>>> but
>>>>>>>>>
>>>>>>>>> I don't see obvious candidates.
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> I wouldn't mind turning attention to 2.x sooner than later, unless
>>>>>>>>>
>>>>>>>>> there's a fairly good reason to continue adding features in 1.x to
>>>>>>>>> a
>>>>>>>>>
>>>>>>>>> 1.7 release. The scope as of 1.6 is already pretty darned big.
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> 1. Scala 2.11 as the default build. We should still support Scala
>>>>>>>>> 2.10, but
>>>>>>>>>
>>>>>>>>> it has been end-of-life.
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> By the time 2.x rolls around, 2.12 will be the main version, 2.11
>>>>>>>>> will
>>>>>>>>>
>>>>>>>>> be quite stable, and 2.10 will have been EOL for a while. I'd
>>>>>>>>> propose
>>>>>>>>>
>>>>>>>>> dropping 2.10. Otherwise it's supported for 2 more years.
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> 2. Remove Hadoop 1 support.
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> I'd go further to drop support for <2.2 for sure (2.0 and 2.1 were
>>>>>>>>>
>>>>>>>>> sort of 'alpha' and 'beta' releases) and even <2.6.
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> I'm sure we'll think of a number of other small things -- shading a
>>>>>>>>>
>>>>>>>>> bunch of stuff? reviewing and updating dependencies in light of
>>>>>>>>>
>>>>>>>>> simpler, more recent dependencies to support from Hadoop etc?
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> Farming out Tachyon to a module? (I felt like someone proposed
>>>>>>>>> this?)
>>>>>>>>>
>>>>>>>>> Pop out any Docker stuff to another repo?
>>>>>>>>>
>>>>>>>>> Continue that same effort for EC2?
>>>>>>>>>
>>>>>>>>> Farming out some of the "external" integrations to another repo (?
>>>>>>>>>
>>>>>>>>> controversial)
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> See also anything marked version "2+" in JIRA.
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> ---------------------------------------------------------------------
>>>>>>>>>
>>>>>>>>> To unsubscribe, e-mail: dev-unsubscribe@spark.apache.org
>>>>>>>>>
>>>>>>>>> For additional commands, e-mail: dev-help@spark.apache.org
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> ---------------------------------------------------------------------
>>>>>>>>>
>>>>>>>>> To unsubscribe, e-mail: dev-unsubscribe@spark.apache.org
>>>>>>>>>
>>>>>>>>> For additional commands, e-mail: dev-help@spark.apache.org
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> ---------------------------------------------------------------------
>>>>>>>>>
>>>>>>>>> To unsubscribe, e-mail: dev-unsubscribe@spark.apache.org
>>>>>>>>>
>>>>>>>>> For additional commands, e-mail: dev-help@spark.apache.org
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>
>>>>>
>>>>
>>>
>>>
>>

Re: A proposal for Spark 2.0

Posted by Sean Owen <so...@cloudera.com>.
Pardon for tacking one more message onto this thread, but I'm
reminded of one more issue when building the RC today: Scala 2.10 does
not, in general, aim to support Java 8, and indeed I have never been
able to fully compile Spark with Java 8 on Ubuntu or OS X due to
scalac assertion errors. 2.11 is the first Scala version that is
supposed to work with Java 8. This may be a good reason to drop 2.10
by the time 2.0 comes around.

On Thu, Nov 26, 2015 at 8:59 PM, Koert Kuipers <ko...@tresata.com> wrote:
> I also thought the idea was to drop 2.10. Do we want to cross build for 3
> scala versions?
>

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@spark.apache.org
For additional commands, e-mail: dev-help@spark.apache.org


Re: A proposal for Spark 2.0

Posted by Koert Kuipers <ko...@tresata.com>.
I also thought the idea was to drop 2.10. Do we want to cross build for 3
scala versions?
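
To make "cross build" concrete, a hedged build.sbt sketch of what supporting
three Scala binary versions would mean on the build side (the project name
and version numbers are placeholders):

    // build.sbt sketch of cross building one project against three Scala
    // binary versions; at the time of this thread 2.12 had not shipped, so
    // its entry here is purely hypothetical.
    name := "my-spark-app"

    scalaVersion := "2.11.7"
    crossScalaVersions := Seq("2.10.6", "2.11.7", "2.12.0")

    // Prefixing any task with '+', e.g. `sbt +package`, runs it once per
    // crossScalaVersions entry and produces _2.10 / _2.11 / _2.12 artifacts.
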
On Nov 25, 2015 3:54 AM, "Sandy Ryza" <sa...@cloudera.com> wrote:

> I see.  My concern is / was that cluster operators will be reluctant to
> upgrade to 2.0, meaning that developers using those clusters need to stay
> on 1.x, and, if they want to move to DataFrames, essentially need to port
> their app twice.
>
> I misunderstood and thought part of the proposal was to drop support for
> 2.10 though.  If your broad point is that there aren't changes in 2.0 that
> will make it less palatable to cluster administrators than releases in the
> 1.x line, then yes, 2.0 as the next release sounds fine to me.
>
> -Sandy
>
>
> On Tue, Nov 24, 2015 at 11:55 AM, Matei Zaharia <ma...@gmail.com>
> wrote:
>
>> What are the other breaking changes in 2.0 though? Note that we're not
>> removing Scala 2.10, we're just making the default build be against Scala
>> 2.11 instead of 2.10. There seem to be very few changes that people would
>> worry about. If people are going to update their apps, I think it's better
>> to make the other small changes in 2.0 at the same time than to update once
>> for Dataset and another time for 2.0.
>>
>> BTW just refer to Reynold's original post for the other proposed API
>> changes.
>>
>> Matei
>>
>> On Nov 24, 2015, at 12:27 PM, Sandy Ryza <sa...@cloudera.com> wrote:
>>
>> I think that Kostas' logic still holds.  The majority of Spark users, and
>> likely an even vaster majority of people running vaster jobs, are still on
>> RDDs and on the cusp of upgrading to DataFrames.  Users will probably want
>> to upgrade to the stable version of the Dataset / DataFrame API so they
>> don't need to do so twice.  Requiring that they absorb all the other ways
>> that Spark breaks compatibility in the move to 2.0 makes it much more
>> difficult for them to make this transition.
>>
>> Using the same set of APIs also means that it will be easier to backport
>> critical fixes to the 1.x line.
>>
>> It's not clear to me that avoiding breakage of an experimental API in the
>> 1.x line outweighs these issues.
>>
>> -Sandy
>>
>> On Mon, Nov 23, 2015 at 10:51 PM, Reynold Xin <rx...@databricks.com>
>> wrote:
>>
>>> I actually think the next one (after 1.6) should be Spark 2.0. The
>>> reason is that I already know we have to break some part of the
>>> DataFrame/Dataset API as part of the Dataset design. (e.g. DataFrame.map
>>> should return Dataset rather than RDD). In that case, I'd rather break this
>>> sooner (in one release) than later (in two releases). so the damage is
>>> smaller.
>>>
>>> I don't think whether we call Dataset/DataFrame experimental or not
>>> matters too much for 2.0. We can still call Dataset experimental in 2.0 and
>>> then mark them as stable in 2.1. Despite being "experimental", there has
>>> been no breaking changes to DataFrame from 1.3 to 1.6.
>>>
>>>
>>>
>>> On Wed, Nov 18, 2015 at 3:43 PM, Mark Hamstra <ma...@clearstorydata.com>
>>> wrote:
>>>
>>>> Ah, got it; by "stabilize" you meant changing the API, not just bug
>>>> fixing.  We're on the same page now.
>>>>
>>>> On Wed, Nov 18, 2015 at 3:39 PM, Kostas Sakellis <ko...@cloudera.com>
>>>> wrote:
>>>>
>>>>> A 1.6.x release will only fix bugs - we typically don't change APIs in
>>>>> z releases. The Dataset API is experimental and so we might be changing the
>>>>> APIs before we declare it stable. This is why I think it is important to
>>>>> first stabilize the Dataset API with a Spark 1.7 release before moving to
>>>>> Spark 2.0. This will benefit users that would like to use the new Dataset
>>>>> APIs but can't move to Spark 2.0 because of the backwards incompatible
>>>>> changes, like removal of deprecated APIs, Scala 2.11 etc.
>>>>>
>>>>> Kostas
>>>>>
>>>>>
>>>>> On Fri, Nov 13, 2015 at 12:26 PM, Mark Hamstra <
>>>>> mark@clearstorydata.com> wrote:
>>>>>
>>>>>> Why does stabilization of those two features require a 1.7 release
>>>>>> instead of 1.6.1?
>>>>>>
>>>>>> On Fri, Nov 13, 2015 at 11:40 AM, Kostas Sakellis <
>>>>>> kostas@cloudera.com> wrote:
>>>>>>
>>>>>>> We have veered off the topic of Spark 2.0 a little bit here - yes we
>>>>>>> can talk about RDD vs. DS/DF more but lets refocus on Spark 2.0. I'd like
>>>>>>> to propose we have one more 1.x release after Spark 1.6. This will allow us
>>>>>>> to stabilize a few of the new features that were added in 1.6:
>>>>>>>
>>>>>>> 1) the experimental Datasets API
>>>>>>> 2) the new unified memory manager.
>>>>>>>
>>>>>>> I understand our goal for Spark 2.0 is to offer an easy transition
>>>>>>> but there will be users that won't be able to seamlessly upgrade given what
>>>>>>> we have discussed as in scope for 2.0. For these users, having a 1.x
>>>>>>> release with these new features/APIs stabilized will be very beneficial.
>>>>>>> This might make Spark 1.7 a lighter release but that is not necessarily a
>>>>>>> bad thing.
>>>>>>>
>>>>>>> Any thoughts on this timeline?
>>>>>>>
>>>>>>> Kostas Sakellis
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> On Thu, Nov 12, 2015 at 8:39 PM, Cheng, Hao <ha...@intel.com>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> Agree, more features/apis/optimization need to be added in DF/DS.
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> I mean, we need to think about what kind of RDD APIs we have to
>>>>>>>> provide to developer, maybe the fundamental API is enough, like, the
>>>>>>>> ShuffledRDD etc..  But PairRDDFunctions probably not in this category, as
>>>>>>>> we can do the same thing easily with DF/DS, even better performance.
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> *From:* Mark Hamstra [mailto:mark@clearstorydata.com]
>>>>>>>> *Sent:* Friday, November 13, 2015 11:23 AM
>>>>>>>> *To:* Stephen Boesch
>>>>>>>>
>>>>>>>> *Cc:* dev@spark.apache.org
>>>>>>>> *Subject:* Re: A proposal for Spark 2.0
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> Hmmm... to me, that seems like precisely the kind of thing that
>>>>>>>> argues for retaining the RDD API but not as the first thing presented to
>>>>>>>> new Spark developers: "Here's how to use groupBy with DataFrames.... Until
>>>>>>>> the optimizer is more fully developed, that won't always get you the best
>>>>>>>> performance that can be obtained.  In these particular circumstances, ...,
>>>>>>>> you may want to use the low-level RDD API while setting
>>>>>>>> preservesPartitioning to true.  Like this...."
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> On Thu, Nov 12, 2015 at 7:05 PM, Stephen Boesch <ja...@gmail.com>
>>>>>>>> wrote:
>>>>>>>>
>>>>>>>> My understanding is that  the RDD's presently have more support for
>>>>>>>> complete control of partitioning which is a key consideration at scale.
>>>>>>>> While partitioning control is still piecemeal in  DF/DS  it would seem
>>>>>>>> premature to make RDD's a second-tier approach to spark dev.
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> An example is the use of groupBy when we know that the source
>>>>>>>> relation (/RDD) is already partitioned on the grouping expressions.  AFAIK
>>>>>>>> the spark sql still does not allow that knowledge to be applied to the
>>>>>>>> optimizer - so a full shuffle will be performed. However in the native RDD
>>>>>>>> we can use preservesPartitioning=true.
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> 2015-11-12 17:42 GMT-08:00 Mark Hamstra <ma...@clearstorydata.com>:
>>>>>>>>
>>>>>>>> The place of the RDD API in 2.0 is also something I've been
>>>>>>>> wondering about.  I think it may be going too far to deprecate it, but
>>>>>>>> changing emphasis is something that we might consider.  The RDD API came
>>>>>>>> well before DataFrames and DataSets, so programming guides, introductory
>>>>>>>> how-to articles and the like have, to this point, also tended to emphasize
>>>>>>>> RDDs -- or at least to deal with them early.  What I'm thinking is that
>>>>>>>> with 2.0 maybe we should overhaul all the documentation to de-emphasize and
>>>>>>>> reposition RDDs.  In this scheme, DataFrames and DataSets would be
>>>>>>>> introduced and fully addressed before RDDs.  They would be presented as the
>>>>>>>> normal/default/standard way to do things in Spark.  RDDs, in contrast,
>>>>>>>> would be presented later as a kind of lower-level, closer-to-the-metal API
>>>>>>>> that can be used in atypical, more specialized contexts where DataFrames or
>>>>>>>> DataSets don't fully fit.
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> On Thu, Nov 12, 2015 at 5:17 PM, Cheng, Hao <ha...@intel.com>
>>>>>>>> wrote:
>>>>>>>>
>>>>>>>> I am not sure what the best practice for this specific problem, but
>>>>>>>> it’s really worth to think about it in 2.0, as it is a painful issue for
>>>>>>>> lots of users.
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> By the way, is it also an opportunity to deprecate the RDD API (or
>>>>>>>> internal API only?)? As lots of its functionality overlapping with
>>>>>>>> DataFrame or DataSet.
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> Hao
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> *From:* Kostas Sakellis [mailto:kostas@cloudera.com]
>>>>>>>> *Sent:* Friday, November 13, 2015 5:27 AM
>>>>>>>> *To:* Nicholas Chammas
>>>>>>>> *Cc:* Ulanov, Alexander; Nan Zhu; witgo@qq.com;
>>>>>>>> dev@spark.apache.org; Reynold Xin
>>>>>>>>
>>>>>>>>
>>>>>>>> *Subject:* Re: A proposal for Spark 2.0
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> I know we want to keep breaking changes to a minimum but I'm hoping
>>>>>>>> that with Spark 2.0 we can also look at better classpath isolation with
>>>>>>>> user programs. I propose we build on
>>>>>>>> spark.{driver|executor}.userClassPathFirst, setting it true by default, and
>>>>>>>> not allow any spark transitive dependencies to leak into user code. For
>>>>>>>> backwards compatibility we can have a whitelist if we want but I'd be good
>>>>>>>> if we start requiring user apps to explicitly pull in all their
>>>>>>>> dependencies. From what I can tell, Hadoop 3 is also moving in this
>>>>>>>> direction.
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> Kostas
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> On Thu, Nov 12, 2015 at 9:56 AM, Nicholas Chammas <
>>>>>>>> nicholas.chammas@gmail.com> wrote:
>>>>>>>>
>>>>>>>> With regards to Machine learning, it would be great to move useful
>>>>>>>> features from MLlib to ML and deprecate the former. Current structure of
>>>>>>>> two separate machine learning packages seems to be somewhat confusing.
>>>>>>>>
>>>>>>>> With regards to GraphX, it would be great to deprecate the use of
>>>>>>>> RDD in GraphX and switch to Dataframe. This will allow GraphX evolve with
>>>>>>>> Tungsten.
>>>>>>>>
>>>>>>>> On that note of deprecating stuff, it might be good to deprecate
>>>>>>>> some things in 2.0 without removing or replacing them immediately. That way
>>>>>>>> 2.0 doesn’t have to wait for everything that we want to deprecate to be
>>>>>>>> replaced all at once.
>>>>>>>>
>>>>>>>> Nick
>>>>>>>>
>>>>>>>> ​
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> On Thu, Nov 12, 2015 at 12:45 PM Ulanov, Alexander <
>>>>>>>> alexander.ulanov@hpe.com> wrote:
>>>>>>>>
>>>>>>>> Parameter Server is a new feature and thus does not match the goal
>>>>>>>> of 2.0 is “to fix things that are broken in the current API and remove
>>>>>>>> certain deprecated APIs”. At the same time I would be happy to have that
>>>>>>>> feature.
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> With regards to Machine learning, it would be great to move useful
>>>>>>>> features from MLlib to ML and deprecate the former. Current structure of
>>>>>>>> two separate machine learning packages seems to be somewhat confusing.
>>>>>>>>
>>>>>>>> With regards to GraphX, it would be great to deprecate the use of
>>>>>>>> RDD in GraphX and switch to Dataframe. This will allow GraphX evolve with
>>>>>>>> Tungsten.
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> Best regards, Alexander
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> *From:* Nan Zhu [mailto:zhunanmcgill@gmail.com]
>>>>>>>> *Sent:* Thursday, November 12, 2015 7:28 AM
>>>>>>>> *To:* witgo@qq.com
>>>>>>>> *Cc:* dev@spark.apache.org
>>>>>>>> *Subject:* Re: A proposal for Spark 2.0
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> Being specific to Parameter Server, I think the current agreement
>>>>>>>> is that PS shall exist as a third-party library instead of a component of
>>>>>>>> the core code base, isn’t?
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> Best,
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> --
>>>>>>>>
>>>>>>>> Nan Zhu
>>>>>>>>
>>>>>>>> http://codingcat.me
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> On Thursday, November 12, 2015 at 9:49 AM, witgo@qq.com wrote:
>>>>>>>>
>>>>>>>> Who has the idea of machine learning? Spark missing some features
>>>>>>>> for machine learning, For example, the parameter server.
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> 在 2015年11月12日,05:32,Matei Zaharia <ma...@gmail.com> 写道:
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> I like the idea of popping out Tachyon to an optional component too
>>>>>>>> to reduce the number of dependencies. In the future, it might even be
>>>>>>>> useful to do this for Hadoop, but it requires too many API changes to be
>>>>>>>> worth doing now.
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> Regarding Scala 2.12, we should definitely support it eventually,
>>>>>>>> but I don't think we need to block 2.0 on that because it can be added
>>>>>>>> later too. Has anyone investigated what it would take to run on there? I
>>>>>>>> imagine we don't need many code changes, just maybe some REPL stuff.
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> Needless to say, but I'm all for the idea of making "major"
>>>>>>>> releases as undisruptive as possible in the model Reynold proposed. Keeping
>>>>>>>> everyone working with the same set of releases is super important.
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> Matei
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> On Nov 11, 2015, at 4:58 AM, Sean Owen <so...@cloudera.com> wrote:
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> On Wed, Nov 11, 2015 at 12:10 AM, Reynold Xin <rx...@databricks.com>
>>>>>>>> wrote:
>>>>>>>>
>>>>>>>> to the Spark community. A major release should not be very
>>>>>>>> different from a
>>>>>>>>
>>>>>>>> minor release and should not be gated based on new features. The
>>>>>>>> main
>>>>>>>>
>>>>>>>> purpose of a major release is an opportunity to fix things that are
>>>>>>>> broken
>>>>>>>>
>>>>>>>> in the current API and remove certain deprecated APIs (examples
>>>>>>>> follow).
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> Agree with this stance. Generally, a major release might also be a
>>>>>>>>
>>>>>>>> time to replace some big old API or implementation with a new one,
>>>>>>>> but
>>>>>>>>
>>>>>>>> I don't see obvious candidates.
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> I wouldn't mind turning attention to 2.x sooner than later, unless
>>>>>>>>
>>>>>>>> there's a fairly good reason to continue adding features in 1.x to a
>>>>>>>>
>>>>>>>> 1.7 release. The scope as of 1.6 is already pretty darned big.
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> 1. Scala 2.11 as the default build. We should still support Scala
>>>>>>>> 2.10, but
>>>>>>>>
>>>>>>>> it has been end-of-life.
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> By the time 2.x rolls around, 2.12 will be the main version, 2.11
>>>>>>>> will
>>>>>>>>
>>>>>>>> be quite stable, and 2.10 will have been EOL for a while. I'd
>>>>>>>> propose
>>>>>>>>
>>>>>>>> dropping 2.10. Otherwise it's supported for 2 more years.
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> 2. Remove Hadoop 1 support.
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> I'd go further to drop support for <2.2 for sure (2.0 and 2.1 were
>>>>>>>>
>>>>>>>> sort of 'alpha' and 'beta' releases) and even <2.6.
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> I'm sure we'll think of a number of other small things -- shading a
>>>>>>>>
>>>>>>>> bunch of stuff? reviewing and updating dependencies in light of
>>>>>>>>
>>>>>>>> simpler, more recent dependencies to support from Hadoop etc?
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> Farming out Tachyon to a module? (I felt like someone proposed
>>>>>>>> this?)
>>>>>>>>
>>>>>>>> Pop out any Docker stuff to another repo?
>>>>>>>>
>>>>>>>> Continue that same effort for EC2?
>>>>>>>>
>>>>>>>> Farming out some of the "external" integrations to another repo (?
>>>>>>>>
>>>>>>>> controversial)
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> See also anything marked version "2+" in JIRA.
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> ---------------------------------------------------------------------
>>>>>>>>
>>>>>>>> To unsubscribe, e-mail: dev-unsubscribe@spark.apache.org
>>>>>>>>
>>>>>>>> For additional commands, e-mail: dev-help@spark.apache.org
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> ---------------------------------------------------------------------
>>>>>>>>
>>>>>>>> To unsubscribe, e-mail: dev-unsubscribe@spark.apache.org
>>>>>>>>
>>>>>>>> For additional commands, e-mail: dev-help@spark.apache.org
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> ---------------------------------------------------------------------
>>>>>>>>
>>>>>>>> To unsubscribe, e-mail: dev-unsubscribe@spark.apache.org
>>>>>>>>
>>>>>>>> For additional commands, e-mail: dev-help@spark.apache.org
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>
>>>>>
>>>>
>>>
>>
>>
>

Re: A proposal for Spark 2.0

Posted by Steve Loughran <st...@hortonworks.com>.
> On 25 Nov 2015, at 08:54, Sandy Ryza <sa...@cloudera.com> wrote:
> 
> I see.  My concern is / was that cluster operators will be reluctant to upgrade to 2.0, meaning that developers using those clusters need to stay on 1.x, and, if they want to move to DataFrames, essentially need to port their app twice.
> 
> I misunderstood and thought part of the proposal was to drop support for 2.10 though.  If your broad point is that there aren't changes in 2.0 that will make it less palatable to cluster administrators than releases in the 1.x line, then yes, 2.0 as the next release sounds fine to me.
> 
> -Sandy
> 

Mixing Spark versions in a YARN cluster with compatible Hadoop native libs isn't so hard: users just deploy them separately.

But: 

- mixing Scala versions is going to be tricky unless the jobs people submit are configured with the right paths for each build
- the history server will need to run the latest Spark version being executed in the cluster

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@spark.apache.org
For additional commands, e-mail: dev-help@spark.apache.org


Re: A proposal for Spark 2.0

Posted by Sandy Ryza <sa...@cloudera.com>.
I see.  My concern is / was that cluster operators will be reluctant to
upgrade to 2.0, meaning that developers using those clusters need to stay
on 1.x, and, if they want to move to DataFrames, essentially need to port
their app twice.

I misunderstood and thought part of the proposal was to drop support for
2.10 though.  If your broad point is that there aren't changes in 2.0 that
will make it less palatable to cluster administrators than releases in the
1.x line, then yes, 2.0 as the next release sounds fine to me.

-Sandy


On Tue, Nov 24, 2015 at 11:55 AM, Matei Zaharia <ma...@gmail.com>
wrote:

> What are the other breaking changes in 2.0 though? Note that we're not
> removing Scala 2.10, we're just making the default build be against Scala
> 2.11 instead of 2.10. There seem to be very few changes that people would
> worry about. If people are going to update their apps, I think it's better
> to make the other small changes in 2.0 at the same time than to update once
> for Dataset and another time for 2.0.
>
> BTW just refer to Reynold's original post for the other proposed API
> changes.
>
> Matei
>
> On Nov 24, 2015, at 12:27 PM, Sandy Ryza <sa...@cloudera.com> wrote:
>
> I think that Kostas' logic still holds.  The majority of Spark users, and
> likely an even vaster majority of people running vaster jobs, are still on
> RDDs and on the cusp of upgrading to DataFrames.  Users will probably want
> to upgrade to the stable version of the Dataset / DataFrame API so they
> don't need to do so twice.  Requiring that they absorb all the other ways
> that Spark breaks compatibility in the move to 2.0 makes it much more
> difficult for them to make this transition.
>
> Using the same set of APIs also means that it will be easier to backport
> critical fixes to the 1.x line.
>
> It's not clear to me that avoiding breakage of an experimental API in the
> 1.x line outweighs these issues.
>
> -Sandy
>
> On Mon, Nov 23, 2015 at 10:51 PM, Reynold Xin <rx...@databricks.com> wrote:
>
>> I actually think the next one (after 1.6) should be Spark 2.0. The reason
>> is that I already know we have to break some part of the DataFrame/Dataset
>> API as part of the Dataset design. (e.g. DataFrame.map should return
>> Dataset rather than RDD). In that case, I'd rather break this sooner (in
>> one release) than later (in two releases). so the damage is smaller.
>>
>> I don't think whether we call Dataset/DataFrame experimental or not
>> matters too much for 2.0. We can still call Dataset experimental in 2.0 and
>> then mark them as stable in 2.1. Despite being "experimental", there has
>> been no breaking changes to DataFrame from 1.3 to 1.6.
>>
>>
>>
>> On Wed, Nov 18, 2015 at 3:43 PM, Mark Hamstra <ma...@clearstorydata.com>
>> wrote:
>>
>>> Ah, got it; by "stabilize" you meant changing the API, not just bug
>>> fixing.  We're on the same page now.
>>>
>>> On Wed, Nov 18, 2015 at 3:39 PM, Kostas Sakellis <ko...@cloudera.com>
>>> wrote:
>>>
>>>> A 1.6.x release will only fix bugs - we typically don't change APIs in
>>>> z releases. The Dataset API is experimental and so we might be changing the
>>>> APIs before we declare it stable. This is why I think it is important to
>>>> first stabilize the Dataset API with a Spark 1.7 release before moving to
>>>> Spark 2.0. This will benefit users that would like to use the new Dataset
>>>> APIs but can't move to Spark 2.0 because of the backwards incompatible
>>>> changes, like removal of deprecated APIs, Scala 2.11 etc.
>>>>
>>>> Kostas
>>>>
>>>>
>>>> On Fri, Nov 13, 2015 at 12:26 PM, Mark Hamstra <mark@clearstorydata.com
>>>> > wrote:
>>>>
>>>>> Why does stabilization of those two features require a 1.7 release
>>>>> instead of 1.6.1?
>>>>>
>>>>> On Fri, Nov 13, 2015 at 11:40 AM, Kostas Sakellis <kostas@cloudera.com
>>>>> > wrote:
>>>>>
>>>>>> We have veered off the topic of Spark 2.0 a little bit here - yes we
>>>>>> can talk about RDD vs. DS/DF more but lets refocus on Spark 2.0. I'd like
>>>>>> to propose we have one more 1.x release after Spark 1.6. This will allow us
>>>>>> to stabilize a few of the new features that were added in 1.6:
>>>>>>
>>>>>> 1) the experimental Datasets API
>>>>>> 2) the new unified memory manager.
>>>>>>
>>>>>> I understand our goal for Spark 2.0 is to offer an easy transition
>>>>>> but there will be users that won't be able to seamlessly upgrade given what
>>>>>> we have discussed as in scope for 2.0. For these users, having a 1.x
>>>>>> release with these new features/APIs stabilized will be very beneficial.
>>>>>> This might make Spark 1.7 a lighter release but that is not necessarily a
>>>>>> bad thing.
>>>>>>
>>>>>> Any thoughts on this timeline?
>>>>>>
>>>>>> Kostas Sakellis
>>>>>>
>>>>>>
>>>>>>
>>>>>> On Thu, Nov 12, 2015 at 8:39 PM, Cheng, Hao <ha...@intel.com>
>>>>>> wrote:
>>>>>>
>>>>>>> Agree, more features/apis/optimization need to be added in DF/DS.
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> I mean, we need to think about what kind of RDD APIs we have to
>>>>>>> provide to developer, maybe the fundamental API is enough, like, the
>>>>>>> ShuffledRDD etc..  But PairRDDFunctions probably not in this category, as
>>>>>>> we can do the same thing easily with DF/DS, even better performance.
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> *From:* Mark Hamstra [mailto:mark@clearstorydata.com]
>>>>>>> *Sent:* Friday, November 13, 2015 11:23 AM
>>>>>>> *To:* Stephen Boesch
>>>>>>>
>>>>>>> *Cc:* dev@spark.apache.org
>>>>>>> *Subject:* Re: A proposal for Spark 2.0
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> Hmmm... to me, that seems like precisely the kind of thing that
>>>>>>> argues for retaining the RDD API but not as the first thing presented to
>>>>>>> new Spark developers: "Here's how to use groupBy with DataFrames.... Until
>>>>>>> the optimizer is more fully developed, that won't always get you the best
>>>>>>> performance that can be obtained.  In these particular circumstances, ...,
>>>>>>> you may want to use the low-level RDD API while setting
>>>>>>> preservesPartitioning to true.  Like this...."
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> On Thu, Nov 12, 2015 at 7:05 PM, Stephen Boesch <ja...@gmail.com>
>>>>>>> wrote:
>>>>>>>
>>>>>>> My understanding is that  the RDD's presently have more support for
>>>>>>> complete control of partitioning which is a key consideration at scale.
>>>>>>> While partitioning control is still piecemeal in  DF/DS  it would seem
>>>>>>> premature to make RDD's a second-tier approach to spark dev.
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> An example is the use of groupBy when we know that the source
>>>>>>> relation (/RDD) is already partitioned on the grouping expressions.  AFAIK
>>>>>>> the spark sql still does not allow that knowledge to be applied to the
>>>>>>> optimizer - so a full shuffle will be performed. However in the native RDD
>>>>>>> we can use preservesPartitioning=true.
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> 2015-11-12 17:42 GMT-08:00 Mark Hamstra <ma...@clearstorydata.com>:
>>>>>>>
>>>>>>> The place of the RDD API in 2.0 is also something I've been
>>>>>>> wondering about.  I think it may be going too far to deprecate it, but
>>>>>>> changing emphasis is something that we might consider.  The RDD API came
>>>>>>> well before DataFrames and DataSets, so programming guides, introductory
>>>>>>> how-to articles and the like have, to this point, also tended to emphasize
>>>>>>> RDDs -- or at least to deal with them early.  What I'm thinking is that
>>>>>>> with 2.0 maybe we should overhaul all the documentation to de-emphasize and
>>>>>>> reposition RDDs.  In this scheme, DataFrames and DataSets would be
>>>>>>> introduced and fully addressed before RDDs.  They would be presented as the
>>>>>>> normal/default/standard way to do things in Spark.  RDDs, in contrast,
>>>>>>> would be presented later as a kind of lower-level, closer-to-the-metal API
>>>>>>> that can be used in atypical, more specialized contexts where DataFrames or
>>>>>>> DataSets don't fully fit.
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> On Thu, Nov 12, 2015 at 5:17 PM, Cheng, Hao <ha...@intel.com>
>>>>>>> wrote:
>>>>>>>
>>>>>>> I am not sure what the best practice for this specific problem, but
>>>>>>> it’s really worth to think about it in 2.0, as it is a painful issue for
>>>>>>> lots of users.
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> By the way, is it also an opportunity to deprecate the RDD API (or
>>>>>>> internal API only?)? As lots of its functionality overlapping with
>>>>>>> DataFrame or DataSet.
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> Hao
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> *From:* Kostas Sakellis [mailto:kostas@cloudera.com]
>>>>>>> *Sent:* Friday, November 13, 2015 5:27 AM
>>>>>>> *To:* Nicholas Chammas
>>>>>>> *Cc:* Ulanov, Alexander; Nan Zhu; witgo@qq.com; dev@spark.apache.org;
>>>>>>> Reynold Xin
>>>>>>>
>>>>>>>
>>>>>>> *Subject:* Re: A proposal for Spark 2.0
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> I know we want to keep breaking changes to a minimum but I'm hoping
>>>>>>> that with Spark 2.0 we can also look at better classpath isolation with
>>>>>>> user programs. I propose we build on
>>>>>>> spark.{driver|executor}.userClassPathFirst, setting it true by default, and
>>>>>>> not allow any spark transitive dependencies to leak into user code. For
>>>>>>> backwards compatibility we can have a whitelist if we want but it'd be good
>>>>>>> if we start requiring user apps to explicitly pull in all their
>>>>>>> dependencies. From what I can tell, Hadoop 3 is also moving in this
>>>>>>> direction.
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> Kostas
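The two settings mentioned above already exist in the 1.x line, so a small
sketch of opting into them explicitly today looks roughly like this (the
application name is made up for illustration):

    import org.apache.spark.{SparkConf, SparkContext}

    // Existing 1.x configuration keys that this proposal would, in effect,
    // turn on by default in 2.0.
    val conf = new SparkConf()
      .setAppName("user-classpath-first-sketch")
      .set("spark.driver.userClassPathFirst", "true")
      .set("spark.executor.userClassPathFirst", "true")

    val sc = new SparkContext(conf)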
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> On Thu, Nov 12, 2015 at 9:56 AM, Nicholas Chammas <
>>>>>>> nicholas.chammas@gmail.com> wrote:
>>>>>>>
>>>>>>> With regards to Machine learning, it would be great to move useful
>>>>>>> features from MLlib to ML and deprecate the former. Current structure of
>>>>>>> two separate machine learning packages seems to be somewhat confusing.
>>>>>>>
>>>>>>> With regards to GraphX, it would be great to deprecate the use of
>>>>>>> RDD in GraphX and switch to Dataframe. This will allow GraphX evolve with
>>>>>>> Tungsten.
>>>>>>>
>>>>>>> On that note of deprecating stuff, it might be good to deprecate
>>>>>>> some things in 2.0 without removing or replacing them immediately. That way
>>>>>>> 2.0 doesn’t have to wait for everything that we want to deprecate to be
>>>>>>> replaced all at once.
>>>>>>>
>>>>>>> Nick
>>>>>>>
>>>>>>> ​
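For readers less familiar with the two packages being discussed, here is a
rough sketch of the same algorithm reached through both entry points, written
against the 1.6-era APIs and assuming a spark-shell session where sc and
sqlContext are already defined (the data is made up):

    import org.apache.spark.mllib.clustering.{KMeans => MLlibKMeans}
    import org.apache.spark.mllib.linalg.Vectors
    import org.apache.spark.ml.clustering.{KMeans => MLKMeans}

    val points = sc.parallelize(Seq(Vectors.dense(0.0, 0.0), Vectors.dense(9.0, 9.0)))

    // RDD-based org.apache.spark.mllib API.
    val mllibModel = MLlibKMeans.train(points, 2, 20)

    // DataFrame-based org.apache.spark.ml pipeline API for the same task.
    val df = sqlContext.createDataFrame(points.map(Tuple1.apply)).toDF("features")
    val mlModel = new MLKMeans().setK(2).fit(df)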
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> On Thu, Nov 12, 2015 at 12:45 PM Ulanov, Alexander <
>>>>>>> alexander.ulanov@hpe.com> wrote:
>>>>>>>
>>>>>>> Parameter Server is a new feature and thus does not match the goal
>>>>>>> of 2.0, which is “to fix things that are broken in the current API and
>>>>>>> remove certain deprecated APIs”. At the same time, I would be happy to
>>>>>>> have that feature.
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> With regards to Machine learning, it would be great to move useful
>>>>>>> features from MLlib to ML and deprecate the former. Current structure of
>>>>>>> two separate machine learning packages seems to be somewhat confusing.
>>>>>>>
>>>>>>> With regards to GraphX, it would be great to deprecate the use of
>>>>>>> RDD in GraphX and switch to Dataframe. This will allow GraphX evolve with
>>>>>>> Tungsten.
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> Best regards, Alexander
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> *From:* Nan Zhu [mailto:zhunanmcgill@gmail.com]
>>>>>>> *Sent:* Thursday, November 12, 2015 7:28 AM
>>>>>>> *To:* witgo@qq.com
>>>>>>> *Cc:* dev@spark.apache.org
>>>>>>> *Subject:* Re: A proposal for Spark 2.0
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> Being specific to Parameter Server, I think the current agreement is
>>>>>>> that PS shall exist as a third-party library instead of a component of the
>>>>>>> core code base, isn’t it?
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> Best,
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> --
>>>>>>>
>>>>>>> Nan Zhu
>>>>>>>
>>>>>>> http://codingcat.me
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> On Thursday, November 12, 2015 at 9:49 AM, witgo@qq.com wrote:
>>>>>>>
>>>>>>> Who has ideas about machine learning? Spark is missing some features
>>>>>>> for machine learning, for example, a parameter server.
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> On Nov 12, 2015, at 05:32, Matei Zaharia <ma...@gmail.com> wrote:
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> I like the idea of popping out Tachyon to an optional component too
>>>>>>> to reduce the number of dependencies. In the future, it might even be
>>>>>>> useful to do this for Hadoop, but it requires too many API changes to be
>>>>>>> worth doing now.
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> Regarding Scala 2.12, we should definitely support it eventually,
>>>>>>> but I don't think we need to block 2.0 on that because it can be added
>>>>>>> later too. Has anyone investigated what it would take to run on there? I
>>>>>>> imagine we don't need many code changes, just maybe some REPL stuff.
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> Needless to say, but I'm all for the idea of making "major" releases
>>>>>>> as undisruptive as possible in the model Reynold proposed. Keeping everyone
>>>>>>> working with the same set of releases is super important.
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> Matei
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> On Nov 11, 2015, at 4:58 AM, Sean Owen <so...@cloudera.com> wrote:
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> On Wed, Nov 11, 2015 at 12:10 AM, Reynold Xin <rx...@databricks.com>
>>>>>>> wrote:
>>>>>>>
>>>>>>> to the Spark community. A major release should not be very different
>>>>>>> from a
>>>>>>>
>>>>>>> minor release and should not be gated based on new features. The main
>>>>>>>
>>>>>>> purpose of a major release is an opportunity to fix things that are
>>>>>>> broken
>>>>>>>
>>>>>>> in the current API and remove certain deprecated APIs (examples
>>>>>>> follow).
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> Agree with this stance. Generally, a major release might also be a
>>>>>>>
>>>>>>> time to replace some big old API or implementation with a new one,
>>>>>>> but
>>>>>>>
>>>>>>> I don't see obvious candidates.
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> I wouldn't mind turning attention to 2.x sooner than later, unless
>>>>>>>
>>>>>>> there's a fairly good reason to continue adding features in 1.x to a
>>>>>>>
>>>>>>> 1.7 release. The scope as of 1.6 is already pretty darned big.
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> 1. Scala 2.11 as the default build. We should still support Scala
>>>>>>> 2.10, but
>>>>>>>
>>>>>>> it has been end-of-life.
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> By the time 2.x rolls around, 2.12 will be the main version, 2.11
>>>>>>> will
>>>>>>>
>>>>>>> be quite stable, and 2.10 will have been EOL for a while. I'd propose
>>>>>>>
>>>>>>> dropping 2.10. Otherwise it's supported for 2 more years.
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> 2. Remove Hadoop 1 support.
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> I'd go further to drop support for <2.2 for sure (2.0 and 2.1 were
>>>>>>>
>>>>>>> sort of 'alpha' and 'beta' releases) and even <2.6.
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> I'm sure we'll think of a number of other small things -- shading a
>>>>>>>
>>>>>>> bunch of stuff? reviewing and updating dependencies in light of
>>>>>>>
>>>>>>> simpler, more recent dependencies to support from Hadoop etc?
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> Farming out Tachyon to a module? (I felt like someone proposed this?)
>>>>>>>
>>>>>>> Pop out any Docker stuff to another repo?
>>>>>>>
>>>>>>> Continue that same effort for EC2?
>>>>>>>
>>>>>>> Farming out some of the "external" integrations to another repo (?
>>>>>>>
>>>>>>> controversial)
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> See also anything marked version "2+" in JIRA.
>>>>>>>
>>>>>>>
>>>>>>>

Re: A proposal for Spark 2.0

Posted by Matei Zaharia <ma...@gmail.com>.
What are the other breaking changes in 2.0 though? Note that we're not removing Scala 2.10, we're just making the default build be against Scala 2.11 instead of 2.10. There seem to be very few changes that people would worry about. If people are going to update their apps, I think it's better to make the other small changes in 2.0 at the same time than to update once for Dataset and another time for 2.0.

BTW just refer to Reynold's original post for the other proposed API changes.

Matei



Re: A proposal for Spark 2.0

Posted by Sandy Ryza <sa...@cloudera.com>.
I think that Kostas' logic still holds.  The majority of Spark users, and
likely an even vaster majority of people running vaster jobs, are still on
RDDs and on the cusp of upgrading to DataFrames.  Users will probably want
to upgrade to the stable version of the Dataset / DataFrame API so they
don't need to do so twice.  Requiring that they absorb all the other ways
that Spark breaks compatibility in the move to 2.0 makes it much more
difficult for them to make this transition.

Using the same set of APIs also means that it will be easier to backport
critical fixes to the 1.x line.

It's not clear to me that avoiding breakage of an experimental API in the
1.x line outweighs these issues.

-Sandy


Re: A proposal for Spark 2.0

Posted by Reynold Xin <rx...@databricks.com>.
I actually think the next one (after 1.6) should be Spark 2.0. The reason
is that I already know we have to break some part of the DataFrame/Dataset
API as part of the Dataset design. (e.g. DataFrame.map should return
Dataset rather than RDD). In that case, I'd rather break this sooner (in
one release) than later (in two releases), so the damage is smaller.

I don't think whether we call Dataset/DataFrame experimental or not matters
too much for 2.0. We can still call Dataset experimental in 2.0 and then
mark them as stable in 2.1. Despite being "experimental", there have been no
breaking changes to DataFrame from 1.3 to 1.6.
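To make the change being referred to a bit more concrete, here is a small
sketch against the 1.6 API, assuming a spark-shell-style sqlContext; the 2.0
side is only described in comments because those signatures were still being
designed at the time of this thread:

    // Spark 1.6 behaviour: map on a DataFrame falls out of the optimized plan
    // and returns a plain RDD.
    val df = sqlContext.range(0, 10)
    val doubled: org.apache.spark.rdd.RDD[Long] = df.map(_.getLong(0) * 2)

    // Proposed 2.0 behaviour: the same call would instead return a typed
    // Dataset (e.g. Dataset[Long]), so later operations keep the benefits of
    // Catalyst and Tungsten rather than dropping back to RDDs.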




Re: A proposal for Spark 2.0

Posted by Mark Hamstra <ma...@clearstorydata.com>.
Ah, got it; by "stabilize" you meant changing the API, not just bug
fixing.  We're on the same page now.

On Wed, Nov 18, 2015 at 3:39 PM, Kostas Sakellis <ko...@cloudera.com>
wrote:

> A 1.6.x release will only fix bugs - we typically don't change APIs in z
> releases. The Dataset API is experimental and so we might be changing the
> APIs before we declare it stable. This is why I think it is important to
> first stabilize the Dataset API with a Spark 1.7 release before moving to
> Spark 2.0. This will benefit users that would like to use the new Dataset
> APIs but can't move to Spark 2.0 because of the backwards incompatible
> changes, like removal of deprecated APIs, Scala 2.11 etc.
>
> Kostas
>
>
> On Fri, Nov 13, 2015 at 12:26 PM, Mark Hamstra <ma...@clearstorydata.com>
> wrote:
>
>> Why does stabilization of those two features require a 1.7 release
>> instead of 1.6.1?
>>
>> On Fri, Nov 13, 2015 at 11:40 AM, Kostas Sakellis <ko...@cloudera.com>
>> wrote:
>>
>>> We have veered off the topic of Spark 2.0 a little bit here - yes we can
>>> talk about RDD vs. DS/DF more but let's refocus on Spark 2.0. I'd like to
>>> propose we have one more 1.x release after Spark 1.6. This will allow us to
>>> stabilize a few of the new features that were added in 1.6:
>>>
>>> 1) the experimental Datasets API
>>> 2) the new unified memory manager.
>>>
>>> I understand our goal for Spark 2.0 is to offer an easy transition but
>>> there will be users that won't be able to seamlessly upgrade given what we
>>> have discussed as in scope for 2.0. For these users, having a 1.x release
>>> with these new features/APIs stabilized will be very beneficial. This might
>>> make Spark 1.7 a lighter release but that is not necessarily a bad thing.
>>>
>>> Any thoughts on this timeline?
>>>
>>> Kostas Sakellis
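For context on point 2 in the message above, the unified memory manager that
shipped in 1.6 is governed by the settings sketched below; the values shown are
just the documented 1.6 defaults, repeated here for illustration:

    import org.apache.spark.SparkConf

    val conf = new SparkConf()
      // Fraction of the heap shared by execution and storage under the
      // unified manager.
      .set("spark.memory.fraction", "0.75")
      // Portion of that region whose cached blocks are protected from
      // eviction by execution.
      .set("spark.memory.storageFraction", "0.5")
      // Setting this to "true" falls back to the pre-1.6 static manager.
      .set("spark.memory.useLegacyMode", "false")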
>>>
>>>
>>>
>>> On Thu, Nov 12, 2015 at 8:39 PM, Cheng, Hao <ha...@intel.com> wrote:
>>>
>>>> Agree, more features/apis/optimization need to be added in DF/DS.
>>>>
>>>>
>>>>
>>>> I mean, we need to think about what kind of RDD APIs we have to provide
>>>> to developers; maybe the fundamental APIs are enough, like the ShuffledRDD
>>>> etc. But PairRDDFunctions is probably not in this category, as we can do the
>>>> same thing easily with DF/DS, with even better performance.
>>>>
>>>>
>>>>
>>>> *From:* Mark Hamstra [mailto:mark@clearstorydata.com]
>>>> *Sent:* Friday, November 13, 2015 11:23 AM
>>>> *To:* Stephen Boesch
>>>>
>>>> *Cc:* dev@spark.apache.org
>>>> *Subject:* Re: A proposal for Spark 2.0
>>>>
>>>>
>>>>
>>>> Hmmm... to me, that seems like precisely the kind of thing that argues
>>>> for retaining the RDD API but not as the first thing presented to new Spark
>>>> developers: "Here's how to use groupBy with DataFrames.... Until the
>>>> optimizer is more fully developed, that won't always get you the best
>>>> performance that can be obtained.  In these particular circumstances, ...,
>>>> you may want to use the low-level RDD API while setting
>>>> preservesPartitioning to true.  Like this...."
>>>>
>>>>
>>>>
>>>> On Thu, Nov 12, 2015 at 7:05 PM, Stephen Boesch <ja...@gmail.com>
>>>> wrote:
>>>>
>>>> My understanding is that the RDDs presently have more support for
>>>> complete control of partitioning, which is a key consideration at scale.
>>>> While partitioning control is still piecemeal in DF/DS, it would seem
>>>> premature to make RDDs a second-tier approach to Spark development.
>>>>
>>>>
>>>>
>>>> An example is the use of groupBy when we know that the source relation
>>>> (/RDD) is already partitioned on the grouping expressions. AFAIK Spark
>>>> SQL still does not allow that knowledge to be applied to the optimizer, so
>>>> a full shuffle will be performed. However, with the native RDD API we can
>>>> use preservesPartitioning=true.
>>>>
>>>>
>>>>
>>>> 2015-11-12 17:42 GMT-08:00 Mark Hamstra <ma...@clearstorydata.com>:
>>>>
>>>> The place of the RDD API in 2.0 is also something I've been wondering
>>>> about.  I think it may be going too far to deprecate it, but changing
>>>> emphasis is something that we might consider.  The RDD API came well before
>>>> DataFrames and DataSets, so programming guides, introductory how-to
>>>> articles and the like have, to this point, also tended to emphasize RDDs --
>>>> or at least to deal with them early.  What I'm thinking is that with 2.0
>>>> maybe we should overhaul all the documentation to de-emphasize and
>>>> reposition RDDs.  In this scheme, DataFrames and DataSets would be
>>>> introduced and fully addressed before RDDs.  They would be presented as the
>>>> normal/default/standard way to do things in Spark.  RDDs, in contrast,
>>>> would be presented later as a kind of lower-level, closer-to-the-metal API
>>>> that can be used in atypical, more specialized contexts where DataFrames or
>>>> DataSets don't fully fit.
>>>>
>>>>
>>>>
>>>> On Thu, Nov 12, 2015 at 5:17 PM, Cheng, Hao <ha...@intel.com>
>>>> wrote:
>>>>
>>>> I am not sure what the best practice is for this specific problem, but
>>>> it’s really worth thinking about in 2.0, as it is a painful issue for
>>>> lots of users.
>>>>
>>>>
>>>>
>>>> By the way, is it also an opportunity to deprecate the RDD API (or
>>>> internal API only?)? Lots of its functionality overlaps with
>>>> DataFrame or DataSet.
>>>>
>>>>
>>>>
>>>> Hao
>>>>
>>>>
>>>>
>>>> *From:* Kostas Sakellis [mailto:kostas@cloudera.com]
>>>> *Sent:* Friday, November 13, 2015 5:27 AM
>>>> *To:* Nicholas Chammas
>>>> *Cc:* Ulanov, Alexander; Nan Zhu; witgo@qq.com; dev@spark.apache.org;
>>>> Reynold Xin
>>>>
>>>>
>>>> *Subject:* Re: A proposal for Spark 2.0
>>>>
>>>>
>>>>
>>>> I know we want to keep breaking changes to a minimum but I'm hoping
>>>> that with Spark 2.0 we can also look at better classpath isolation with
>>>> user programs. I propose we build on
>>>> spark.{driver|executor}.userClassPathFirst, setting it true by default, and
>>>> not allow any spark transitive dependencies to leak into user code. For
>>>> backwards compatibility we can have a whitelist if we want but it'd be good
>>>> if we start requiring user apps to explicitly pull in all their
>>>> dependencies. From what I can tell, Hadoop 3 is also moving in this
>>>> direction.
>>>>
>>>>
>>>>
>>>> Kostas
>>>>
>>>>
>>>>
>>>> On Thu, Nov 12, 2015 at 9:56 AM, Nicholas Chammas <
>>>> nicholas.chammas@gmail.com> wrote:
>>>>
>>>> With regards to Machine learning, it would be great to move useful
>>>> features from MLlib to ML and deprecate the former. Current structure of
>>>> two separate machine learning packages seems to be somewhat confusing.
>>>>
>>>> With regards to GraphX, it would be great to deprecate the use of RDD
>>>> in GraphX and switch to Dataframe. This will allow GraphX evolve with
>>>> Tungsten.
>>>>
>>>> On that note of deprecating stuff, it might be good to deprecate some
>>>> things in 2.0 without removing or replacing them immediately. That way 2.0
>>>> doesn’t have to wait for everything that we want to deprecate to be
>>>> replaced all at once.
>>>>
>>>> Nick
>>>>
>>>>
>>>>
>>>>
>>>>
>>>> On Thu, Nov 12, 2015 at 12:45 PM Ulanov, Alexander <
>>>> alexander.ulanov@hpe.com> wrote:
>>>>
>>>> Parameter Server is a new feature and thus does not match the goal of
>>>> 2.0 is “to fix things that are broken in the current API and remove certain
>>>> deprecated APIs”. At the same time I would be happy to have that feature.
>>>>
>>>>
>>>>
>>>> With regards to Machine learning, it would be great to move useful
>>>> features from MLlib to ML and deprecate the former. Current structure of
>>>> two separate machine learning packages seems to be somewhat confusing.
>>>>
>>>> With regards to GraphX, it would be great to deprecate the use of RDD
>>>> in GraphX and switch to Dataframe. This will allow GraphX evolve with
>>>> Tungsten.
>>>>
>>>>
>>>>
>>>> Best regards, Alexander
>>>>
>>>>
>>>>
>>>> *From:* Nan Zhu [mailto:zhunanmcgill@gmail.com]
>>>> *Sent:* Thursday, November 12, 2015 7:28 AM
>>>> *To:* witgo@qq.com
>>>> *Cc:* dev@spark.apache.org
>>>> *Subject:* Re: A proposal for Spark 2.0
>>>>
>>>>
>>>>
>>>> Being specific to Parameter Server, I think the current agreement is
>>>> that PS shall exist as a third-party library instead of a component of the
>>>> core code base, isn’t?
>>>>
>>>>
>>>>
>>>> Best,
>>>>
>>>>
>>>>
>>>> --
>>>>
>>>> Nan Zhu
>>>>
>>>> http://codingcat.me
>>>>
>>>>
>>>>
>>>> On Thursday, November 12, 2015 at 9:49 AM, witgo@qq.com wrote:
>>>>
>>>> Who has the idea of machine learning? Spark missing some features for
>>>> machine learning, For example, the parameter server.
>>>>
>>>>
>>>>
>>>>
>>>>
>>>> On Nov 12, 2015, at 05:32, Matei Zaharia <ma...@gmail.com> wrote:
>>>>
>>>>
>>>>
>>>> I like the idea of popping out Tachyon to an optional component too to
>>>> reduce the number of dependencies. In the future, it might even be useful
>>>> to do this for Hadoop, but it requires too many API changes to be worth
>>>> doing now.
>>>>
>>>>
>>>>
>>>> Regarding Scala 2.12, we should definitely support it eventually, but I
>>>> don't think we need to block 2.0 on that because it can be added later too.
>>>> Has anyone investigated what it would take to run on there? I imagine we
>>>> don't need many code changes, just maybe some REPL stuff.
>>>>
>>>>
>>>>
>>>> Needless to say, but I'm all for the idea of making "major" releases as
>>>> undisruptive as possible in the model Reynold proposed. Keeping everyone
>>>> working with the same set of releases is super important.
>>>>
>>>>
>>>>
>>>> Matei
>>>>
>>>>
>>>>
>>>> On Nov 11, 2015, at 4:58 AM, Sean Owen <so...@cloudera.com> wrote:
>>>>
>>>>
>>>>
>>>> On Wed, Nov 11, 2015 at 12:10 AM, Reynold Xin <rx...@databricks.com>
>>>> wrote:
>>>>
>>>> to the Spark community. A major release should not be very different
>>>> from a
>>>>
>>>> minor release and should not be gated based on new features. The main
>>>>
>>>> purpose of a major release is an opportunity to fix things that are
>>>> broken
>>>>
>>>> in the current API and remove certain deprecated APIs (examples follow).
>>>>
>>>>
>>>>
>>>> Agree with this stance. Generally, a major release might also be a
>>>>
>>>> time to replace some big old API or implementation with a new one, but
>>>>
>>>> I don't see obvious candidates.
>>>>
>>>>
>>>>
>>>> I wouldn't mind turning attention to 2.x sooner than later, unless
>>>>
>>>> there's a fairly good reason to continue adding features in 1.x to a
>>>>
>>>> 1.7 release. The scope as of 1.6 is already pretty darned big.
>>>>
>>>>
>>>>
>>>>
>>>>
>>>> 1. Scala 2.11 as the default build. We should still support Scala 2.10,
>>>> but
>>>>
>>>> it has been end-of-life.
>>>>
>>>>
>>>>
>>>> By the time 2.x rolls around, 2.12 will be the main version, 2.11 will
>>>>
>>>> be quite stable, and 2.10 will have been EOL for a while. I'd propose
>>>>
>>>> dropping 2.10. Otherwise it's supported for 2 more years.
>>>>
>>>>
>>>>
>>>>
>>>>
>>>> 2. Remove Hadoop 1 support.
>>>>
>>>>
>>>>
>>>> I'd go further to drop support for <2.2 for sure (2.0 and 2.1 were
>>>>
>>>> sort of 'alpha' and 'beta' releases) and even <2.6.
>>>>
>>>>
>>>>
>>>> I'm sure we'll think of a number of other small things -- shading a
>>>>
>>>> bunch of stuff? reviewing and updating dependencies in light of
>>>>
>>>> simpler, more recent dependencies to support from Hadoop etc?
>>>>
>>>>
>>>>
>>>> Farming out Tachyon to a module? (I felt like someone proposed this?)
>>>>
>>>> Pop out any Docker stuff to another repo?
>>>>
>>>> Continue that same effort for EC2?
>>>>
>>>> Farming out some of the "external" integrations to another repo (?
>>>>
>>>> controversial)
>>>>
>>>>
>>>>
>>>> See also anything marked version "2+" in JIRA.
>>>>
>>>>
>>>>
>>>> ---------------------------------------------------------------------
>>>>
>>>> To unsubscribe, e-mail: dev-unsubscribe@spark.apache.org
>>>>
>>>> For additional commands, e-mail: dev-help@spark.apache.org
>>>>
>>>>
>>>>
>>>>
>>>>
>>>> ---------------------------------------------------------------------
>>>>
>>>> To unsubscribe, e-mail: dev-unsubscribe@spark.apache.org
>>>>
>>>> For additional commands, e-mail: dev-help@spark.apache.org
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>
>>>> ---------------------------------------------------------------------
>>>>
>>>> To unsubscribe, e-mail: dev-unsubscribe@spark.apache.org
>>>>
>>>> For additional commands, e-mail: dev-help@spark.apache.org
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>
>>>
>>>
>>
>

Re: A proposal for Spark 2.0

Posted by Kostas Sakellis <ko...@cloudera.com>.
A 1.6.x release will only fix bugs - we typically don't change APIs in z
releases. The Dataset API is experimental, so we might still be changing the
APIs before we declare it stable. This is why I think it is important to
first stabilize the Dataset API with a Spark 1.7 release before moving to
Spark 2.0. This will benefit users who would like to use the new Dataset
APIs but can't move to Spark 2.0 because of the backwards-incompatible
changes, like the removal of deprecated APIs, the switch to Scala 2.11, etc.
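
For context, here is a minimal sketch of what the experimental Dataset API
looks like on the 1.6 branch as I understand it (the Person case class and
the data are made up for illustration, and the exact signatures could still
change - which is exactly the point):

  import org.apache.spark.{SparkConf, SparkContext}
  import org.apache.spark.sql.SQLContext

  case class Person(name: String, age: Int)

  object DatasetSketch {
    def main(args: Array[String]): Unit = {
      val sc = new SparkContext(new SparkConf().setAppName("dataset-sketch"))
      val sqlContext = new SQLContext(sc)
      import sqlContext.implicits._

      // Typed Dataset built from a local collection (experimental in 1.6).
      val people = Seq(Person("alice", 29), Person("bob", 17)).toDS()

      // Typed transformations; the lambdas are checked at compile time.
      val adultNames = people.filter(_.age >= 21).map(_.name)
      adultNames.collect().foreach(println)

      sc.stop()
    }
  }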

Kostas


On Fri, Nov 13, 2015 at 12:26 PM, Mark Hamstra <ma...@clearstorydata.com>
wrote:

> Why does stabilization of those two features require a 1.7 release instead
> of 1.6.1?
>
> On Fri, Nov 13, 2015 at 11:40 AM, Kostas Sakellis <ko...@cloudera.com>
> wrote:
>
>> We have veered off the topic of Spark 2.0 a little bit here - yes we can
>> talk about RDD vs. DS/DF more but lets refocus on Spark 2.0. I'd like to
>> propose we have one more 1.x release after Spark 1.6. This will allow us to
>> stabilize a few of the new features that were added in 1.6:
>>
>> 1) the experimental Datasets API
>> 2) the new unified memory manager.
>>
>> I understand our goal for Spark 2.0 is to offer an easy transition but
>> there will be users that won't be able to seamlessly upgrade given what we
>> have discussed as in scope for 2.0. For these users, having a 1.x release
>> with these new features/APIs stabilized will be very beneficial. This might
>> make Spark 1.7 a lighter release but that is not necessarily a bad thing.
>>
>> Any thoughts on this timeline?
>>
>> Kostas Sakellis
>>
>>
>>
>> On Thu, Nov 12, 2015 at 8:39 PM, Cheng, Hao <ha...@intel.com> wrote:
>>
>>> Agree, more features/apis/optimization need to be added in DF/DS.
>>>
>>>
>>>
>>> I mean, we need to think about what kind of RDD APIs we have to provide
>>> to developer, maybe the fundamental API is enough, like, the ShuffledRDD
>>> etc..  But PairRDDFunctions probably not in this category, as we can do the
>>> same thing easily with DF/DS, even better performance.
>>>
>>>
>>>
>>> *From:* Mark Hamstra [mailto:mark@clearstorydata.com]
>>> *Sent:* Friday, November 13, 2015 11:23 AM
>>> *To:* Stephen Boesch
>>>
>>> *Cc:* dev@spark.apache.org
>>> *Subject:* Re: A proposal for Spark 2.0
>>>
>>>
>>>
>>> Hmmm... to me, that seems like precisely the kind of thing that argues
>>> for retaining the RDD API but not as the first thing presented to new Spark
>>> developers: "Here's how to use groupBy with DataFrames.... Until the
>>> optimizer is more fully developed, that won't always get you the best
>>> performance that can be obtained.  In these particular circumstances, ...,
>>> you may want to use the low-level RDD API while setting
>>> preservesPartitioning to true.  Like this...."
>>>
>>>
>>>
>>> On Thu, Nov 12, 2015 at 7:05 PM, Stephen Boesch <ja...@gmail.com>
>>> wrote:
>>>
>>> My understanding is that  the RDD's presently have more support for
>>> complete control of partitioning which is a key consideration at scale.
>>> While partitioning control is still piecemeal in  DF/DS  it would seem
>>> premature to make RDD's a second-tier approach to spark dev.
>>>
>>>
>>>
>>> An example is the use of groupBy when we know that the source relation
>>> (/RDD) is already partitioned on the grouping expressions.  AFAIK the spark
>>> sql still does not allow that knowledge to be applied to the optimizer - so
>>> a full shuffle will be performed. However in the native RDD we can use
>>> preservesPartitioning=true.
>>>
>>>
>>>
>>> 2015-11-12 17:42 GMT-08:00 Mark Hamstra <ma...@clearstorydata.com>:
>>>
>>> The place of the RDD API in 2.0 is also something I've been wondering
>>> about.  I think it may be going too far to deprecate it, but changing
>>> emphasis is something that we might consider.  The RDD API came well before
>>> DataFrames and DataSets, so programming guides, introductory how-to
>>> articles and the like have, to this point, also tended to emphasize RDDs --
>>> or at least to deal with them early.  What I'm thinking is that with 2.0
>>> maybe we should overhaul all the documentation to de-emphasize and
>>> reposition RDDs.  In this scheme, DataFrames and DataSets would be
>>> introduced and fully addressed before RDDs.  They would be presented as the
>>> normal/default/standard way to do things in Spark.  RDDs, in contrast,
>>> would be presented later as a kind of lower-level, closer-to-the-metal API
>>> that can be used in atypical, more specialized contexts where DataFrames or
>>> DataSets don't fully fit.
>>>
>>>
>>>
>>> On Thu, Nov 12, 2015 at 5:17 PM, Cheng, Hao <ha...@intel.com> wrote:
>>>
>>> I am not sure what the best practice for this specific problem, but it’s
>>> really worth to think about it in 2.0, as it is a painful issue for lots of
>>> users.
>>>
>>>
>>>
>>> By the way, is it also an opportunity to deprecate the RDD API (or
>>> internal API only?)? As lots of its functionality overlapping with
>>> DataFrame or DataSet.
>>>
>>>
>>>
>>> Hao
>>>
>>>
>>>
>>> *From:* Kostas Sakellis [mailto:kostas@cloudera.com]
>>> *Sent:* Friday, November 13, 2015 5:27 AM
>>> *To:* Nicholas Chammas
>>> *Cc:* Ulanov, Alexander; Nan Zhu; witgo@qq.com; dev@spark.apache.org;
>>> Reynold Xin
>>>
>>>
>>> *Subject:* Re: A proposal for Spark 2.0
>>>
>>>
>>>
>>> I know we want to keep breaking changes to a minimum but I'm hoping that
>>> with Spark 2.0 we can also look at better classpath isolation with user
>>> programs. I propose we build on spark.{driver|executor}.userClassPathFirst,
>>> setting it true by default, and not allow any spark transitive dependencies
>>> to leak into user code. For backwards compatibility we can have a whitelist
>>> if we want but I'd be good if we start requiring user apps to explicitly
>>> pull in all their dependencies. From what I can tell, Hadoop 3 is also
>>> moving in this direction.
>>>
>>>
>>>
>>> Kostas
>>>
>>>
>>>
>>> On Thu, Nov 12, 2015 at 9:56 AM, Nicholas Chammas <
>>> nicholas.chammas@gmail.com> wrote:
>>>
>>> With regards to Machine learning, it would be great to move useful
>>> features from MLlib to ML and deprecate the former. Current structure of
>>> two separate machine learning packages seems to be somewhat confusing.
>>>
>>> With regards to GraphX, it would be great to deprecate the use of RDD in
>>> GraphX and switch to Dataframe. This will allow GraphX evolve with Tungsten.
>>>
>>> On that note of deprecating stuff, it might be good to deprecate some
>>> things in 2.0 without removing or replacing them immediately. That way 2.0
>>> doesn’t have to wait for everything that we want to deprecate to be
>>> replaced all at once.
>>>
>>> Nick
>>>
>>>
>>>
>>>
>>>
>>> On Thu, Nov 12, 2015 at 12:45 PM Ulanov, Alexander <
>>> alexander.ulanov@hpe.com> wrote:
>>>
>>> Parameter Server is a new feature and thus does not match the goal of
>>> 2.0 is “to fix things that are broken in the current API and remove certain
>>> deprecated APIs”. At the same time I would be happy to have that feature.
>>>
>>>
>>>
>>> With regards to Machine learning, it would be great to move useful
>>> features from MLlib to ML and deprecate the former. Current structure of
>>> two separate machine learning packages seems to be somewhat confusing.
>>>
>>> With regards to GraphX, it would be great to deprecate the use of RDD in
>>> GraphX and switch to Dataframe. This will allow GraphX evolve with Tungsten.
>>>
>>>
>>>
>>> Best regards, Alexander
>>>
>>>
>>>
>>> *From:* Nan Zhu [mailto:zhunanmcgill@gmail.com]
>>> *Sent:* Thursday, November 12, 2015 7:28 AM
>>> *To:* witgo@qq.com
>>> *Cc:* dev@spark.apache.org
>>> *Subject:* Re: A proposal for Spark 2.0
>>>
>>>
>>>
>>> Being specific to Parameter Server, I think the current agreement is
>>> that PS shall exist as a third-party library instead of a component of the
>>> core code base, isn’t?
>>>
>>>
>>>
>>> Best,
>>>
>>>
>>>
>>> --
>>>
>>> Nan Zhu
>>>
>>> http://codingcat.me
>>>
>>>
>>>
>>> On Thursday, November 12, 2015 at 9:49 AM, witgo@qq.com wrote:
>>>
>>> Who has the idea of machine learning? Spark missing some features for
>>> machine learning, For example, the parameter server.
>>>
>>>
>>>
>>>
>>>
>>> On Nov 12, 2015, at 05:32, Matei Zaharia <ma...@gmail.com> wrote:
>>>
>>>
>>>
>>> I like the idea of popping out Tachyon to an optional component too to
>>> reduce the number of dependencies. In the future, it might even be useful
>>> to do this for Hadoop, but it requires too many API changes to be worth
>>> doing now.
>>>
>>>
>>>
>>> Regarding Scala 2.12, we should definitely support it eventually, but I
>>> don't think we need to block 2.0 on that because it can be added later too.
>>> Has anyone investigated what it would take to run on there? I imagine we
>>> don't need many code changes, just maybe some REPL stuff.
>>>
>>>
>>>
>>> Needless to say, but I'm all for the idea of making "major" releases as
>>> undisruptive as possible in the model Reynold proposed. Keeping everyone
>>> working with the same set of releases is super important.
>>>
>>>
>>>
>>> Matei
>>>
>>>
>>>
>>> On Nov 11, 2015, at 4:58 AM, Sean Owen <so...@cloudera.com> wrote:
>>>
>>>
>>>
>>> On Wed, Nov 11, 2015 at 12:10 AM, Reynold Xin <rx...@databricks.com>
>>> wrote:
>>>
>>> to the Spark community. A major release should not be very different
>>> from a
>>>
>>> minor release and should not be gated based on new features. The main
>>>
>>> purpose of a major release is an opportunity to fix things that are
>>> broken
>>>
>>> in the current API and remove certain deprecated APIs (examples follow).
>>>
>>>
>>>
>>> Agree with this stance. Generally, a major release might also be a
>>>
>>> time to replace some big old API or implementation with a new one, but
>>>
>>> I don't see obvious candidates.
>>>
>>>
>>>
>>> I wouldn't mind turning attention to 2.x sooner than later, unless
>>>
>>> there's a fairly good reason to continue adding features in 1.x to a
>>>
>>> 1.7 release. The scope as of 1.6 is already pretty darned big.
>>>
>>>
>>>
>>>
>>>
>>> 1. Scala 2.11 as the default build. We should still support Scala 2.10,
>>> but
>>>
>>> it has been end-of-life.
>>>
>>>
>>>
>>> By the time 2.x rolls around, 2.12 will be the main version, 2.11 will
>>>
>>> be quite stable, and 2.10 will have been EOL for a while. I'd propose
>>>
>>> dropping 2.10. Otherwise it's supported for 2 more years.
>>>
>>>
>>>
>>>
>>>
>>> 2. Remove Hadoop 1 support.
>>>
>>>
>>>
>>> I'd go further to drop support for <2.2 for sure (2.0 and 2.1 were
>>>
>>> sort of 'alpha' and 'beta' releases) and even <2.6.
>>>
>>>
>>>
>>> I'm sure we'll think of a number of other small things -- shading a
>>>
>>> bunch of stuff? reviewing and updating dependencies in light of
>>>
>>> simpler, more recent dependencies to support from Hadoop etc?
>>>
>>>
>>>
>>> Farming out Tachyon to a module? (I felt like someone proposed this?)
>>>
>>> Pop out any Docker stuff to another repo?
>>>
>>> Continue that same effort for EC2?
>>>
>>> Farming out some of the "external" integrations to another repo (?
>>>
>>> controversial)
>>>
>>>
>>>
>>> See also anything marked version "2+" in JIRA.
>>>
>>>
>>>
>>> ---------------------------------------------------------------------
>>>
>>> To unsubscribe, e-mail: dev-unsubscribe@spark.apache.org
>>>
>>> For additional commands, e-mail: dev-help@spark.apache.org
>>>
>>>
>>>
>>>
>>>
>>> ---------------------------------------------------------------------
>>>
>>> To unsubscribe, e-mail: dev-unsubscribe@spark.apache.org
>>>
>>> For additional commands, e-mail: dev-help@spark.apache.org
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>> ---------------------------------------------------------------------
>>>
>>> To unsubscribe, e-mail: dev-unsubscribe@spark.apache.org
>>>
>>> For additional commands, e-mail: dev-help@spark.apache.org
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>
>>
>

Re: A proposal for Spark 2.0

Posted by Mark Hamstra <ma...@clearstorydata.com>.
Why does stabilization of those two features require a 1.7 release instead
of 1.6.1?

On Fri, Nov 13, 2015 at 11:40 AM, Kostas Sakellis <ko...@cloudera.com>
wrote:

> We have veered off the topic of Spark 2.0 a little bit here - yes we can
> talk about RDD vs. DS/DF more but lets refocus on Spark 2.0. I'd like to
> propose we have one more 1.x release after Spark 1.6. This will allow us to
> stabilize a few of the new features that were added in 1.6:
>
> 1) the experimental Datasets API
> 2) the new unified memory manager.
>
> I understand our goal for Spark 2.0 is to offer an easy transition but
> there will be users that won't be able to seamlessly upgrade given what we
> have discussed as in scope for 2.0. For these users, having a 1.x release
> with these new features/APIs stabilized will be very beneficial. This might
> make Spark 1.7 a lighter release but that is not necessarily a bad thing.
>
> Any thoughts on this timeline?
>
> Kostas Sakellis
>
>
>
> On Thu, Nov 12, 2015 at 8:39 PM, Cheng, Hao <ha...@intel.com> wrote:
>
>> Agree, more features/apis/optimization need to be added in DF/DS.
>>
>>
>>
>> I mean, we need to think about what kind of RDD APIs we have to provide
>> to developer, maybe the fundamental API is enough, like, the ShuffledRDD
>> etc..  But PairRDDFunctions probably not in this category, as we can do the
>> same thing easily with DF/DS, even better performance.
>>
>>
>>
>> *From:* Mark Hamstra [mailto:mark@clearstorydata.com]
>> *Sent:* Friday, November 13, 2015 11:23 AM
>> *To:* Stephen Boesch
>>
>> *Cc:* dev@spark.apache.org
>> *Subject:* Re: A proposal for Spark 2.0
>>
>>
>>
>> Hmmm... to me, that seems like precisely the kind of thing that argues
>> for retaining the RDD API but not as the first thing presented to new Spark
>> developers: "Here's how to use groupBy with DataFrames.... Until the
>> optimizer is more fully developed, that won't always get you the best
>> performance that can be obtained.  In these particular circumstances, ...,
>> you may want to use the low-level RDD API while setting
>> preservesPartitioning to true.  Like this...."
>>
>>
>>
>> On Thu, Nov 12, 2015 at 7:05 PM, Stephen Boesch <ja...@gmail.com>
>> wrote:
>>
>> My understanding is that  the RDD's presently have more support for
>> complete control of partitioning which is a key consideration at scale.
>> While partitioning control is still piecemeal in  DF/DS  it would seem
>> premature to make RDD's a second-tier approach to spark dev.
>>
>>
>>
>> An example is the use of groupBy when we know that the source relation
>> (/RDD) is already partitioned on the grouping expressions.  AFAIK the spark
>> sql still does not allow that knowledge to be applied to the optimizer - so
>> a full shuffle will be performed. However in the native RDD we can use
>> preservesPartitioning=true.
>>
>>
>>
>> 2015-11-12 17:42 GMT-08:00 Mark Hamstra <ma...@clearstorydata.com>:
>>
>> The place of the RDD API in 2.0 is also something I've been wondering
>> about.  I think it may be going too far to deprecate it, but changing
>> emphasis is something that we might consider.  The RDD API came well before
>> DataFrames and DataSets, so programming guides, introductory how-to
>> articles and the like have, to this point, also tended to emphasize RDDs --
>> or at least to deal with them early.  What I'm thinking is that with 2.0
>> maybe we should overhaul all the documentation to de-emphasize and
>> reposition RDDs.  In this scheme, DataFrames and DataSets would be
>> introduced and fully addressed before RDDs.  They would be presented as the
>> normal/default/standard way to do things in Spark.  RDDs, in contrast,
>> would be presented later as a kind of lower-level, closer-to-the-metal API
>> that can be used in atypical, more specialized contexts where DataFrames or
>> DataSets don't fully fit.
>>
>>
>>
>> On Thu, Nov 12, 2015 at 5:17 PM, Cheng, Hao <ha...@intel.com> wrote:
>>
>> I am not sure what the best practice for this specific problem, but it’s
>> really worth to think about it in 2.0, as it is a painful issue for lots of
>> users.
>>
>>
>>
>> By the way, is it also an opportunity to deprecate the RDD API (or
>> internal API only?)? As lots of its functionality overlapping with
>> DataFrame or DataSet.
>>
>>
>>
>> Hao
>>
>>
>>
>> *From:* Kostas Sakellis [mailto:kostas@cloudera.com]
>> *Sent:* Friday, November 13, 2015 5:27 AM
>> *To:* Nicholas Chammas
>> *Cc:* Ulanov, Alexander; Nan Zhu; witgo@qq.com; dev@spark.apache.org;
>> Reynold Xin
>>
>>
>> *Subject:* Re: A proposal for Spark 2.0
>>
>>
>>
>> I know we want to keep breaking changes to a minimum but I'm hoping that
>> with Spark 2.0 we can also look at better classpath isolation with user
>> programs. I propose we build on spark.{driver|executor}.userClassPathFirst,
>> setting it true by default, and not allow any spark transitive dependencies
>> to leak into user code. For backwards compatibility we can have a whitelist
>> if we want but I'd be good if we start requiring user apps to explicitly
>> pull in all their dependencies. From what I can tell, Hadoop 3 is also
>> moving in this direction.
>>
>>
>>
>> Kostas
>>
>>
>>
>> On Thu, Nov 12, 2015 at 9:56 AM, Nicholas Chammas <
>> nicholas.chammas@gmail.com> wrote:
>>
>> With regards to Machine learning, it would be great to move useful
>> features from MLlib to ML and deprecate the former. Current structure of
>> two separate machine learning packages seems to be somewhat confusing.
>>
>> With regards to GraphX, it would be great to deprecate the use of RDD in
>> GraphX and switch to Dataframe. This will allow GraphX evolve with Tungsten.
>>
>> On that note of deprecating stuff, it might be good to deprecate some
>> things in 2.0 without removing or replacing them immediately. That way 2.0
>> doesn’t have to wait for everything that we want to deprecate to be
>> replaced all at once.
>>
>> Nick
>>
>>
>>
>>
>>
>> On Thu, Nov 12, 2015 at 12:45 PM Ulanov, Alexander <
>> alexander.ulanov@hpe.com> wrote:
>>
>> Parameter Server is a new feature and thus does not match the goal of 2.0
>> is “to fix things that are broken in the current API and remove certain
>> deprecated APIs”. At the same time I would be happy to have that feature.
>>
>>
>>
>> With regards to Machine learning, it would be great to move useful
>> features from MLlib to ML and deprecate the former. Current structure of
>> two separate machine learning packages seems to be somewhat confusing.
>>
>> With regards to GraphX, it would be great to deprecate the use of RDD in
>> GraphX and switch to Dataframe. This will allow GraphX evolve with Tungsten.
>>
>>
>>
>> Best regards, Alexander
>>
>>
>>
>> *From:* Nan Zhu [mailto:zhunanmcgill@gmail.com]
>> *Sent:* Thursday, November 12, 2015 7:28 AM
>> *To:* witgo@qq.com
>> *Cc:* dev@spark.apache.org
>> *Subject:* Re: A proposal for Spark 2.0
>>
>>
>>
>> Being specific to Parameter Server, I think the current agreement is that
>> PS shall exist as a third-party library instead of a component of the core
>> code base, isn’t?
>>
>>
>>
>> Best,
>>
>>
>>
>> --
>>
>> Nan Zhu
>>
>> http://codingcat.me
>>
>>
>>
>> On Thursday, November 12, 2015 at 9:49 AM, witgo@qq.com wrote:
>>
>> Who has the idea of machine learning? Spark missing some features for
>> machine learning, For example, the parameter server.
>>
>>
>>
>>
>>
>> On Nov 12, 2015, at 05:32, Matei Zaharia <ma...@gmail.com> wrote:
>>
>>
>>
>> I like the idea of popping out Tachyon to an optional component too to
>> reduce the number of dependencies. In the future, it might even be useful
>> to do this for Hadoop, but it requires too many API changes to be worth
>> doing now.
>>
>>
>>
>> Regarding Scala 2.12, we should definitely support it eventually, but I
>> don't think we need to block 2.0 on that because it can be added later too.
>> Has anyone investigated what it would take to run on there? I imagine we
>> don't need many code changes, just maybe some REPL stuff.
>>
>>
>>
>> Needless to say, but I'm all for the idea of making "major" releases as
>> undisruptive as possible in the model Reynold proposed. Keeping everyone
>> working with the same set of releases is super important.
>>
>>
>>
>> Matei
>>
>>
>>
>> On Nov 11, 2015, at 4:58 AM, Sean Owen <so...@cloudera.com> wrote:
>>
>>
>>
>> On Wed, Nov 11, 2015 at 12:10 AM, Reynold Xin <rx...@databricks.com>
>> wrote:
>>
>> to the Spark community. A major release should not be very different from
>> a
>>
>> minor release and should not be gated based on new features. The main
>>
>> purpose of a major release is an opportunity to fix things that are broken
>>
>> in the current API and remove certain deprecated APIs (examples follow).
>>
>>
>>
>> Agree with this stance. Generally, a major release might also be a
>>
>> time to replace some big old API or implementation with a new one, but
>>
>> I don't see obvious candidates.
>>
>>
>>
>> I wouldn't mind turning attention to 2.x sooner than later, unless
>>
>> there's a fairly good reason to continue adding features in 1.x to a
>>
>> 1.7 release. The scope as of 1.6 is already pretty darned big.
>>
>>
>>
>>
>>
>> 1. Scala 2.11 as the default build. We should still support Scala 2.10,
>> but
>>
>> it has been end-of-life.
>>
>>
>>
>> By the time 2.x rolls around, 2.12 will be the main version, 2.11 will
>>
>> be quite stable, and 2.10 will have been EOL for a while. I'd propose
>>
>> dropping 2.10. Otherwise it's supported for 2 more years.
>>
>>
>>
>>
>>
>> 2. Remove Hadoop 1 support.
>>
>>
>>
>> I'd go further to drop support for <2.2 for sure (2.0 and 2.1 were
>>
>> sort of 'alpha' and 'beta' releases) and even <2.6.
>>
>>
>>
>> I'm sure we'll think of a number of other small things -- shading a
>>
>> bunch of stuff? reviewing and updating dependencies in light of
>>
>> simpler, more recent dependencies to support from Hadoop etc?
>>
>>
>>
>> Farming out Tachyon to a module? (I felt like someone proposed this?)
>>
>> Pop out any Docker stuff to another repo?
>>
>> Continue that same effort for EC2?
>>
>> Farming out some of the "external" integrations to another repo (?
>>
>> controversial)
>>
>>
>>
>> See also anything marked version "2+" in JIRA.
>>
>>
>>
>> ---------------------------------------------------------------------
>>
>> To unsubscribe, e-mail: dev-unsubscribe@spark.apache.org
>>
>> For additional commands, e-mail: dev-help@spark.apache.org
>>
>>
>>
>>
>>
>> ---------------------------------------------------------------------
>>
>> To unsubscribe, e-mail: dev-unsubscribe@spark.apache.org
>>
>> For additional commands, e-mail: dev-help@spark.apache.org
>>
>>
>>
>>
>>
>>
>>
>>
>>
>> ---------------------------------------------------------------------
>>
>> To unsubscribe, e-mail: dev-unsubscribe@spark.apache.org
>>
>> For additional commands, e-mail: dev-help@spark.apache.org
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>
>

Re: A proposal for Spark 2.0

Posted by Kostas Sakellis <ko...@cloudera.com>.
We have veered off the topic of Spark 2.0 a little bit here - yes, we can
talk more about RDD vs. DS/DF, but let's refocus on Spark 2.0. I'd like to
propose we have one more 1.x release after Spark 1.6. This will allow us to
stabilize a few of the new features that were added in 1.6:

1) the experimental Datasets API
2) the new unified memory manager.
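
On (2), for anyone who hasn't dug into it yet, the unified model is mostly
driven by a couple of configs. A small sketch, with what I believe are the
1.6 defaults:

  import org.apache.spark.SparkConf

  // Spark 1.6's unified memory manager shares one pool between execution
  // and storage instead of the old fixed split.
  val conf = new SparkConf()
    .setAppName("unified-memory-sketch")
    .set("spark.memory.fraction", "0.75")       // heap share for the pool
    .set("spark.memory.storageFraction", "0.5") // part shielded for caching
  // To fall back to the pre-1.6 behavior while the new manager stabilizes:
  // conf.set("spark.memory.useLegacyMode", "true")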

I understand our goal for Spark 2.0 is to offer an easy transition, but
there will be users who won't be able to upgrade seamlessly given what we
have discussed as in scope for 2.0. For these users, having a 1.x release
with these new features/APIs stabilized will be very beneficial. This might
make Spark 1.7 a lighter release, but that is not necessarily a bad thing.

Any thoughts on this timeline?

Kostas Sakellis



On Thu, Nov 12, 2015 at 8:39 PM, Cheng, Hao <ha...@intel.com> wrote:

> Agree, more features/apis/optimization need to be added in DF/DS.
>
>
>
> I mean, we need to think about what kind of RDD APIs we have to provide to
> developer, maybe the fundamental API is enough, like, the ShuffledRDD
> etc..  But PairRDDFunctions probably not in this category, as we can do the
> same thing easily with DF/DS, even better performance.
>
>
>
> *From:* Mark Hamstra [mailto:mark@clearstorydata.com]
> *Sent:* Friday, November 13, 2015 11:23 AM
> *To:* Stephen Boesch
>
> *Cc:* dev@spark.apache.org
> *Subject:* Re: A proposal for Spark 2.0
>
>
>
> Hmmm... to me, that seems like precisely the kind of thing that argues for
> retaining the RDD API but not as the first thing presented to new Spark
> developers: "Here's how to use groupBy with DataFrames.... Until the
> optimizer is more fully developed, that won't always get you the best
> performance that can be obtained.  In these particular circumstances, ...,
> you may want to use the low-level RDD API while setting
> preservesPartitioning to true.  Like this...."
>
>
>
> On Thu, Nov 12, 2015 at 7:05 PM, Stephen Boesch <ja...@gmail.com> wrote:
>
> My understanding is that  the RDD's presently have more support for
> complete control of partitioning which is a key consideration at scale.
> While partitioning control is still piecemeal in  DF/DS  it would seem
> premature to make RDD's a second-tier approach to spark dev.
>
>
>
> An example is the use of groupBy when we know that the source relation
> (/RDD) is already partitioned on the grouping expressions.  AFAIK the spark
> sql still does not allow that knowledge to be applied to the optimizer - so
> a full shuffle will be performed. However in the native RDD we can use
> preservesPartitioning=true.
>
>
>
> 2015-11-12 17:42 GMT-08:00 Mark Hamstra <ma...@clearstorydata.com>:
>
> The place of the RDD API in 2.0 is also something I've been wondering
> about.  I think it may be going too far to deprecate it, but changing
> emphasis is something that we might consider.  The RDD API came well before
> DataFrames and DataSets, so programming guides, introductory how-to
> articles and the like have, to this point, also tended to emphasize RDDs --
> or at least to deal with them early.  What I'm thinking is that with 2.0
> maybe we should overhaul all the documentation to de-emphasize and
> reposition RDDs.  In this scheme, DataFrames and DataSets would be
> introduced and fully addressed before RDDs.  They would be presented as the
> normal/default/standard way to do things in Spark.  RDDs, in contrast,
> would be presented later as a kind of lower-level, closer-to-the-metal API
> that can be used in atypical, more specialized contexts where DataFrames or
> DataSets don't fully fit.
>
>
>
> On Thu, Nov 12, 2015 at 5:17 PM, Cheng, Hao <ha...@intel.com> wrote:
>
> I am not sure what the best practice for this specific problem, but it’s
> really worth to think about it in 2.0, as it is a painful issue for lots of
> users.
>
>
>
> By the way, is it also an opportunity to deprecate the RDD API (or
> internal API only?)? As lots of its functionality overlapping with
> DataFrame or DataSet.
>
>
>
> Hao
>
>
>
> *From:* Kostas Sakellis [mailto:kostas@cloudera.com]
> *Sent:* Friday, November 13, 2015 5:27 AM
> *To:* Nicholas Chammas
> *Cc:* Ulanov, Alexander; Nan Zhu; witgo@qq.com; dev@spark.apache.org;
> Reynold Xin
>
>
> *Subject:* Re: A proposal for Spark 2.0
>
>
>
> I know we want to keep breaking changes to a minimum but I'm hoping that
> with Spark 2.0 we can also look at better classpath isolation with user
> programs. I propose we build on spark.{driver|executor}.userClassPathFirst,
> setting it true by default, and not allow any spark transitive dependencies
> to leak into user code. For backwards compatibility we can have a whitelist
> if we want but I'd be good if we start requiring user apps to explicitly
> pull in all their dependencies. From what I can tell, Hadoop 3 is also
> moving in this direction.
>
>
>
> Kostas
>
>
>
> On Thu, Nov 12, 2015 at 9:56 AM, Nicholas Chammas <
> nicholas.chammas@gmail.com> wrote:
>
> With regards to Machine learning, it would be great to move useful
> features from MLlib to ML and deprecate the former. Current structure of
> two separate machine learning packages seems to be somewhat confusing.
>
> With regards to GraphX, it would be great to deprecate the use of RDD in
> GraphX and switch to Dataframe. This will allow GraphX evolve with Tungsten.
>
> On that note of deprecating stuff, it might be good to deprecate some
> things in 2.0 without removing or replacing them immediately. That way 2.0
> doesn’t have to wait for everything that we want to deprecate to be
> replaced all at once.
>
> Nick
>
>
>
>
>
> On Thu, Nov 12, 2015 at 12:45 PM Ulanov, Alexander <
> alexander.ulanov@hpe.com> wrote:
>
> Parameter Server is a new feature and thus does not match the goal of 2.0
> is “to fix things that are broken in the current API and remove certain
> deprecated APIs”. At the same time I would be happy to have that feature.
>
>
>
> With regards to Machine learning, it would be great to move useful
> features from MLlib to ML and deprecate the former. Current structure of
> two separate machine learning packages seems to be somewhat confusing.
>
> With regards to GraphX, it would be great to deprecate the use of RDD in
> GraphX and switch to Dataframe. This will allow GraphX evolve with Tungsten.
>
>
>
> Best regards, Alexander
>
>
>
> *From:* Nan Zhu [mailto:zhunanmcgill@gmail.com]
> *Sent:* Thursday, November 12, 2015 7:28 AM
> *To:* witgo@qq.com
> *Cc:* dev@spark.apache.org
> *Subject:* Re: A proposal for Spark 2.0
>
>
>
> Being specific to Parameter Server, I think the current agreement is that
> PS shall exist as a third-party library instead of a component of the core
> code base, isn’t?
>
>
>
> Best,
>
>
>
> --
>
> Nan Zhu
>
> http://codingcat.me
>
>
>
> On Thursday, November 12, 2015 at 9:49 AM, witgo@qq.com wrote:
>
> Who has the idea of machine learning? Spark missing some features for
> machine learning, For example, the parameter server.
>
>
>
>
>
> On Nov 12, 2015, at 05:32, Matei Zaharia <ma...@gmail.com> wrote:
>
>
>
> I like the idea of popping out Tachyon to an optional component too to
> reduce the number of dependencies. In the future, it might even be useful
> to do this for Hadoop, but it requires too many API changes to be worth
> doing now.
>
>
>
> Regarding Scala 2.12, we should definitely support it eventually, but I
> don't think we need to block 2.0 on that because it can be added later too.
> Has anyone investigated what it would take to run on there? I imagine we
> don't need many code changes, just maybe some REPL stuff.
>
>
>
> Needless to say, but I'm all for the idea of making "major" releases as
> undisruptive as possible in the model Reynold proposed. Keeping everyone
> working with the same set of releases is super important.
>
>
>
> Matei
>
>
>
> On Nov 11, 2015, at 4:58 AM, Sean Owen <so...@cloudera.com> wrote:
>
>
>
> On Wed, Nov 11, 2015 at 12:10 AM, Reynold Xin <rx...@databricks.com> wrote:
>
> to the Spark community. A major release should not be very different from a
>
> minor release and should not be gated based on new features. The main
>
> purpose of a major release is an opportunity to fix things that are broken
>
> in the current API and remove certain deprecated APIs (examples follow).
>
>
>
> Agree with this stance. Generally, a major release might also be a
>
> time to replace some big old API or implementation with a new one, but
>
> I don't see obvious candidates.
>
>
>
> I wouldn't mind turning attention to 2.x sooner than later, unless
>
> there's a fairly good reason to continue adding features in 1.x to a
>
> 1.7 release. The scope as of 1.6 is already pretty darned big.
>
>
>
>
>
> 1. Scala 2.11 as the default build. We should still support Scala 2.10, but
>
> it has been end-of-life.
>
>
>
> By the time 2.x rolls around, 2.12 will be the main version, 2.11 will
>
> be quite stable, and 2.10 will have been EOL for a while. I'd propose
>
> dropping 2.10. Otherwise it's supported for 2 more years.
>
>
>
>
>
> 2. Remove Hadoop 1 support.
>
>
>
> I'd go further to drop support for <2.2 for sure (2.0 and 2.1 were
>
> sort of 'alpha' and 'beta' releases) and even <2.6.
>
>
>
> I'm sure we'll think of a number of other small things -- shading a
>
> bunch of stuff? reviewing and updating dependencies in light of
>
> simpler, more recent dependencies to support from Hadoop etc?
>
>
>
> Farming out Tachyon to a module? (I felt like someone proposed this?)
>
> Pop out any Docker stuff to another repo?
>
> Continue that same effort for EC2?
>
> Farming out some of the "external" integrations to another repo (?
>
> controversial)
>
>
>
> See also anything marked version "2+" in JIRA.
>
>
>
> ---------------------------------------------------------------------
>
> To unsubscribe, e-mail: dev-unsubscribe@spark.apache.org
>
> For additional commands, e-mail: dev-help@spark.apache.org
>
>
>
>
>
> ---------------------------------------------------------------------
>
> To unsubscribe, e-mail: dev-unsubscribe@spark.apache.org
>
> For additional commands, e-mail: dev-help@spark.apache.org
>
>
>
>
>
>
>
>
>
> ---------------------------------------------------------------------
>
> To unsubscribe, e-mail: dev-unsubscribe@spark.apache.org
>
> For additional commands, e-mail: dev-help@spark.apache.org
>
>
>
>
>
>
>
>
>
>
>

RE: A proposal for Spark 2.0

Posted by "Cheng, Hao" <ha...@intel.com>.
Agreed, more features/APIs/optimizations need to be added in DF/DS.

I mean, we need to think about what kind of RDD APIs we have to provide to developers; maybe the fundamental APIs are enough, like ShuffledRDD etc. But PairRDDFunctions is probably not in this category, as we can do the same thing easily with DF/DS, with even better performance.
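
As a rough sketch of that overlap (names and data made up), here is the same keyed aggregation written once against PairRDDFunctions and once against the DataFrame API, where Catalyst/Tungsten gets to plan it:

  import org.apache.spark.{SparkConf, SparkContext}
  import org.apache.spark.sql.SQLContext
  import org.apache.spark.sql.functions.sum

  object AggregationOverlap {
    def main(args: Array[String]): Unit = {
      val sc = new SparkContext(new SparkConf().setAppName("overlap-sketch"))
      val sqlContext = new SQLContext(sc)
      import sqlContext.implicits._

      val clicks = sc.parallelize(Seq(("u1", 3L), ("u2", 5L), ("u1", 7L)))

      // RDD route: PairRDDFunctions.reduceByKey on a pair RDD.
      val rddTotals = clicks.reduceByKey(_ + _)

      // DataFrame route: the same aggregation, planned by Catalyst.
      val dfTotals = clicks.toDF("user", "clicks")
        .groupBy("user")
        .agg(sum("clicks"))

      rddTotals.collect().foreach(println)
      dfTotals.show()

      sc.stop()
    }
  }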

From: Mark Hamstra [mailto:mark@clearstorydata.com]
Sent: Friday, November 13, 2015 11:23 AM
To: Stephen Boesch
Cc: dev@spark.apache.org
Subject: Re: A proposal for Spark 2.0

Hmmm... to me, that seems like precisely the kind of thing that argues for retaining the RDD API but not as the first thing presented to new Spark developers: "Here's how to use groupBy with DataFrames.... Until the optimizer is more fully developed, that won't always get you the best performance that can be obtained.  In these particular circumstances, ..., you may want to use the low-level RDD API while setting preservesPartitioning to true.  Like this...."

On Thu, Nov 12, 2015 at 7:05 PM, Stephen Boesch <ja...@gmail.com>> wrote:
My understanding is that  the RDD's presently have more support for complete control of partitioning which is a key consideration at scale.  While partitioning control is still piecemeal in  DF/DS  it would seem premature to make RDD's a second-tier approach to spark dev.

An example is the use of groupBy when we know that the source relation (/RDD) is already partitioned on the grouping expressions.  AFAIK the spark sql still does not allow that knowledge to be applied to the optimizer - so a full shuffle will be performed. However in the native RDD we can use preservesPartitioning=true.

2015-11-12 17:42 GMT-08:00 Mark Hamstra <ma...@clearstorydata.com>>:
The place of the RDD API in 2.0 is also something I've been wondering about.  I think it may be going too far to deprecate it, but changing emphasis is something that we might consider.  The RDD API came well before DataFrames and DataSets, so programming guides, introductory how-to articles and the like have, to this point, also tended to emphasize RDDs -- or at least to deal with them early.  What I'm thinking is that with 2.0 maybe we should overhaul all the documentation to de-emphasize and reposition RDDs.  In this scheme, DataFrames and DataSets would be introduced and fully addressed before RDDs.  They would be presented as the normal/default/standard way to do things in Spark.  RDDs, in contrast, would be presented later as a kind of lower-level, closer-to-the-metal API that can be used in atypical, more specialized contexts where DataFrames or DataSets don't fully fit.

On Thu, Nov 12, 2015 at 5:17 PM, Cheng, Hao <ha...@intel.com>> wrote:
I am not sure what the best practice for this specific problem, but it’s really worth to think about it in 2.0, as it is a painful issue for lots of users.

By the way, is it also an opportunity to deprecate the RDD API (or internal API only?)? As lots of its functionality overlapping with DataFrame or DataSet.

Hao

From: Kostas Sakellis [mailto:kostas@cloudera.com<ma...@cloudera.com>]
Sent: Friday, November 13, 2015 5:27 AM
To: Nicholas Chammas
Cc: Ulanov, Alexander; Nan Zhu; witgo@qq.com<ma...@qq.com>; dev@spark.apache.org<ma...@spark.apache.org>; Reynold Xin

Subject: Re: A proposal for Spark 2.0

I know we want to keep breaking changes to a minimum but I'm hoping that with Spark 2.0 we can also look at better classpath isolation with user programs. I propose we build on spark.{driver|executor}.userClassPathFirst, setting it true by default, and not allow any spark transitive dependencies to leak into user code. For backwards compatibility we can have a whitelist if we want but I'd be good if we start requiring user apps to explicitly pull in all their dependencies. From what I can tell, Hadoop 3 is also moving in this direction.

Kostas

On Thu, Nov 12, 2015 at 9:56 AM, Nicholas Chammas <ni...@gmail.com>> wrote:

With regards to Machine learning, it would be great to move useful features from MLlib to ML and deprecate the former. Current structure of two separate machine learning packages seems to be somewhat confusing.

With regards to GraphX, it would be great to deprecate the use of RDD in GraphX and switch to Dataframe. This will allow GraphX evolve with Tungsten.

On that note of deprecating stuff, it might be good to deprecate some things in 2.0 without removing or replacing them immediately. That way 2.0 doesn’t have to wait for everything that we want to deprecate to be replaced all at once.

Nick

On Thu, Nov 12, 2015 at 12:45 PM Ulanov, Alexander <al...@hpe.com>> wrote:
Parameter Server is a new feature and thus does not match the goal of 2.0 is “to fix things that are broken in the current API and remove certain deprecated APIs”. At the same time I would be happy to have that feature.

With regards to Machine learning, it would be great to move useful features from MLlib to ML and deprecate the former. Current structure of two separate machine learning packages seems to be somewhat confusing.
With regards to GraphX, it would be great to deprecate the use of RDD in GraphX and switch to Dataframe. This will allow GraphX evolve with Tungsten.

Best regards, Alexander

From: Nan Zhu [mailto:zhunanmcgill@gmail.com<ma...@gmail.com>]
Sent: Thursday, November 12, 2015 7:28 AM
To: witgo@qq.com<ma...@qq.com>
Cc: dev@spark.apache.org<ma...@spark.apache.org>
Subject: Re: A proposal for Spark 2.0

Being specific to Parameter Server, I think the current agreement is that PS shall exist as a third-party library instead of a component of the core code base, isn’t?

Best,

--
Nan Zhu
http://codingcat.me


On Thursday, November 12, 2015 at 9:49 AM, witgo@qq.com<ma...@qq.com> wrote:
Who has the idea of machine learning? Spark missing some features for machine learning, For example, the parameter server.


On Nov 12, 2015, at 05:32, Matei Zaharia <ma...@gmail.com>> wrote:

I like the idea of popping out Tachyon to an optional component too to reduce the number of dependencies. In the future, it might even be useful to do this for Hadoop, but it requires too many API changes to be worth doing now.

Regarding Scala 2.12, we should definitely support it eventually, but I don't think we need to block 2.0 on that because it can be added later too. Has anyone investigated what it would take to run on there? I imagine we don't need many code changes, just maybe some REPL stuff.

Needless to say, but I'm all for the idea of making "major" releases as undisruptive as possible in the model Reynold proposed. Keeping everyone working with the same set of releases is super important.

Matei

On Nov 11, 2015, at 4:58 AM, Sean Owen <so...@cloudera.com>> wrote:

On Wed, Nov 11, 2015 at 12:10 AM, Reynold Xin <rx...@databricks.com>> wrote:
to the Spark community. A major release should not be very different from a
minor release and should not be gated based on new features. The main
purpose of a major release is an opportunity to fix things that are broken
in the current API and remove certain deprecated APIs (examples follow).

Agree with this stance. Generally, a major release might also be a
time to replace some big old API or implementation with a new one, but
I don't see obvious candidates.

I wouldn't mind turning attention to 2.x sooner than later, unless
there's a fairly good reason to continue adding features in 1.x to a
1.7 release. The scope as of 1.6 is already pretty darned big.


1. Scala 2.11 as the default build. We should still support Scala 2.10, but
it has been end-of-life.

By the time 2.x rolls around, 2.12 will be the main version, 2.11 will
be quite stable, and 2.10 will have been EOL for a while. I'd propose
dropping 2.10. Otherwise it's supported for 2 more years.


2. Remove Hadoop 1 support.

I'd go further to drop support for <2.2 for sure (2.0 and 2.1 were
sort of 'alpha' and 'beta' releases) and even <2.6.

I'm sure we'll think of a number of other small things -- shading a
bunch of stuff? reviewing and updating dependencies in light of
simpler, more recent dependencies to support from Hadoop etc?

Farming out Tachyon to a module? (I felt like someone proposed this?)
Pop out any Docker stuff to another repo?
Continue that same effort for EC2?
Farming out some of the "external" integrations to another repo (?
controversial)

See also anything marked version "2+" in JIRA.

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@spark.apache.org<ma...@spark.apache.org>
For additional commands, e-mail: dev-help@spark.apache.org<ma...@spark.apache.org>


---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@spark.apache.org<ma...@spark.apache.org>
For additional commands, e-mail: dev-help@spark.apache.org<ma...@spark.apache.org>




---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@spark.apache.org<ma...@spark.apache.org>
For additional commands, e-mail: dev-help@spark.apache.org<ma...@spark.apache.org>






Re: A proposal for Spark 2.0

Posted by Mark Hamstra <ma...@clearstorydata.com>.
Hmmm... to me, that seems like precisely the kind of thing that argues for
retaining the RDD API but not as the first thing presented to new Spark
developers: "Here's how to use groupBy with DataFrames.... Until the
optimizer is more fully developed, that won't always get you the best
performance that can be obtained.  In these particular circumstances, ...,
you may want to use the low-level RDD API while setting
preservesPartitioning to true.  Like this...."
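
To make that concrete, a docs-style sketch could look something like the
following - all names are hypothetical, the DataFrame version is the default
path, and the RDD version is the escape hatch for input that is already
partitioned by the grouping key:

  import org.apache.spark.rdd.RDD
  import org.apache.spark.sql.DataFrame
  import org.apache.spark.sql.functions.sum

  object GroupByGuideSketch {
    // Default path: express the aggregation on DataFrames and let the
    // optimizer plan it.
    def clicksPerUser(events: DataFrame): DataFrame =
      events.groupBy("userId").agg(sum("clicks"))

    // Escape hatch: the caller guarantees `events` is already hash-partitioned
    // by userId, so aggregate within each partition and keep that partitioner
    // via preservesPartitioning = true - no extra shuffle.
    def clicksPerUserLowLevel(events: RDD[(String, Long)]): RDD[(String, Long)] =
      events.mapPartitions(
        iter => {
          val totals = scala.collection.mutable.Map.empty[String, Long]
          iter.foreach { case (userId, clicks) =>
            totals(userId) = totals.getOrElse(userId, 0L) + clicks
          }
          totals.iterator
        },
        preservesPartitioning = true
      )
  }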

On Thu, Nov 12, 2015 at 7:05 PM, Stephen Boesch <ja...@gmail.com> wrote:

> My understanding is that  the RDD's presently have more support for
> complete control of partitioning which is a key consideration at scale.
> While partitioning control is still piecemeal in  DF/DS  it would seem
> premature to make RDD's a second-tier approach to spark dev.
>
> An example is the use of groupBy when we know that the source relation
> (/RDD) is already partitioned on the grouping expressions.  AFAIK the spark
> sql still does not allow that knowledge to be applied to the optimizer - so
> a full shuffle will be performed. However in the native RDD we can use
> preservesPartitioning=true.
>
> 2015-11-12 17:42 GMT-08:00 Mark Hamstra <ma...@clearstorydata.com>:
>
>> The place of the RDD API in 2.0 is also something I've been wondering
>> about.  I think it may be going too far to deprecate it, but changing
>> emphasis is something that we might consider.  The RDD API came well before
>> DataFrames and DataSets, so programming guides, introductory how-to
>> articles and the like have, to this point, also tended to emphasize RDDs --
>> or at least to deal with them early.  What I'm thinking is that with 2.0
>> maybe we should overhaul all the documentation to de-emphasize and
>> reposition RDDs.  In this scheme, DataFrames and DataSets would be
>> introduced and fully addressed before RDDs.  They would be presented as the
>> normal/default/standard way to do things in Spark.  RDDs, in contrast,
>> would be presented later as a kind of lower-level, closer-to-the-metal API
>> that can be used in atypical, more specialized contexts where DataFrames or
>> DataSets don't fully fit.
>>
>> On Thu, Nov 12, 2015 at 5:17 PM, Cheng, Hao <ha...@intel.com> wrote:
>>
>>> I am not sure what the best practice for this specific problem, but it’s
>>> really worth to think about it in 2.0, as it is a painful issue for lots of
>>> users.
>>>
>>>
>>>
>>> By the way, is it also an opportunity to deprecate the RDD API (or
>>> internal API only?)? As lots of its functionality overlapping with
>>> DataFrame or DataSet.
>>>
>>>
>>>
>>> Hao
>>>
>>>
>>>
>>> *From:* Kostas Sakellis [mailto:kostas@cloudera.com]
>>> *Sent:* Friday, November 13, 2015 5:27 AM
>>> *To:* Nicholas Chammas
>>> *Cc:* Ulanov, Alexander; Nan Zhu; witgo@qq.com; dev@spark.apache.org;
>>> Reynold Xin
>>>
>>> *Subject:* Re: A proposal for Spark 2.0
>>>
>>>
>>>
>>> I know we want to keep breaking changes to a minimum but I'm hoping that
>>> with Spark 2.0 we can also look at better classpath isolation with user
>>> programs. I propose we build on spark.{driver|executor}.userClassPathFirst,
>>> setting it true by default, and not allow any spark transitive dependencies
>>> to leak into user code. For backwards compatibility we can have a whitelist
>>> if we want but I'd be good if we start requiring user apps to explicitly
>>> pull in all their dependencies. From what I can tell, Hadoop 3 is also
>>> moving in this direction.
>>>
>>>
>>>
>>> Kostas
>>>
>>>
>>>
>>> On Thu, Nov 12, 2015 at 9:56 AM, Nicholas Chammas <
>>> nicholas.chammas@gmail.com> wrote:
>>>
>>> With regards to Machine learning, it would be great to move useful
>>> features from MLlib to ML and deprecate the former. Current structure of
>>> two separate machine learning packages seems to be somewhat confusing.
>>>
>>> With regards to GraphX, it would be great to deprecate the use of RDD in
>>> GraphX and switch to Dataframe. This will allow GraphX evolve with Tungsten.
>>>
>>> On that note of deprecating stuff, it might be good to deprecate some
>>> things in 2.0 without removing or replacing them immediately. That way 2.0
>>> doesn’t have to wait for everything that we want to deprecate to be
>>> replaced all at once.
>>>
>>> Nick
>>>
>>>
>>>
>>>
>>>
>>> On Thu, Nov 12, 2015 at 12:45 PM Ulanov, Alexander <
>>> alexander.ulanov@hpe.com> wrote:
>>>
>>> Parameter Server is a new feature and thus does not match the goal of
>>> 2.0 is “to fix things that are broken in the current API and remove certain
>>> deprecated APIs”. At the same time I would be happy to have that feature.
>>>
>>>
>>>
>>> With regards to Machine learning, it would be great to move useful
>>> features from MLlib to ML and deprecate the former. Current structure of
>>> two separate machine learning packages seems to be somewhat confusing.
>>>
>>> With regards to GraphX, it would be great to deprecate the use of RDD in
>>> GraphX and switch to Dataframe. This will allow GraphX evolve with Tungsten.
>>>
>>>
>>>
>>> Best regards, Alexander
>>>
>>>
>>>
>>> *From:* Nan Zhu [mailto:zhunanmcgill@gmail.com]
>>> *Sent:* Thursday, November 12, 2015 7:28 AM
>>> *To:* witgo@qq.com
>>> *Cc:* dev@spark.apache.org
>>> *Subject:* Re: A proposal for Spark 2.0
>>>
>>>
>>>
>>> Being specific to Parameter Server, I think the current agreement is
>>> that PS shall exist as a third-party library instead of a component of the
>>> core code base, isn’t?
>>>
>>>
>>>
>>> Best,
>>>
>>>
>>>
>>> --
>>>
>>> Nan Zhu
>>>
>>> http://codingcat.me
>>>
>>>
>>>
>>> On Thursday, November 12, 2015 at 9:49 AM, witgo@qq.com wrote:
>>>
>>> Who has the idea of machine learning? Spark missing some features for
>>> machine learning, For example, the parameter server.
>>>
>>>
>>>
>>>
>>>
>>> On Nov 12, 2015, at 05:32, Matei Zaharia <ma...@gmail.com> wrote:
>>>
>>>
>>>
>>> I like the idea of popping out Tachyon to an optional component too to
>>> reduce the number of dependencies. In the future, it might even be useful
>>> to do this for Hadoop, but it requires too many API changes to be worth
>>> doing now.
>>>
>>>
>>>
>>> Regarding Scala 2.12, we should definitely support it eventually, but I
>>> don't think we need to block 2.0 on that because it can be added later too.
>>> Has anyone investigated what it would take to run on there? I imagine we
>>> don't need many code changes, just maybe some REPL stuff.
>>>
>>>
>>>
>>> Needless to say, but I'm all for the idea of making "major" releases as
>>> undisruptive as possible in the model Reynold proposed. Keeping everyone
>>> working with the same set of releases is super important.
>>>
>>>
>>>
>>> Matei
>>>
>>>
>>>
>>> On Nov 11, 2015, at 4:58 AM, Sean Owen <so...@cloudera.com> wrote:
>>>
>>>
>>>
>>> On Wed, Nov 11, 2015 at 12:10 AM, Reynold Xin <rx...@databricks.com>
>>> wrote:
>>>
>>> to the Spark community. A major release should not be very different
>>> from a
>>>
>>> minor release and should not be gated based on new features. The main
>>>
>>> purpose of a major release is an opportunity to fix things that are
>>> broken
>>>
>>> in the current API and remove certain deprecated APIs (examples follow).
>>>
>>>
>>>
>>> Agree with this stance. Generally, a major release might also be a
>>>
>>> time to replace some big old API or implementation with a new one, but
>>>
>>> I don't see obvious candidates.
>>>
>>>
>>>
>>> I wouldn't mind turning attention to 2.x sooner than later, unless
>>>
>>> there's a fairly good reason to continue adding features in 1.x to a
>>>
>>> 1.7 release. The scope as of 1.6 is already pretty darned big.
>>>
>>>
>>>
>>>
>>>
>>> 1. Scala 2.11 as the default build. We should still support Scala 2.10,
>>> but
>>>
>>> it has been end-of-life.
>>>
>>>
>>>
>>> By the time 2.x rolls around, 2.12 will be the main version, 2.11 will
>>>
>>> be quite stable, and 2.10 will have been EOL for a while. I'd propose
>>>
>>> dropping 2.10. Otherwise it's supported for 2 more years.
>>>
>>>
>>>
>>>
>>>
>>> 2. Remove Hadoop 1 support.
>>>
>>>
>>>
>>> I'd go further to drop support for <2.2 for sure (2.0 and 2.1 were
>>>
>>> sort of 'alpha' and 'beta' releases) and even <2.6.
>>>
>>>
>>>
>>> I'm sure we'll think of a number of other small things -- shading a
>>>
>>> bunch of stuff? reviewing and updating dependencies in light of
>>>
>>> simpler, more recent dependencies to support from Hadoop etc?
>>>
>>>
>>>
>>> Farming out Tachyon to a module? (I felt like someone proposed this?)
>>>
>>> Pop out any Docker stuff to another repo?
>>>
>>> Continue that same effort for EC2?
>>>
>>> Farming out some of the "external" integrations to another repo (?
>>>
>>> controversial)
>>>
>>>
>>>
>>> See also anything marked version "2+" in JIRA.
>>>
>>>
>>>
>>> ---------------------------------------------------------------------
>>>
>>> To unsubscribe, e-mail: dev-unsubscribe@spark.apache.org
>>>
>>> For additional commands, e-mail: dev-help@spark.apache.org
>>>
>>>
>>>
>>>
>>>
>>> ---------------------------------------------------------------------
>>>
>>> To unsubscribe, e-mail: dev-unsubscribe@spark.apache.org
>>>
>>> For additional commands, e-mail: dev-help@spark.apache.org
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>> ---------------------------------------------------------------------
>>>
>>> To unsubscribe, e-mail: dev-unsubscribe@spark.apache.org
>>>
>>> For additional commands, e-mail: dev-help@spark.apache.org
>>>
>>>
>>>
>>>
>>>
>>
>>
>

Re: A proposal for Spark 2.0

Posted by Stephen Boesch <ja...@gmail.com>.
My understanding is that RDDs presently offer more complete control over
partitioning, which is a key consideration at scale. While partitioning
control is still piecemeal in DataFrames/Datasets, it would seem premature
to make RDDs a second-tier approach to Spark development.

An example is the use of groupBy when we know that the source relation
(/RDD) is already partitioned on the grouping expressions. AFAIK Spark SQL
still does not let that knowledge reach the optimizer, so a full shuffle
will be performed. In the native RDD API, however, we can rely on the known
partitioner (e.g. via preservesPartitioning=true) to avoid it.
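
A minimal Scala sketch of what I mean (illustrative sizes and key space;
the method name is made up):

import org.apache.spark.{SparkContext, HashPartitioner}

def partitionAwareAggregation(sc: SparkContext): Unit = {
  val pairs = sc.parallelize(1 to 1000000).map(i => (i % 100, 1L))

  // Partition once, explicitly; the RDD now remembers its HashPartitioner.
  val partitioned = pairs.partitionBy(new HashPartitioner(16)).cache()

  // reduceByKey reuses the existing partitioner, so no further shuffle
  // is planned for this aggregation.
  partitioned.reduceByKey(_ + _).count()

  // mapPartitions can keep the existing layout explicitly via
  // preservesPartitioning, so the follow-up reduceByKey again avoids
  // a shuffle.
  partitioned
    .mapPartitions(
      _.map { case (k, v) => (k, v + 1) },
      preservesPartitioning = true)
    .reduceByKey(_ + _)
    .count()
}

A DataFrame groupBy over the same data has no way today to be told that
the input is already laid out by the grouping key.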

2015-11-12 17:42 GMT-08:00 Mark Hamstra <ma...@clearstorydata.com>:

> The place of the RDD API in 2.0 is also something I've been wondering
> about.  I think it may be going too far to deprecate it, but changing
> emphasis is something that we might consider.  The RDD API came well before
> DataFrames and DataSets, so programming guides, introductory how-to
> articles and the like have, to this point, also tended to emphasize RDDs --
> or at least to deal with them early.  What I'm thinking is that with 2.0
> maybe we should overhaul all the documentation to de-emphasize and
> reposition RDDs.  In this scheme, DataFrames and DataSets would be
> introduced and fully addressed before RDDs.  They would be presented as the
> normal/default/standard way to do things in Spark.  RDDs, in contrast,
> would be presented later as a kind of lower-level, closer-to-the-metal API
> that can be used in atypical, more specialized contexts where DataFrames or
> DataSets don't fully fit.
>
> On Thu, Nov 12, 2015 at 5:17 PM, Cheng, Hao <ha...@intel.com> wrote:
>
>> I am not sure what the best practice for this specific problem, but it’s
>> really worth to think about it in 2.0, as it is a painful issue for lots of
>> users.
>>
>>
>>
>> By the way, is it also an opportunity to deprecate the RDD API (or
>> internal API only?)? As lots of its functionality overlapping with
>> DataFrame or DataSet.
>>
>>
>>
>> Hao
>>
>>
>>
>> *From:* Kostas Sakellis [mailto:kostas@cloudera.com]
>> *Sent:* Friday, November 13, 2015 5:27 AM
>> *To:* Nicholas Chammas
>> *Cc:* Ulanov, Alexander; Nan Zhu; witgo@qq.com; dev@spark.apache.org;
>> Reynold Xin
>>
>> *Subject:* Re: A proposal for Spark 2.0
>>
>>
>>
>> I know we want to keep breaking changes to a minimum but I'm hoping that
>> with Spark 2.0 we can also look at better classpath isolation with user
>> programs. I propose we build on spark.{driver|executor}.userClassPathFirst,
>> setting it true by default, and not allow any spark transitive dependencies
>> to leak into user code. For backwards compatibility we can have a whitelist
>> if we want but I'd be good if we start requiring user apps to explicitly
>> pull in all their dependencies. From what I can tell, Hadoop 3 is also
>> moving in this direction.
>>
>>
>>
>> Kostas
>>
>>
>>
>> On Thu, Nov 12, 2015 at 9:56 AM, Nicholas Chammas <
>> nicholas.chammas@gmail.com> wrote:
>>
>> With regards to Machine learning, it would be great to move useful
>> features from MLlib to ML and deprecate the former. Current structure of
>> two separate machine learning packages seems to be somewhat confusing.
>>
>> With regards to GraphX, it would be great to deprecate the use of RDD in
>> GraphX and switch to Dataframe. This will allow GraphX evolve with Tungsten.
>>
>> On that note of deprecating stuff, it might be good to deprecate some
>> things in 2.0 without removing or replacing them immediately. That way 2.0
>> doesn’t have to wait for everything that we want to deprecate to be
>> replaced all at once.
>>
>> Nick
>>
>> ​
>>
>>
>>
>> On Thu, Nov 12, 2015 at 12:45 PM Ulanov, Alexander <
>> alexander.ulanov@hpe.com> wrote:
>>
>> Parameter Server is a new feature and thus does not match the goal of 2.0
>> is “to fix things that are broken in the current API and remove certain
>> deprecated APIs”. At the same time I would be happy to have that feature.
>>
>>
>>
>> With regards to Machine learning, it would be great to move useful
>> features from MLlib to ML and deprecate the former. Current structure of
>> two separate machine learning packages seems to be somewhat confusing.
>>
>> With regards to GraphX, it would be great to deprecate the use of RDD in
>> GraphX and switch to Dataframe. This will allow GraphX evolve with Tungsten.
>>
>>
>>
>> Best regards, Alexander
>>
>>
>>
>> *From:* Nan Zhu [mailto:zhunanmcgill@gmail.com]
>> *Sent:* Thursday, November 12, 2015 7:28 AM
>> *To:* witgo@qq.com
>> *Cc:* dev@spark.apache.org
>> *Subject:* Re: A proposal for Spark 2.0
>>
>>
>>
>> Being specific to Parameter Server, I think the current agreement is that
>> PS shall exist as a third-party library instead of a component of the core
>> code base, isn’t?
>>
>>
>>
>> Best,
>>
>>
>>
>> --
>>
>> Nan Zhu
>>
>> http://codingcat.me
>>
>>
>>
>> On Thursday, November 12, 2015 at 9:49 AM, witgo@qq.com wrote:
>>
>> Who has the idea of machine learning? Spark missing some features for
>> machine learning, For example, the parameter server.
>>
>>
>>
>>
>>
>> 在 2015年11月12日,05:32,Matei Zaharia <ma...@gmail.com> 写道:
>>
>>
>>
>> I like the idea of popping out Tachyon to an optional component too to
>> reduce the number of dependencies. In the future, it might even be useful
>> to do this for Hadoop, but it requires too many API changes to be worth
>> doing now.
>>
>>
>>
>> Regarding Scala 2.12, we should definitely support it eventually, but I
>> don't think we need to block 2.0 on that because it can be added later too.
>> Has anyone investigated what it would take to run on there? I imagine we
>> don't need many code changes, just maybe some REPL stuff.
>>
>>
>>
>> Needless to say, but I'm all for the idea of making "major" releases as
>> undisruptive as possible in the model Reynold proposed. Keeping everyone
>> working with the same set of releases is super important.
>>
>>
>>
>> Matei
>>
>>
>>
>> On Nov 11, 2015, at 4:58 AM, Sean Owen <so...@cloudera.com> wrote:
>>
>>
>>
>> On Wed, Nov 11, 2015 at 12:10 AM, Reynold Xin <rx...@databricks.com>
>> wrote:
>>
>> to the Spark community. A major release should not be very different from
>> a
>>
>> minor release and should not be gated based on new features. The main
>>
>> purpose of a major release is an opportunity to fix things that are broken
>>
>> in the current API and remove certain deprecated APIs (examples follow).
>>
>>
>>
>> Agree with this stance. Generally, a major release might also be a
>>
>> time to replace some big old API or implementation with a new one, but
>>
>> I don't see obvious candidates.
>>
>>
>>
>> I wouldn't mind turning attention to 2.x sooner than later, unless
>>
>> there's a fairly good reason to continue adding features in 1.x to a
>>
>> 1.7 release. The scope as of 1.6 is already pretty darned big.
>>
>>
>>
>>
>>
>> 1. Scala 2.11 as the default build. We should still support Scala 2.10,
>> but
>>
>> it has been end-of-life.
>>
>>
>>
>> By the time 2.x rolls around, 2.12 will be the main version, 2.11 will
>>
>> be quite stable, and 2.10 will have been EOL for a while. I'd propose
>>
>> dropping 2.10. Otherwise it's supported for 2 more years.
>>
>>
>>
>>
>>
>> 2. Remove Hadoop 1 support.
>>
>>
>>
>> I'd go further to drop support for <2.2 for sure (2.0 and 2.1 were
>>
>> sort of 'alpha' and 'beta' releases) and even <2.6.
>>
>>
>>
>> I'm sure we'll think of a number of other small things -- shading a
>>
>> bunch of stuff? reviewing and updating dependencies in light of
>>
>> simpler, more recent dependencies to support from Hadoop etc?
>>
>>
>>
>> Farming out Tachyon to a module? (I felt like someone proposed this?)
>>
>> Pop out any Docker stuff to another repo?
>>
>> Continue that same effort for EC2?
>>
>> Farming out some of the "external" integrations to another repo (?
>>
>> controversial)
>>
>>
>>
>> See also anything marked version "2+" in JIRA.
>>
>>
>>
>> ---------------------------------------------------------------------
>>
>> To unsubscribe, e-mail: dev-unsubscribe@spark.apache.org
>>
>> For additional commands, e-mail: dev-help@spark.apache.org
>>
>>
>>
>>
>>
>> ---------------------------------------------------------------------
>>
>> To unsubscribe, e-mail: dev-unsubscribe@spark.apache.org
>>
>> For additional commands, e-mail: dev-help@spark.apache.org
>>
>>
>>
>>
>>
>>
>>
>>
>>
>> ---------------------------------------------------------------------
>>
>> To unsubscribe, e-mail: dev-unsubscribe@spark.apache.org
>>
>> For additional commands, e-mail: dev-help@spark.apache.org
>>
>>
>>
>>
>>
>
>

Re: A proposal for Spark 2.0

Posted by Mark Hamstra <ma...@clearstorydata.com>.
The place of the RDD API in 2.0 is also something I've been wondering
about.  I think it may be going too far to deprecate it, but changing
emphasis is something that we might consider.  The RDD API came well before
DataFrames and DataSets, so programming guides, introductory how-to
articles and the like have, to this point, also tended to emphasize RDDs --
or at least to deal with them early.  What I'm thinking is that with 2.0
maybe we should overhaul all the documentation to de-emphasize and
reposition RDDs.  In this scheme, DataFrames and DataSets would be
introduced and fully addressed before RDDs.  They would be presented as the
normal/default/standard way to do things in Spark.  RDDs, in contrast,
would be presented later as a kind of lower-level, closer-to-the-metal API
that can be used in atypical, more specialized contexts where DataFrames or
DataSets don't fully fit.
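
For illustration (made-up column names), the sort of side-by-side the docs
could lead with -- the DataFrame form first, with the RDD form shown
afterwards as the lower-level alternative:

import org.apache.spark.SparkContext
import org.apache.spark.sql.SQLContext

def docExample(sc: SparkContext): Unit = {
  val sqlContext = new SQLContext(sc)
  import sqlContext.implicits._

  // Default/standard style: DataFrame API, optimizable by Catalyst/Tungsten.
  val df = sc.parallelize(Seq(("a", 1), ("b", 2), ("a", 3)))
    .toDF("key", "value")
  df.groupBy("key").sum("value").show()

  // Lower-level style: the same aggregation written against the raw RDD.
  sc.parallelize(Seq(("a", 1), ("b", 2), ("a", 3)))
    .reduceByKey(_ + _)
    .foreach(println)
}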

On Thu, Nov 12, 2015 at 5:17 PM, Cheng, Hao <ha...@intel.com> wrote:

> I am not sure what the best practice for this specific problem, but it’s
> really worth to think about it in 2.0, as it is a painful issue for lots of
> users.
>
>
>
> By the way, is it also an opportunity to deprecate the RDD API (or
> internal API only?)? As lots of its functionality overlapping with
> DataFrame or DataSet.
>
>
>
> Hao
>
>
>
> *From:* Kostas Sakellis [mailto:kostas@cloudera.com]
> *Sent:* Friday, November 13, 2015 5:27 AM
> *To:* Nicholas Chammas
> *Cc:* Ulanov, Alexander; Nan Zhu; witgo@qq.com; dev@spark.apache.org;
> Reynold Xin
>
> *Subject:* Re: A proposal for Spark 2.0
>
>
>
> I know we want to keep breaking changes to a minimum but I'm hoping that
> with Spark 2.0 we can also look at better classpath isolation with user
> programs. I propose we build on spark.{driver|executor}.userClassPathFirst,
> setting it true by default, and not allow any spark transitive dependencies
> to leak into user code. For backwards compatibility we can have a whitelist
> if we want but I'd be good if we start requiring user apps to explicitly
> pull in all their dependencies. From what I can tell, Hadoop 3 is also
> moving in this direction.
>
>
>
> Kostas
>
>
>
> On Thu, Nov 12, 2015 at 9:56 AM, Nicholas Chammas <
> nicholas.chammas@gmail.com> wrote:
>
> With regards to Machine learning, it would be great to move useful
> features from MLlib to ML and deprecate the former. Current structure of
> two separate machine learning packages seems to be somewhat confusing.
>
> With regards to GraphX, it would be great to deprecate the use of RDD in
> GraphX and switch to Dataframe. This will allow GraphX evolve with Tungsten.
>
> On that note of deprecating stuff, it might be good to deprecate some
> things in 2.0 without removing or replacing them immediately. That way 2.0
> doesn’t have to wait for everything that we want to deprecate to be
> replaced all at once.
>
> Nick
>
> ​
>
>
>
> On Thu, Nov 12, 2015 at 12:45 PM Ulanov, Alexander <
> alexander.ulanov@hpe.com> wrote:
>
> Parameter Server is a new feature and thus does not match the goal of 2.0
> is “to fix things that are broken in the current API and remove certain
> deprecated APIs”. At the same time I would be happy to have that feature.
>
>
>
> With regards to Machine learning, it would be great to move useful
> features from MLlib to ML and deprecate the former. Current structure of
> two separate machine learning packages seems to be somewhat confusing.
>
> With regards to GraphX, it would be great to deprecate the use of RDD in
> GraphX and switch to Dataframe. This will allow GraphX evolve with Tungsten.
>
>
>
> Best regards, Alexander
>
>
>
> *From:* Nan Zhu [mailto:zhunanmcgill@gmail.com]
> *Sent:* Thursday, November 12, 2015 7:28 AM
> *To:* witgo@qq.com
> *Cc:* dev@spark.apache.org
> *Subject:* Re: A proposal for Spark 2.0
>
>
>
> Being specific to Parameter Server, I think the current agreement is that
> PS shall exist as a third-party library instead of a component of the core
> code base, isn’t?
>
>
>
> Best,
>
>
>
> --
>
> Nan Zhu
>
> http://codingcat.me
>
>
>
> On Thursday, November 12, 2015 at 9:49 AM, witgo@qq.com wrote:
>
> Who has the idea of machine learning? Spark missing some features for
> machine learning, For example, the parameter server.
>
>
>
>
>
> 在 2015年11月12日,05:32,Matei Zaharia <ma...@gmail.com> 写道:
>
>
>
> I like the idea of popping out Tachyon to an optional component too to
> reduce the number of dependencies. In the future, it might even be useful
> to do this for Hadoop, but it requires too many API changes to be worth
> doing now.
>
>
>
> Regarding Scala 2.12, we should definitely support it eventually, but I
> don't think we need to block 2.0 on that because it can be added later too.
> Has anyone investigated what it would take to run on there? I imagine we
> don't need many code changes, just maybe some REPL stuff.
>
>
>
> Needless to say, but I'm all for the idea of making "major" releases as
> undisruptive as possible in the model Reynold proposed. Keeping everyone
> working with the same set of releases is super important.
>
>
>
> Matei
>
>
>
> On Nov 11, 2015, at 4:58 AM, Sean Owen <so...@cloudera.com> wrote:
>
>
>
> On Wed, Nov 11, 2015 at 12:10 AM, Reynold Xin <rx...@databricks.com> wrote:
>
> to the Spark community. A major release should not be very different from a
>
> minor release and should not be gated based on new features. The main
>
> purpose of a major release is an opportunity to fix things that are broken
>
> in the current API and remove certain deprecated APIs (examples follow).
>
>
>
> Agree with this stance. Generally, a major release might also be a
>
> time to replace some big old API or implementation with a new one, but
>
> I don't see obvious candidates.
>
>
>
> I wouldn't mind turning attention to 2.x sooner than later, unless
>
> there's a fairly good reason to continue adding features in 1.x to a
>
> 1.7 release. The scope as of 1.6 is already pretty darned big.
>
>
>
>
>
> 1. Scala 2.11 as the default build. We should still support Scala 2.10, but
>
> it has been end-of-life.
>
>
>
> By the time 2.x rolls around, 2.12 will be the main version, 2.11 will
>
> be quite stable, and 2.10 will have been EOL for a while. I'd propose
>
> dropping 2.10. Otherwise it's supported for 2 more years.
>
>
>
>
>
> 2. Remove Hadoop 1 support.
>
>
>
> I'd go further to drop support for <2.2 for sure (2.0 and 2.1 were
>
> sort of 'alpha' and 'beta' releases) and even <2.6.
>
>
>
> I'm sure we'll think of a number of other small things -- shading a
>
> bunch of stuff? reviewing and updating dependencies in light of
>
> simpler, more recent dependencies to support from Hadoop etc?
>
>
>
> Farming out Tachyon to a module? (I felt like someone proposed this?)
>
> Pop out any Docker stuff to another repo?
>
> Continue that same effort for EC2?
>
> Farming out some of the "external" integrations to another repo (?
>
> controversial)
>
>
>
> See also anything marked version "2+" in JIRA.
>
>
>
> ---------------------------------------------------------------------
>
> To unsubscribe, e-mail: dev-unsubscribe@spark.apache.org
>
> For additional commands, e-mail: dev-help@spark.apache.org
>
>
>
>
>
> ---------------------------------------------------------------------
>
> To unsubscribe, e-mail: dev-unsubscribe@spark.apache.org
>
> For additional commands, e-mail: dev-help@spark.apache.org
>
>
>
>
>
>
>
>
>
> ---------------------------------------------------------------------
>
> To unsubscribe, e-mail: dev-unsubscribe@spark.apache.org
>
> For additional commands, e-mail: dev-help@spark.apache.org
>
>
>
>
>

RE: A proposal for Spark 2.0

Posted by "Cheng, Hao" <ha...@intel.com>.
I am not sure what the best practice is for this specific problem, but it’s really worth thinking about in 2.0, as it is a painful issue for lots of users.

By the way, is it also an opportunity to deprecate the RDD API (or make it internal API only)? Lots of its functionality overlaps with DataFrame or DataSet.

Hao

From: Kostas Sakellis [mailto:kostas@cloudera.com]
Sent: Friday, November 13, 2015 5:27 AM
To: Nicholas Chammas
Cc: Ulanov, Alexander; Nan Zhu; witgo@qq.com; dev@spark.apache.org; Reynold Xin
Subject: Re: A proposal for Spark 2.0

I know we want to keep breaking changes to a minimum but I'm hoping that with Spark 2.0 we can also look at better classpath isolation with user programs. I propose we build on spark.{driver|executor}.userClassPathFirst, setting it true by default, and not allow any spark transitive dependencies to leak into user code. For backwards compatibility we can have a whitelist if we want but I'd be good if we start requiring user apps to explicitly pull in all their dependencies. From what I can tell, Hadoop 3 is also moving in this direction.

Kostas

On Thu, Nov 12, 2015 at 9:56 AM, Nicholas Chammas <ni...@gmail.com>> wrote:

With regards to Machine learning, it would be great to move useful features from MLlib to ML and deprecate the former. Current structure of two separate machine learning packages seems to be somewhat confusing.

With regards to GraphX, it would be great to deprecate the use of RDD in GraphX and switch to Dataframe. This will allow GraphX evolve with Tungsten.

On that note of deprecating stuff, it might be good to deprecate some things in 2.0 without removing or replacing them immediately. That way 2.0 doesn’t have to wait for everything that we want to deprecate to be replaced all at once.

Nick
​

On Thu, Nov 12, 2015 at 12:45 PM Ulanov, Alexander <al...@hpe.com>> wrote:
Parameter Server is a new feature and thus does not match the goal of 2.0 is “to fix things that are broken in the current API and remove certain deprecated APIs”. At the same time I would be happy to have that feature.

With regards to Machine learning, it would be great to move useful features from MLlib to ML and deprecate the former. Current structure of two separate machine learning packages seems to be somewhat confusing.
With regards to GraphX, it would be great to deprecate the use of RDD in GraphX and switch to Dataframe. This will allow GraphX evolve with Tungsten.

Best regards, Alexander

From: Nan Zhu [mailto:zhunanmcgill@gmail.com<ma...@gmail.com>]
Sent: Thursday, November 12, 2015 7:28 AM
To: witgo@qq.com<ma...@qq.com>
Cc: dev@spark.apache.org<ma...@spark.apache.org>
Subject: Re: A proposal for Spark 2.0

Being specific to Parameter Server, I think the current agreement is that PS shall exist as a third-party library instead of a component of the core code base, isn’t?

Best,

--
Nan Zhu
http://codingcat.me


On Thursday, November 12, 2015 at 9:49 AM, witgo@qq.com<ma...@qq.com> wrote:
Who has the idea of machine learning? Spark missing some features for machine learning, For example, the parameter server.


在 2015年11月12日,05:32,Matei Zaharia <ma...@gmail.com>> 写道:

I like the idea of popping out Tachyon to an optional component too to reduce the number of dependencies. In the future, it might even be useful to do this for Hadoop, but it requires too many API changes to be worth doing now.

Regarding Scala 2.12, we should definitely support it eventually, but I don't think we need to block 2.0 on that because it can be added later too. Has anyone investigated what it would take to run on there? I imagine we don't need many code changes, just maybe some REPL stuff.

Needless to say, but I'm all for the idea of making "major" releases as undisruptive as possible in the model Reynold proposed. Keeping everyone working with the same set of releases is super important.

Matei

On Nov 11, 2015, at 4:58 AM, Sean Owen <so...@cloudera.com>> wrote:

On Wed, Nov 11, 2015 at 12:10 AM, Reynold Xin <rx...@databricks.com>> wrote:
to the Spark community. A major release should not be very different from a
minor release and should not be gated based on new features. The main
purpose of a major release is an opportunity to fix things that are broken
in the current API and remove certain deprecated APIs (examples follow).

Agree with this stance. Generally, a major release might also be a
time to replace some big old API or implementation with a new one, but
I don't see obvious candidates.

I wouldn't mind turning attention to 2.x sooner than later, unless
there's a fairly good reason to continue adding features in 1.x to a
1.7 release. The scope as of 1.6 is already pretty darned big.


1. Scala 2.11 as the default build. We should still support Scala 2.10, but
it has been end-of-life.

By the time 2.x rolls around, 2.12 will be the main version, 2.11 will
be quite stable, and 2.10 will have been EOL for a while. I'd propose
dropping 2.10. Otherwise it's supported for 2 more years.


2. Remove Hadoop 1 support.

I'd go further to drop support for <2.2 for sure (2.0 and 2.1 were
sort of 'alpha' and 'beta' releases) and even <2.6.

I'm sure we'll think of a number of other small things -- shading a
bunch of stuff? reviewing and updating dependencies in light of
simpler, more recent dependencies to support from Hadoop etc?

Farming out Tachyon to a module? (I felt like someone proposed this?)
Pop out any Docker stuff to another repo?
Continue that same effort for EC2?
Farming out some of the "external" integrations to another repo (?
controversial)

See also anything marked version "2+" in JIRA.

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@spark.apache.org<ma...@spark.apache.org>
For additional commands, e-mail: dev-help@spark.apache.org<ma...@spark.apache.org>


---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@spark.apache.org<ma...@spark.apache.org>
For additional commands, e-mail: dev-help@spark.apache.org<ma...@spark.apache.org>




---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@spark.apache.org<ma...@spark.apache.org>
For additional commands, e-mail: dev-help@spark.apache.org<ma...@spark.apache.org>



Re: A proposal for Spark 2.0

Posted by Kostas Sakellis <ko...@cloudera.com>.
I know we want to keep breaking changes to a minimum, but I'm hoping that
with Spark 2.0 we can also look at better classpath isolation for user
programs. I propose we build on spark.{driver|executor}.userClassPathFirst,
setting it to true by default, and not allow any Spark transitive
dependencies to leak into user code. For backwards compatibility we can have
a whitelist if we want, but it'd be good if we start requiring user apps to
explicitly pull in all their dependencies. From what I can tell, Hadoop 3 is
also moving in this direction.
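
As a rough sketch of the end state from the user's side (these two configs
already exist today, just defaulted to false), the 2.0 change would amount
to flipping the defaults of something like this in application code:

import org.apache.spark.{SparkConf, SparkContext}

// Opt in to user-classpath-first isolation today; the proposal is
// essentially to make this the default in 2.0.
val conf = new SparkConf()
  .setAppName("isolated-app")
  .set("spark.driver.userClassPathFirst", "true")
  .set("spark.executor.userClassPathFirst", "true")
val sc = new SparkContext(conf)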

Kostas

On Thu, Nov 12, 2015 at 9:56 AM, Nicholas Chammas <
nicholas.chammas@gmail.com> wrote:

> With regards to Machine learning, it would be great to move useful
> features from MLlib to ML and deprecate the former. Current structure of
> two separate machine learning packages seems to be somewhat confusing.
>
> With regards to GraphX, it would be great to deprecate the use of RDD in
> GraphX and switch to Dataframe. This will allow GraphX evolve with Tungsten.
>
> On that note of deprecating stuff, it might be good to deprecate some
> things in 2.0 without removing or replacing them immediately. That way 2.0
> doesn’t have to wait for everything that we want to deprecate to be
> replaced all at once.
>
> Nick
> ​
>
> On Thu, Nov 12, 2015 at 12:45 PM Ulanov, Alexander <
> alexander.ulanov@hpe.com> wrote:
>
>> Parameter Server is a new feature and thus does not match the goal of 2.0
>> is “to fix things that are broken in the current API and remove certain
>> deprecated APIs”. At the same time I would be happy to have that feature.
>>
>>
>>
>> With regards to Machine learning, it would be great to move useful
>> features from MLlib to ML and deprecate the former. Current structure of
>> two separate machine learning packages seems to be somewhat confusing.
>>
>> With regards to GraphX, it would be great to deprecate the use of RDD in
>> GraphX and switch to Dataframe. This will allow GraphX evolve with Tungsten.
>>
>>
>>
>> Best regards, Alexander
>>
>>
>>
>> *From:* Nan Zhu [mailto:zhunanmcgill@gmail.com]
>> *Sent:* Thursday, November 12, 2015 7:28 AM
>> *To:* witgo@qq.com
>> *Cc:* dev@spark.apache.org
>> *Subject:* Re: A proposal for Spark 2.0
>>
>>
>>
>> Being specific to Parameter Server, I think the current agreement is that
>> PS shall exist as a third-party library instead of a component of the core
>> code base, isn’t?
>>
>>
>>
>> Best,
>>
>>
>>
>> --
>>
>> Nan Zhu
>>
>> http://codingcat.me
>>
>>
>>
>> On Thursday, November 12, 2015 at 9:49 AM, witgo@qq.com wrote:
>>
>> Who has the idea of machine learning? Spark missing some features for
>> machine learning, For example, the parameter server.
>>
>>
>>
>>
>>
>> 在 2015年11月12日,05:32,Matei Zaharia <ma...@gmail.com> 写道:
>>
>>
>>
>> I like the idea of popping out Tachyon to an optional component too to
>> reduce the number of dependencies. In the future, it might even be useful
>> to do this for Hadoop, but it requires too many API changes to be worth
>> doing now.
>>
>>
>>
>> Regarding Scala 2.12, we should definitely support it eventually, but I
>> don't think we need to block 2.0 on that because it can be added later too.
>> Has anyone investigated what it would take to run on there? I imagine we
>> don't need many code changes, just maybe some REPL stuff.
>>
>>
>>
>> Needless to say, but I'm all for the idea of making "major" releases as
>> undisruptive as possible in the model Reynold proposed. Keeping everyone
>> working with the same set of releases is super important.
>>
>>
>>
>> Matei
>>
>>
>>
>> On Nov 11, 2015, at 4:58 AM, Sean Owen <so...@cloudera.com> wrote:
>>
>>
>>
>> On Wed, Nov 11, 2015 at 12:10 AM, Reynold Xin <rx...@databricks.com>
>> wrote:
>>
>> to the Spark community. A major release should not be very different from
>> a
>>
>> minor release and should not be gated based on new features. The main
>>
>> purpose of a major release is an opportunity to fix things that are broken
>>
>> in the current API and remove certain deprecated APIs (examples follow).
>>
>>
>>
>> Agree with this stance. Generally, a major release might also be a
>>
>> time to replace some big old API or implementation with a new one, but
>>
>> I don't see obvious candidates.
>>
>>
>>
>> I wouldn't mind turning attention to 2.x sooner than later, unless
>>
>> there's a fairly good reason to continue adding features in 1.x to a
>>
>> 1.7 release. The scope as of 1.6 is already pretty darned big.
>>
>>
>>
>>
>>
>> 1. Scala 2.11 as the default build. We should still support Scala 2.10,
>> but
>>
>> it has been end-of-life.
>>
>>
>>
>> By the time 2.x rolls around, 2.12 will be the main version, 2.11 will
>>
>> be quite stable, and 2.10 will have been EOL for a while. I'd propose
>>
>> dropping 2.10. Otherwise it's supported for 2 more years.
>>
>>
>>
>>
>>
>> 2. Remove Hadoop 1 support.
>>
>>
>>
>> I'd go further to drop support for <2.2 for sure (2.0 and 2.1 were
>>
>> sort of 'alpha' and 'beta' releases) and even <2.6.
>>
>>
>>
>> I'm sure we'll think of a number of other small things -- shading a
>>
>> bunch of stuff? reviewing and updating dependencies in light of
>>
>> simpler, more recent dependencies to support from Hadoop etc?
>>
>>
>>
>> Farming out Tachyon to a module? (I felt like someone proposed this?)
>>
>> Pop out any Docker stuff to another repo?
>>
>> Continue that same effort for EC2?
>>
>> Farming out some of the "external" integrations to another repo (?
>>
>> controversial)
>>
>>
>>
>> See also anything marked version "2+" in JIRA.
>>
>>
>>
>> ---------------------------------------------------------------------
>>
>> To unsubscribe, e-mail: dev-unsubscribe@spark.apache.org
>>
>> For additional commands, e-mail: dev-help@spark.apache.org
>>
>>
>>
>>
>>
>> ---------------------------------------------------------------------
>>
>> To unsubscribe, e-mail: dev-unsubscribe@spark.apache.org
>>
>> For additional commands, e-mail: dev-help@spark.apache.org
>>
>>
>>
>>
>>
>>
>>
>>
>>
>> ---------------------------------------------------------------------
>>
>> To unsubscribe, e-mail: dev-unsubscribe@spark.apache.org
>>
>> For additional commands, e-mail: dev-help@spark.apache.org
>>
>>
>>
>

Re: A proposal for Spark 2.0

Posted by Nicholas Chammas <ni...@gmail.com>.
With regards to Machine learning, it would be great to move useful features
from MLlib to ML and deprecate the former. Current structure of two
separate machine learning packages seems to be somewhat confusing.

With regards to GraphX, it would be great to deprecate the use of RDD in
GraphX and switch to DataFrame. This will allow GraphX to evolve with Tungsten.

On that note of deprecating stuff, it might be good to deprecate some
things in 2.0 without removing or replacing them immediately. That way 2.0
doesn’t have to wait for everything that we want to deprecate to be
replaced all at once.
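
Concretely, that can be as light-touch as an annotation in 2.0 with removal
deferred to a later release (the class name below is hypothetical):

// Callers keep compiling in 2.0, with a deprecation warning pointing at
// the replacement; the class itself can be removed in a later release.
@deprecated("Use the DataFrame-based spark.ml equivalent instead", "2.0.0")
class OldRddBasedEstimator {
  def train(): Unit = ()
}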

Nick
​

On Thu, Nov 12, 2015 at 12:45 PM Ulanov, Alexander <al...@hpe.com>
wrote:

> Parameter Server is a new feature and thus does not match the goal of 2.0
> is “to fix things that are broken in the current API and remove certain
> deprecated APIs”. At the same time I would be happy to have that feature.
>
>
>
> With regards to Machine learning, it would be great to move useful
> features from MLlib to ML and deprecate the former. Current structure of
> two separate machine learning packages seems to be somewhat confusing.
>
> With regards to GraphX, it would be great to deprecate the use of RDD in
> GraphX and switch to Dataframe. This will allow GraphX evolve with Tungsten.
>
>
>
> Best regards, Alexander
>
>
>
> *From:* Nan Zhu [mailto:zhunanmcgill@gmail.com]
> *Sent:* Thursday, November 12, 2015 7:28 AM
> *To:* witgo@qq.com
> *Cc:* dev@spark.apache.org
> *Subject:* Re: A proposal for Spark 2.0
>
>
>
> Being specific to Parameter Server, I think the current agreement is that
> PS shall exist as a third-party library instead of a component of the core
> code base, isn’t?
>
>
>
> Best,
>
>
>
> --
>
> Nan Zhu
>
> http://codingcat.me
>
>
>
> On Thursday, November 12, 2015 at 9:49 AM, witgo@qq.com wrote:
>
> Who has the idea of machine learning? Spark missing some features for
> machine learning, For example, the parameter server.
>
>
>
>
>
> 在 2015年11月12日,05:32,Matei Zaharia <ma...@gmail.com> 写道:
>
>
>
> I like the idea of popping out Tachyon to an optional component too to
> reduce the number of dependencies. In the future, it might even be useful
> to do this for Hadoop, but it requires too many API changes to be worth
> doing now.
>
>
>
> Regarding Scala 2.12, we should definitely support it eventually, but I
> don't think we need to block 2.0 on that because it can be added later too.
> Has anyone investigated what it would take to run on there? I imagine we
> don't need many code changes, just maybe some REPL stuff.
>
>
>
> Needless to say, but I'm all for the idea of making "major" releases as
> undisruptive as possible in the model Reynold proposed. Keeping everyone
> working with the same set of releases is super important.
>
>
>
> Matei
>
>
>
> On Nov 11, 2015, at 4:58 AM, Sean Owen <so...@cloudera.com> wrote:
>
>
>
> On Wed, Nov 11, 2015 at 12:10 AM, Reynold Xin <rx...@databricks.com> wrote:
>
> to the Spark community. A major release should not be very different from a
>
> minor release and should not be gated based on new features. The main
>
> purpose of a major release is an opportunity to fix things that are broken
>
> in the current API and remove certain deprecated APIs (examples follow).
>
>
>
> Agree with this stance. Generally, a major release might also be a
>
> time to replace some big old API or implementation with a new one, but
>
> I don't see obvious candidates.
>
>
>
> I wouldn't mind turning attention to 2.x sooner than later, unless
>
> there's a fairly good reason to continue adding features in 1.x to a
>
> 1.7 release. The scope as of 1.6 is already pretty darned big.
>
>
>
>
>
> 1. Scala 2.11 as the default build. We should still support Scala 2.10, but
>
> it has been end-of-life.
>
>
>
> By the time 2.x rolls around, 2.12 will be the main version, 2.11 will
>
> be quite stable, and 2.10 will have been EOL for a while. I'd propose
>
> dropping 2.10. Otherwise it's supported for 2 more years.
>
>
>
>
>
> 2. Remove Hadoop 1 support.
>
>
>
> I'd go further to drop support for <2.2 for sure (2.0 and 2.1 were
>
> sort of 'alpha' and 'beta' releases) and even <2.6.
>
>
>
> I'm sure we'll think of a number of other small things -- shading a
>
> bunch of stuff? reviewing and updating dependencies in light of
>
> simpler, more recent dependencies to support from Hadoop etc?
>
>
>
> Farming out Tachyon to a module? (I felt like someone proposed this?)
>
> Pop out any Docker stuff to another repo?
>
> Continue that same effort for EC2?
>
> Farming out some of the "external" integrations to another repo (?
>
> controversial)
>
>
>
> See also anything marked version "2+" in JIRA.
>
>
>
> ---------------------------------------------------------------------
>
> To unsubscribe, e-mail: dev-unsubscribe@spark.apache.org
>
> For additional commands, e-mail: dev-help@spark.apache.org
>
>
>
>
>
> ---------------------------------------------------------------------
>
> To unsubscribe, e-mail: dev-unsubscribe@spark.apache.org
>
> For additional commands, e-mail: dev-help@spark.apache.org
>
>
>
>
>
>
>
>
>
> ---------------------------------------------------------------------
>
> To unsubscribe, e-mail: dev-unsubscribe@spark.apache.org
>
> For additional commands, e-mail: dev-help@spark.apache.org
>
>
>

RE: A proposal for Spark 2.0

Posted by "Ulanov, Alexander" <al...@hpe.com>.
Parameter Server is a new feature and thus does not match the goal of 2.0, which is “to fix things that are broken in the current API and remove certain deprecated APIs”. At the same time, I would be happy to have that feature.

With regards to machine learning, it would be great to move useful features from MLlib to ML and deprecate the former. The current structure of two separate machine learning packages seems somewhat confusing.
With regards to GraphX, it would be great to deprecate the use of RDD in GraphX and switch to DataFrame. This will allow GraphX to evolve with Tungsten.
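
For example (a sketch only; the training DataFrame/RDD are assumed to exist
already), the DataFrame-based spark.ml API could become the documented
default while the RDD-based spark.mllib counterpart carries the deprecation:

import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.mllib.classification.LogisticRegressionWithLBFGS
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.DataFrame

// spark.ml: DataFrame in, Pipeline-friendly, eligible for Tungsten.
def fitMl(training: DataFrame) =
  new LogisticRegression().setMaxIter(10).setRegParam(0.01).fit(training)

// spark.mllib: the older RDD-based equivalent that would carry the
// deprecation notice.
def fitMllib(training: RDD[LabeledPoint]) =
  new LogisticRegressionWithLBFGS().setNumClasses(2).run(training)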

Best regards, Alexander

From: Nan Zhu [mailto:zhunanmcgill@gmail.com]
Sent: Thursday, November 12, 2015 7:28 AM
To: witgo@qq.com
Cc: dev@spark.apache.org
Subject: Re: A proposal for Spark 2.0

Being specific to Parameter Server, I think the current agreement is that PS shall exist as a third-party library instead of a component of the core code base, isn’t?

Best,

--
Nan Zhu
http://codingcat.me


On Thursday, November 12, 2015 at 9:49 AM, witgo@qq.com<ma...@qq.com> wrote:
Who has the idea of machine learning? Spark missing some features for machine learning, For example, the parameter server.


在 2015年11月12日,05:32,Matei Zaharia <ma...@gmail.com>> 写道:

I like the idea of popping out Tachyon to an optional component too to reduce the number of dependencies. In the future, it might even be useful to do this for Hadoop, but it requires too many API changes to be worth doing now.

Regarding Scala 2.12, we should definitely support it eventually, but I don't think we need to block 2.0 on that because it can be added later too. Has anyone investigated what it would take to run on there? I imagine we don't need many code changes, just maybe some REPL stuff.

Needless to say, but I'm all for the idea of making "major" releases as undisruptive as possible in the model Reynold proposed. Keeping everyone working with the same set of releases is super important.

Matei

On Nov 11, 2015, at 4:58 AM, Sean Owen <so...@cloudera.com>> wrote:

On Wed, Nov 11, 2015 at 12:10 AM, Reynold Xin <rx...@databricks.com>> wrote:
to the Spark community. A major release should not be very different from a
minor release and should not be gated based on new features. The main
purpose of a major release is an opportunity to fix things that are broken
in the current API and remove certain deprecated APIs (examples follow).

Agree with this stance. Generally, a major release might also be a
time to replace some big old API or implementation with a new one, but
I don't see obvious candidates.

I wouldn't mind turning attention to 2.x sooner than later, unless
there's a fairly good reason to continue adding features in 1.x to a
1.7 release. The scope as of 1.6 is already pretty darned big.


1. Scala 2.11 as the default build. We should still support Scala 2.10, but
it has been end-of-life.

By the time 2.x rolls around, 2.12 will be the main version, 2.11 will
be quite stable, and 2.10 will have been EOL for a while. I'd propose
dropping 2.10. Otherwise it's supported for 2 more years.


2. Remove Hadoop 1 support.

I'd go further to drop support for <2.2 for sure (2.0 and 2.1 were
sort of 'alpha' and 'beta' releases) and even <2.6.

I'm sure we'll think of a number of other small things -- shading a
bunch of stuff? reviewing and updating dependencies in light of
simpler, more recent dependencies to support from Hadoop etc?

Farming out Tachyon to a module? (I felt like someone proposed this?)
Pop out any Docker stuff to another repo?
Continue that same effort for EC2?
Farming out some of the "external" integrations to another repo (?
controversial)

See also anything marked version "2+" in JIRA.

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@spark.apache.org<ma...@spark.apache.org>
For additional commands, e-mail: dev-help@spark.apache.org<ma...@spark.apache.org>


---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@spark.apache.org<ma...@spark.apache.org>
For additional commands, e-mail: dev-help@spark.apache.org<ma...@spark.apache.org>




---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@spark.apache.org<ma...@spark.apache.org>
For additional commands, e-mail: dev-help@spark.apache.org<ma...@spark.apache.org>


Re: A proposal for Spark 2.0

Posted by Nan Zhu <zh...@gmail.com>.
Being specific to Parameter Server, I think the current agreement is that PS shall exist as a third-party library instead of a component of the core code base, isn’t it?

Best,  

--  
Nan Zhu
http://codingcat.me


On Thursday, November 12, 2015 at 9:49 AM, witgo@qq.com wrote:

> Who has the idea of machine learning? Spark missing some features for machine learning, For example, the parameter server.
>  
>  
> > 在 2015年11月12日,05:32,Matei Zaharia <matei.zaharia@gmail.com (mailto:matei.zaharia@gmail.com)> 写道:
> >  
> > I like the idea of popping out Tachyon to an optional component too to reduce the number of dependencies. In the future, it might even be useful to do this for Hadoop, but it requires too many API changes to be worth doing now.
> >  
> > Regarding Scala 2.12, we should definitely support it eventually, but I don't think we need to block 2.0 on that because it can be added later too. Has anyone investigated what it would take to run on there? I imagine we don't need many code changes, just maybe some REPL stuff.
> >  
> > Needless to say, but I'm all for the idea of making "major" releases as undisruptive as possible in the model Reynold proposed. Keeping everyone working with the same set of releases is super important.
> >  
> > Matei
> >  
> > > On Nov 11, 2015, at 4:58 AM, Sean Owen <sowen@cloudera.com (mailto:sowen@cloudera.com)> wrote:
> > >  
> > > On Wed, Nov 11, 2015 at 12:10 AM, Reynold Xin <rxin@databricks.com (mailto:rxin@databricks.com)> wrote:
> > > > to the Spark community. A major release should not be very different from a
> > > > minor release and should not be gated based on new features. The main
> > > > purpose of a major release is an opportunity to fix things that are broken
> > > > in the current API and remove certain deprecated APIs (examples follow).
> > > >  
> > >  
> > >  
> > > Agree with this stance. Generally, a major release might also be a
> > > time to replace some big old API or implementation with a new one, but
> > > I don't see obvious candidates.
> > >  
> > > I wouldn't mind turning attention to 2.x sooner than later, unless
> > > there's a fairly good reason to continue adding features in 1.x to a
> > > 1.7 release. The scope as of 1.6 is already pretty darned big.
> > >  
> > >  
> > > > 1. Scala 2.11 as the default build. We should still support Scala 2.10, but
> > > > it has been end-of-life.
> > > >  
> > >  
> > >  
> > > By the time 2.x rolls around, 2.12 will be the main version, 2.11 will
> > > be quite stable, and 2.10 will have been EOL for a while. I'd propose
> > > dropping 2.10. Otherwise it's supported for 2 more years.
> > >  
> > >  
> > > > 2. Remove Hadoop 1 support.
> > >  
> > > I'd go further to drop support for <2.2 for sure (2.0 and 2.1 were
> > > sort of 'alpha' and 'beta' releases) and even <2.6.
> > >  
> > > I'm sure we'll think of a number of other small things -- shading a
> > > bunch of stuff? reviewing and updating dependencies in light of
> > > simpler, more recent dependencies to support from Hadoop etc?
> > >  
> > > Farming out Tachyon to a module? (I felt like someone proposed this?)
> > > Pop out any Docker stuff to another repo?
> > > Continue that same effort for EC2?
> > > Farming out some of the "external" integrations to another repo (?
> > > controversial)
> > >  
> > > See also anything marked version "2+" in JIRA.
> > >  
> > > ---------------------------------------------------------------------
> > > To unsubscribe, e-mail: dev-unsubscribe@spark.apache.org (mailto:dev-unsubscribe@spark.apache.org)
> > > For additional commands, e-mail: dev-help@spark.apache.org (mailto:dev-help@spark.apache.org)
> > >  
> >  
> >  
> >  
> > ---------------------------------------------------------------------
> > To unsubscribe, e-mail: dev-unsubscribe@spark.apache.org (mailto:dev-unsubscribe@spark.apache.org)
> > For additional commands, e-mail: dev-help@spark.apache.org (mailto:dev-help@spark.apache.org)
> >  
>  
>  
>  
>  
>  
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: dev-unsubscribe@spark.apache.org (mailto:dev-unsubscribe@spark.apache.org)
> For additional commands, e-mail: dev-help@spark.apache.org (mailto:dev-help@spark.apache.org)
>  
>  



Re: A proposal for Spark 2.0

Posted by wi...@qq.com.
Who has ideas about machine learning? Spark is missing some features for machine learning, for example a parameter server.


> 在 2015年11月12日,05:32,Matei Zaharia <ma...@gmail.com> 写道:
> 
> I like the idea of popping out Tachyon to an optional component too to reduce the number of dependencies. In the future, it might even be useful to do this for Hadoop, but it requires too many API changes to be worth doing now.
> 
> Regarding Scala 2.12, we should definitely support it eventually, but I don't think we need to block 2.0 on that because it can be added later too. Has anyone investigated what it would take to run on there? I imagine we don't need many code changes, just maybe some REPL stuff.
> 
> Needless to say, but I'm all for the idea of making "major" releases as undisruptive as possible in the model Reynold proposed. Keeping everyone working with the same set of releases is super important.
> 
> Matei
> 
>> On Nov 11, 2015, at 4:58 AM, Sean Owen <so...@cloudera.com> wrote:
>> 
>> On Wed, Nov 11, 2015 at 12:10 AM, Reynold Xin <rx...@databricks.com> wrote:
>>> to the Spark community. A major release should not be very different from a
>>> minor release and should not be gated based on new features. The main
>>> purpose of a major release is an opportunity to fix things that are broken
>>> in the current API and remove certain deprecated APIs (examples follow).
>> 
>> Agree with this stance. Generally, a major release might also be a
>> time to replace some big old API or implementation with a new one, but
>> I don't see obvious candidates.
>> 
>> I wouldn't mind turning attention to 2.x sooner than later, unless
>> there's a fairly good reason to continue adding features in 1.x to a
>> 1.7 release. The scope as of 1.6 is already pretty darned big.
>> 
>> 
>>> 1. Scala 2.11 as the default build. We should still support Scala 2.10, but
>>> it has been end-of-life.
>> 
>> By the time 2.x rolls around, 2.12 will be the main version, 2.11 will
>> be quite stable, and 2.10 will have been EOL for a while. I'd propose
>> dropping 2.10. Otherwise it's supported for 2 more years.
>> 
>> 
>>> 2. Remove Hadoop 1 support.
>> 
>> I'd go further to drop support for <2.2 for sure (2.0 and 2.1 were
>> sort of 'alpha' and 'beta' releases) and even <2.6.
>> 
>> I'm sure we'll think of a number of other small things -- shading a
>> bunch of stuff? reviewing and updating dependencies in light of
>> simpler, more recent dependencies to support from Hadoop etc?
>> 
>> Farming out Tachyon to a module? (I felt like someone proposed this?)
>> Pop out any Docker stuff to another repo?
>> Continue that same effort for EC2?
>> Farming out some of the "external" integrations to another repo (?
>> controversial)
>> 
>> See also anything marked version "2+" in JIRA.
>> 
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: dev-unsubscribe@spark.apache.org
>> For additional commands, e-mail: dev-help@spark.apache.org
>> 
> 
> 
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: dev-unsubscribe@spark.apache.org
> For additional commands, e-mail: dev-help@spark.apache.org
> 




---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@spark.apache.org
For additional commands, e-mail: dev-help@spark.apache.org


Re: A proposal for Spark 2.0

Posted by Prashant Sharma <sc...@gmail.com>.
Hey Matei,


> Regarding Scala 2.12, we should definitely support it eventually, but I
> don't think we need to block 2.0 on that because it can be added later too.
> Has anyone investigated what it would take to run on there? I imagine we
> don't need many code changes, just maybe some REPL stuff.


Our REPL-specific changes were merged into scala/scala and are available as
part of 2.11.7, and will hopefully be part of 2.12 too. If I am not wrong,
the REPL stuff is taken care of; we don't need to keep upgrading REPL code
for every Scala release now. http://www.scala-lang.org/news/2.11.7

I am +1 on the proposal for Spark 2.0.

Thanks,


Prashant Sharma



On Thu, Nov 12, 2015 at 3:02 AM, Matei Zaharia <ma...@gmail.com>
wrote:

> I like the idea of popping out Tachyon to an optional component too to
> reduce the number of dependencies. In the future, it might even be useful
> to do this for Hadoop, but it requires too many API changes to be worth
> doing now.
>
> Regarding Scala 2.12, we should definitely support it eventually, but I
> don't think we need to block 2.0 on that because it can be added later too.
> Has anyone investigated what it would take to run on there? I imagine we
> don't need many code changes, just maybe some REPL stuff.
>
> Needless to say, but I'm all for the idea of making "major" releases as
> undisruptive as possible in the model Reynold proposed. Keeping everyone
> working with the same set of releases is super important.
>
> Matei
>
> > On Nov 11, 2015, at 4:58 AM, Sean Owen <so...@cloudera.com> wrote:
> >
> > On Wed, Nov 11, 2015 at 12:10 AM, Reynold Xin <rx...@databricks.com>
> wrote:
> >> to the Spark community. A major release should not be very different
> from a
> >> minor release and should not be gated based on new features. The main
> >> purpose of a major release is an opportunity to fix things that are
> broken
> >> in the current API and remove certain deprecated APIs (examples follow).
> >
> > Agree with this stance. Generally, a major release might also be a
> > time to replace some big old API or implementation with a new one, but
> > I don't see obvious candidates.
> >
> > I wouldn't mind turning attention to 2.x sooner than later, unless
> > there's a fairly good reason to continue adding features in 1.x to a
> > 1.7 release. The scope as of 1.6 is already pretty darned big.
> >
> >
> >> 1. Scala 2.11 as the default build. We should still support Scala 2.10,
> but
> >> it has been end-of-life.
> >
> > By the time 2.x rolls around, 2.12 will be the main version, 2.11 will
> > be quite stable, and 2.10 will have been EOL for a while. I'd propose
> > dropping 2.10. Otherwise it's supported for 2 more years.
> >
> >
> >> 2. Remove Hadoop 1 support.
> >
> > I'd go further to drop support for <2.2 for sure (2.0 and 2.1 were
> > sort of 'alpha' and 'beta' releases) and even <2.6.
> >
> > I'm sure we'll think of a number of other small things -- shading a
> > bunch of stuff? reviewing and updating dependencies in light of
> > simpler, more recent dependencies to support from Hadoop etc?
> >
> > Farming out Tachyon to a module? (I felt like someone proposed this?)
> > Pop out any Docker stuff to another repo?
> > Continue that same effort for EC2?
> > Farming out some of the "external" integrations to another repo (?
> > controversial)
> >
> > See also anything marked version "2+" in JIRA.
> >
> > ---------------------------------------------------------------------
> > To unsubscribe, e-mail: dev-unsubscribe@spark.apache.org
> > For additional commands, e-mail: dev-help@spark.apache.org
> >
>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: dev-unsubscribe@spark.apache.org
> For additional commands, e-mail: dev-help@spark.apache.org
>
>

Re: A proposal for Spark 2.0

Posted by Matei Zaharia <ma...@gmail.com>.
I like the idea of popping out Tachyon to an optional component too to reduce the number of dependencies. In the future, it might even be useful to do this for Hadoop, but it requires too many API changes to be worth doing now.

Regarding Scala 2.12, we should definitely support it eventually, but I don't think we need to block 2.0 on that because it can be added later too. Has anyone investigated what it would take to run on there? I imagine we don't need many code changes, just maybe some REPL stuff.

Needless to say, but I'm all for the idea of making "major" releases as undisruptive as possible in the model Reynold proposed. Keeping everyone working with the same set of releases is super important.

Matei

> On Nov 11, 2015, at 4:58 AM, Sean Owen <so...@cloudera.com> wrote:
> 
> On Wed, Nov 11, 2015 at 12:10 AM, Reynold Xin <rx...@databricks.com> wrote:
>> to the Spark community. A major release should not be very different from a
>> minor release and should not be gated based on new features. The main
>> purpose of a major release is an opportunity to fix things that are broken
>> in the current API and remove certain deprecated APIs (examples follow).
> 
> Agree with this stance. Generally, a major release might also be a
> time to replace some big old API or implementation with a new one, but
> I don't see obvious candidates.
> 
> I wouldn't mind turning attention to 2.x sooner than later, unless
> there's a fairly good reason to continue adding features in 1.x to a
> 1.7 release. The scope as of 1.6 is already pretty darned big.
> 
> 
>> 1. Scala 2.11 as the default build. We should still support Scala 2.10, but
>> it has been end-of-life.
> 
> By the time 2.x rolls around, 2.12 will be the main version, 2.11 will
> be quite stable, and 2.10 will have been EOL for a while. I'd propose
> dropping 2.10. Otherwise it's supported for 2 more years.
> 
> 
>> 2. Remove Hadoop 1 support.
> 
> I'd go further to drop support for <2.2 for sure (2.0 and 2.1 were
> sort of 'alpha' and 'beta' releases) and even <2.6.
> 
> I'm sure we'll think of a number of other small things -- shading a
> bunch of stuff? reviewing and updating dependencies in light of
> simpler, more recent dependencies to support from Hadoop etc?
> 
> Farming out Tachyon to a module? (I felt like someone proposed this?)
> Pop out any Docker stuff to another repo?
> Continue that same effort for EC2?
> Farming out some of the "external" integrations to another repo (?
> controversial)
> 
> See also anything marked version "2+" in JIRA.
> 
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: dev-unsubscribe@spark.apache.org
> For additional commands, e-mail: dev-help@spark.apache.org
> 


---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@spark.apache.org
For additional commands, e-mail: dev-help@spark.apache.org


Re: A proposal for Spark 2.0

Posted by Sean Owen <so...@cloudera.com>.
On Wed, Nov 11, 2015 at 12:10 AM, Reynold Xin <rx...@databricks.com> wrote:
> to the Spark community. A major release should not be very different from a
> minor release and should not be gated based on new features. The main
> purpose of a major release is an opportunity to fix things that are broken
> in the current API and remove certain deprecated APIs (examples follow).

Agree with this stance. Generally, a major release might also be a
time to replace some big old API or implementation with a new one, but
I don't see obvious candidates.

I wouldn't mind turning attention to 2.x sooner than later, unless
there's a fairly good reason to continue adding features in 1.x to a
1.7 release. The scope as of 1.6 is already pretty darned big.


> 1. Scala 2.11 as the default build. We should still support Scala 2.10, but
> it has been end-of-life.

By the time 2.x rolls around, 2.12 will be the main version, 2.11 will
be quite stable, and 2.10 will have been EOL for a while. I'd propose
dropping 2.10. Otherwise it's supported for 2 more years.


> 2. Remove Hadoop 1 support.

I'd go further to drop support for <2.2 for sure (2.0 and 2.1 were
sort of 'alpha' and 'beta' releases) and even <2.6.

I'm sure we'll think of a number of other small things -- shading a
bunch of stuff? reviewing and updating dependencies in light of
simpler, more recent dependencies to support from Hadoop etc?

Farming out Tachyon to a module? (I felt like someone proposed this?)
Pop out any Docker stuff to another repo?
Continue that same effort for EC2?
Farming out some of the "external" integrations to another repo (?
controversial)

See also anything marked version "2+" in JIRA.

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@spark.apache.org
For additional commands, e-mail: dev-help@spark.apache.org


Re: A proposal for Spark 2.0

Posted by Tao Wang <wa...@163.com>.
How about the Hive dependency? We use the ThriftServer, the serdes, and even
the parser/execution logic from Hive. What direction will we take on this
part?
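
For context, today the coupling is as direct as the following sketch (the
table name is made up; `sc` is an existing SparkContext, e.g. in spark-shell):

import org.apache.spark.sql.hive.HiveContext

// HiveContext wires in Hive's parser, serdes and metastore client.
val hiveContext = new HiveContext(sc)
hiveContext.sql("SELECT count(*) FROM some_table").show()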



--
View this message in context: http://apache-spark-developers-list.1001551.n3.nabble.com/A-proposal-for-Spark-2-0-tp15122p15793.html
Sent from the Apache Spark Developers List mailing list archive at Nabble.com.

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@spark.apache.org
For additional commands, e-mail: dev-help@spark.apache.org


Re: A proposal for Spark 2.0

Posted by Jonathan Kelly <jo...@gmail.com>.
If Scala 2.12 will require Java 8 and we want to enable cross-compiling
Spark against Scala 2.11 and 2.12, couldn't we just make Java 8 a
requirement if you want to use Scala 2.12?
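
For concreteness, a minimal sketch of how that gating could be expressed
with plain sbt settings (purely illustrative -- the version numbers are
assumptions and this is not Spark's actual build):

    // build.sbt (sketch): cross-build several Scala versions, but only the
    // Scala 2.12 variant targets Java 8 bytecode; 2.10/2.11 stay on Java 7.
    crossScalaVersions := Seq("2.10.5", "2.11.7", "2.12.0-M3")

    javacOptions ++= {
      if (scalaBinaryVersion.value.startsWith("2.12")) Seq("-source", "1.8", "-target", "1.8")
      else Seq("-source", "1.7", "-target", "1.7")
    }

    scalacOptions += {
      if (scalaBinaryVersion.value.startsWith("2.12")) "-target:jvm-1.8" else "-target:jvm-1.7"
    }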

On Wed, Nov 11, 2015 at 9:29 AM, Koert Kuipers <ko...@tresata.com> wrote:

> i would drop scala 2.10, but definitely keep java 7
>
> cross build for scala 2.12 is great, but i dont know how that works with
> java 8 requirement. dont want to make java 8 mandatory.
>
> and probably stating the obvious, but a lot of apis got polluted due to
> binary compatibility requirement. cleaning that up assuming only source
> compatibility would be a good idea, right?
>
> On Tue, Nov 10, 2015 at 6:10 PM, Reynold Xin <rx...@databricks.com> wrote:
>
>> I’m starting a new thread since the other one got intermixed with feature
>> requests. Please refrain from making feature request in this thread. Not
>> that we shouldn’t be adding features, but we can always add features in
>> 1.7, 2.1, 2.2, ...
>>
>> First - I want to propose a premise for how to think about Spark 2.0 and
>> major releases in Spark, based on discussion with several members of the
>> community: a major release should be low overhead and minimally disruptive
>> to the Spark community. A major release should not be very different from a
>> minor release and should not be gated based on new features. The main
>> purpose of a major release is an opportunity to fix things that are broken
>> in the current API and remove certain deprecated APIs (examples follow).
>>
>> For this reason, I would *not* propose doing major releases to break
>> substantial API's or perform large re-architecting that prevent users from
>> upgrading. Spark has always had a culture of evolving architecture
>> incrementally and making changes - and I don't think we want to change this
>> model. In fact, we’ve released many architectural changes on the 1.X line.
>>
>> If the community likes the above model, then to me it seems reasonable to
>> do Spark 2.0 either after Spark 1.6 (in lieu of Spark 1.7) or immediately
>> after Spark 1.7. It will be 18 or 21 months since Spark 1.0. A cadence of
>> major releases every 2 years seems doable within the above model.
>>
>> Under this model, here is a list of example things I would propose doing
>> in Spark 2.0, separated into APIs and Operation/Deployment:
>>
>>
>> APIs
>>
>> 1. Remove interfaces, configs, and modules (e.g. Bagel) deprecated in
>> Spark 1.x.
>>
>> 2. Remove Akka from Spark’s API dependency (in streaming), so user
>> applications can use Akka (SPARK-5293). We have gotten a lot of complaints
>> about user applications being unable to use Akka due to Spark’s dependency
>> on Akka.
>>
>> 3. Remove Guava from Spark’s public API (JavaRDD Optional).
>>
>> 4. Better class package structure for low level developer API’s. In
>> particular, we have some DeveloperApi (mostly various listener-related
>> classes) added over the years. Some packages include only one or two public
>> classes but a lot of private classes. A better structure is to have public
>> classes isolated to a few public packages, and these public packages should
>> have minimal private classes for low level developer APIs.
>>
>> 5. Consolidate task metric and accumulator API. Although having some
>> subtle differences, these two are very similar but have completely
>> different code path.
>>
>> 6. Possibly making Catalyst, Dataset, and DataFrame more general by
>> moving them to other package(s). They are already used beyond SQL, e.g. in
>> ML pipelines, and will be used by streaming also.
>>
>>
>> Operation/Deployment
>>
>> 1. Scala 2.11 as the default build. We should still support Scala 2.10,
>> but it has been end-of-life.
>>
>> 2. Remove Hadoop 1 support.
>>
>> 3. Assembly-free distribution of Spark: don’t require building an
>> enormous assembly jar in order to run Spark.
>>
>>
>

Re: A proposal for Spark 2.0

Posted by Koert Kuipers <ko...@tresata.com>.
I would drop Scala 2.10, but definitely keep Java 7.

A cross-build for Scala 2.12 is great, but I don't know how that works
with the Java 8 requirement; I don't want to make Java 8 mandatory.

And, probably stating the obvious, a lot of APIs got polluted due to the
binary compatibility requirement. Cleaning that up, assuming only source
compatibility, would be a good idea, right?
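
To make that concrete, a small hypothetical illustration (not actual Spark
code) of the kind of clutter binary compatibility forces, and what
source-only compatibility would let us drop:

    // Replacing save(String) with a default parameter is source-compatible but
    // binary-incompatible: the original save(String) signature disappears from
    // the bytecode, so callers compiled against 1.x would break. Hence the
    // extra overload has to stick around.
    class WriterSketch {
      // Frozen 1.x signature, kept only for binary compatibility.
      def save(path: String): Unit = save(path, overwrite = false)

      // The shape we actually want callers to use.
      def save(path: String, overwrite: Boolean): Unit =
        println(s"writing to $path (overwrite=$overwrite)")

      // Under source-only compatibility, a single
      //   def save(path: String, overwrite: Boolean = false): Unit
      // would be enough, and the extra overload could be dropped in 2.0.
    }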

On Tue, Nov 10, 2015 at 6:10 PM, Reynold Xin <rx...@databricks.com> wrote:

> I’m starting a new thread since the other one got intermixed with feature
> requests. Please refrain from making feature request in this thread. Not
> that we shouldn’t be adding features, but we can always add features in
> 1.7, 2.1, 2.2, ...
>
> First - I want to propose a premise for how to think about Spark 2.0 and
> major releases in Spark, based on discussion with several members of the
> community: a major release should be low overhead and minimally disruptive
> to the Spark community. A major release should not be very different from a
> minor release and should not be gated based on new features. The main
> purpose of a major release is an opportunity to fix things that are broken
> in the current API and remove certain deprecated APIs (examples follow).
>
> For this reason, I would *not* propose doing major releases to break
> substantial API's or perform large re-architecting that prevent users from
> upgrading. Spark has always had a culture of evolving architecture
> incrementally and making changes - and I don't think we want to change this
> model. In fact, we’ve released many architectural changes on the 1.X line.
>
> If the community likes the above model, then to me it seems reasonable to
> do Spark 2.0 either after Spark 1.6 (in lieu of Spark 1.7) or immediately
> after Spark 1.7. It will be 18 or 21 months since Spark 1.0. A cadence of
> major releases every 2 years seems doable within the above model.
>
> Under this model, here is a list of example things I would propose doing
> in Spark 2.0, separated into APIs and Operation/Deployment:
>
>
> APIs
>
> 1. Remove interfaces, configs, and modules (e.g. Bagel) deprecated in
> Spark 1.x.
>
> 2. Remove Akka from Spark’s API dependency (in streaming), so user
> applications can use Akka (SPARK-5293). We have gotten a lot of complaints
> about user applications being unable to use Akka due to Spark’s dependency
> on Akka.
>
> 3. Remove Guava from Spark’s public API (JavaRDD Optional).
>
> 4. Better class package structure for low level developer API’s. In
> particular, we have some DeveloperApi (mostly various listener-related
> classes) added over the years. Some packages include only one or two public
> classes but a lot of private classes. A better structure is to have public
> classes isolated to a few public packages, and these public packages should
> have minimal private classes for low level developer APIs.
>
> 5. Consolidate task metric and accumulator API. Although having some
> subtle differences, these two are very similar but have completely
> different code path.
>
> 6. Possibly making Catalyst, Dataset, and DataFrame more general by moving
> them to other package(s). They are already used beyond SQL, e.g. in ML
> pipelines, and will be used by streaming also.
>
>
> Operation/Deployment
>
> 1. Scala 2.11 as the default build. We should still support Scala 2.10,
> but it has been end-of-life.
>
> 2. Remove Hadoop 1 support.
>
> 3. Assembly-free distribution of Spark: don’t require building an enormous
> assembly jar in order to run Spark.
>
>

Re: A proposal for Spark 2.0

Posted by Zoltán Zvara <zo...@gmail.com>.
Hi,

Reconsidering the execution model behind Streaming would be a good
candidate here, as Spark will not be able to provide the low latency and
sophisticated windowing semantics that more and more use cases will
require. Maybe relaxing the strict batch model would help a lot. (Mainly
this would hit the shuffling, but the shuffle package suffers from
overlapping functionality and a lack of modularity anyway. Look at how
coalesce is implemented, for example - inefficiency also kicks in there.)
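
For reference, a minimal sketch of the coalesce API being referred to
(standard RDD usage, purely illustrative; nothing here is a proposed
change):

    import org.apache.spark.{SparkConf, SparkContext}

    object CoalesceSketch {
      def main(args: Array[String]): Unit = {
        val conf = new SparkConf().setAppName("coalesce-sketch").setMaster("local[4]")
        val sc = new SparkContext(conf)
        val rdd = sc.parallelize(1 to 1000, numSlices = 100)

        // Narrow dependency: partitions are merged in place, no shuffle involved.
        val merged = rdd.coalesce(10)
        // Full shuffle: equivalent to repartition(10).
        val reshuffled = rdd.coalesce(10, shuffle = true)

        println(merged.partitions.length + " / " + reshuffled.partitions.length)
        sc.stop()
      }
    }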

On Wed, Nov 11, 2015 at 12:48 PM Tim Preece <te...@mail.com> wrote:

> Considering Spark 2.x will run for 2 years, would moving up to Scala 2.12 (
> pencilled in for Jan 2016 ) make any sense ? - although that would then
> pre-req Java 8.
>
>
>
> --
> View this message in context:
> http://apache-spark-developers-list.1001551.n3.nabble.com/A-proposal-for-Spark-2-0-tp15122p15153.html
> Sent from the Apache Spark Developers List mailing list archive at
> Nabble.com.
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: dev-unsubscribe@spark.apache.org
> For additional commands, e-mail: dev-help@spark.apache.org
>
>

Re: A proposal for Spark 2.0

Posted by Tim Preece <te...@mail.com>.
Considering Spark 2.x will run for 2 years, would moving up to Scala 2.12
(pencilled in for Jan 2016) make any sense? Although that would then make
Java 8 a prerequisite.



--
View this message in context: http://apache-spark-developers-list.1001551.n3.nabble.com/A-proposal-for-Spark-2-0-tp15122p15153.html
Sent from the Apache Spark Developers List mailing list archive at Nabble.com.

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@spark.apache.org
For additional commands, e-mail: dev-help@spark.apache.org


Re: A proposal for Spark 2.0

Posted by Mridul Muralidharan <mr...@gmail.com>.
It would also be good to fix API breakages introduced as part of 1.0
(where functionality is now missing), to overhaul and remove all
deprecated configs/features/combinations, and to make the changes to the
public API that we need but have had to defer during minor releases.

Regards,
Mridul

On Tue, Nov 10, 2015 at 3:10 PM, Reynold Xin <rx...@databricks.com> wrote:
> I’m starting a new thread since the other one got intermixed with feature
> requests. Please refrain from making feature request in this thread. Not
> that we shouldn’t be adding features, but we can always add features in 1.7,
> 2.1, 2.2, ...
>
> First - I want to propose a premise for how to think about Spark 2.0 and
> major releases in Spark, based on discussion with several members of the
> community: a major release should be low overhead and minimally disruptive
> to the Spark community. A major release should not be very different from a
> minor release and should not be gated based on new features. The main
> purpose of a major release is an opportunity to fix things that are broken
> in the current API and remove certain deprecated APIs (examples follow).
>
> For this reason, I would *not* propose doing major releases to break
> substantial API's or perform large re-architecting that prevent users from
> upgrading. Spark has always had a culture of evolving architecture
> incrementally and making changes - and I don't think we want to change this
> model. In fact, we’ve released many architectural changes on the 1.X line.
>
> If the community likes the above model, then to me it seems reasonable to do
> Spark 2.0 either after Spark 1.6 (in lieu of Spark 1.7) or immediately after
> Spark 1.7. It will be 18 or 21 months since Spark 1.0. A cadence of major
> releases every 2 years seems doable within the above model.
>
> Under this model, here is a list of example things I would propose doing in
> Spark 2.0, separated into APIs and Operation/Deployment:
>
>
> APIs
>
> 1. Remove interfaces, configs, and modules (e.g. Bagel) deprecated in Spark
> 1.x.
>
> 2. Remove Akka from Spark’s API dependency (in streaming), so user
> applications can use Akka (SPARK-5293). We have gotten a lot of complaints
> about user applications being unable to use Akka due to Spark’s dependency
> on Akka.
>
> 3. Remove Guava from Spark’s public API (JavaRDD Optional).
>
> 4. Better class package structure for low level developer API’s. In
> particular, we have some DeveloperApi (mostly various listener-related
> classes) added over the years. Some packages include only one or two public
> classes but a lot of private classes. A better structure is to have public
> classes isolated to a few public packages, and these public packages should
> have minimal private classes for low level developer APIs.
>
> 5. Consolidate task metric and accumulator API. Although having some subtle
> differences, these two are very similar but have completely different code
> path.
>
> 6. Possibly making Catalyst, Dataset, and DataFrame more general by moving
> them to other package(s). They are already used beyond SQL, e.g. in ML
> pipelines, and will be used by streaming also.
>
>
> Operation/Deployment
>
> 1. Scala 2.11 as the default build. We should still support Scala 2.10, but
> it has been end-of-life.
>
> 2. Remove Hadoop 1 support.
>
> 3. Assembly-free distribution of Spark: don’t require building an enormous
> assembly jar in order to run Spark.
>

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@spark.apache.org
For additional commands, e-mail: dev-help@spark.apache.org


Re: A proposal for Spark 2.0

Posted by Nicholas Chammas <ni...@gmail.com>.
Yeah, I'd also favor maintaining docs with strictly temporary relevance on
JIRA when possible. The wiki is like this weird backwater I only rarely
visit.

Don't we typically do this kind of stuff with an umbrella issue on JIRA?
Tom, wouldn't that work well for you?

Nick

On Wed, Dec 23, 2015 at 5:06 AM Sean Owen <so...@cloudera.com> wrote:

> I think this will be hard to maintain; we already have JIRA as the de
> facto central place to store discussions and prioritize work, and the
> 2.x stuff is already a JIRA. The wiki doesn't really hurt, just
> probably will never be looked at again. Let's point people in all
> cases to JIRA.
>
> On Tue, Dec 22, 2015 at 11:52 PM, Reynold Xin <rx...@databricks.com> wrote:
> > I started a wiki page:
> >
> https://cwiki.apache.org/confluence/display/SPARK/Development+Discussions
> >
> >
> > On Tue, Dec 22, 2015 at 6:27 AM, Tom Graves <tg...@yahoo.com>
> wrote:
> >>
> >> Do we have a summary of all the discussions and what is planned for 2.0
> >> then?  Perhaps we should put on the wiki for reference.
> >>
> >> Tom
> >>
> >>
> >> On Tuesday, December 22, 2015 12:12 AM, Reynold Xin <
> rxin@databricks.com>
> >> wrote:
> >>
> >>
> >> FYI I updated the master branch's Spark version to 2.0.0-SNAPSHOT.
> >>
> >> On Tue, Nov 10, 2015 at 3:10 PM, Reynold Xin <rx...@databricks.com>
> wrote:
> >>
> >> I’m starting a new thread since the other one got intermixed with
> feature
> >> requests. Please refrain from making feature request in this thread. Not
> >> that we shouldn’t be adding features, but we can always add features in
> 1.7,
> >> 2.1, 2.2, ...
> >>
> >> First - I want to propose a premise for how to think about Spark 2.0 and
> >> major releases in Spark, based on discussion with several members of the
> >> community: a major release should be low overhead and minimally
> disruptive
> >> to the Spark community. A major release should not be very different
> from a
> >> minor release and should not be gated based on new features. The main
> >> purpose of a major release is an opportunity to fix things that are
> broken
> >> in the current API and remove certain deprecated APIs (examples follow).
> >>
> >> For this reason, I would *not* propose doing major releases to break
> >> substantial API's or perform large re-architecting that prevent users
> from
> >> upgrading. Spark has always had a culture of evolving architecture
> >> incrementally and making changes - and I don't think we want to change
> this
> >> model. In fact, we’ve released many architectural changes on the 1.X
> line.
> >>
> >> If the community likes the above model, then to me it seems reasonable
> to
> >> do Spark 2.0 either after Spark 1.6 (in lieu of Spark 1.7) or
> immediately
> >> after Spark 1.7. It will be 18 or 21 months since Spark 1.0. A cadence
> of
> >> major releases every 2 years seems doable within the above model.
> >>
> >> Under this model, here is a list of example things I would propose doing
> >> in Spark 2.0, separated into APIs and Operation/Deployment:
> >>
> >>
> >> APIs
> >>
> >> 1. Remove interfaces, configs, and modules (e.g. Bagel) deprecated in
> >> Spark 1.x.
> >>
> >> 2. Remove Akka from Spark’s API dependency (in streaming), so user
> >> applications can use Akka (SPARK-5293). We have gotten a lot of
> complaints
> >> about user applications being unable to use Akka due to Spark’s
> dependency
> >> on Akka.
> >>
> >> 3. Remove Guava from Spark’s public API (JavaRDD Optional).
> >>
> >> 4. Better class package structure for low level developer API’s. In
> >> particular, we have some DeveloperApi (mostly various listener-related
> >> classes) added over the years. Some packages include only one or two
> public
> >> classes but a lot of private classes. A better structure is to have
> public
> >> classes isolated to a few public packages, and these public packages
> should
> >> have minimal private classes for low level developer APIs.
> >>
> >> 5. Consolidate task metric and accumulator API. Although having some
> >> subtle differences, these two are very similar but have completely
> different
> >> code path.
> >>
> >> 6. Possibly making Catalyst, Dataset, and DataFrame more general by
> moving
> >> them to other package(s). They are already used beyond SQL, e.g. in ML
> >> pipelines, and will be used by streaming also.
> >>
> >>
> >> Operation/Deployment
> >>
> >> 1. Scala 2.11 as the default build. We should still support Scala 2.10,
> >> but it has been end-of-life.
> >>
> >> 2. Remove Hadoop 1 support.
> >>
> >> 3. Assembly-free distribution of Spark: don’t require building an
> enormous
> >> assembly jar in order to run Spark.
> >>
> >>
> >>
> >>
> >
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: dev-unsubscribe@spark.apache.org
> For additional commands, e-mail: dev-help@spark.apache.org
>
>

Re: A proposal for Spark 2.0

Posted by Sean Owen <so...@cloudera.com>.
I think this will be hard to maintain; we already have JIRA as the de
facto central place to store discussions and prioritize work, and the
2.x stuff is already a JIRA. The wiki doesn't really hurt; it just
probably will never be looked at again. Let's point people in all
cases to JIRA.

On Tue, Dec 22, 2015 at 11:52 PM, Reynold Xin <rx...@databricks.com> wrote:
> I started a wiki page:
> https://cwiki.apache.org/confluence/display/SPARK/Development+Discussions
>
>
> On Tue, Dec 22, 2015 at 6:27 AM, Tom Graves <tg...@yahoo.com> wrote:
>>
>> Do we have a summary of all the discussions and what is planned for 2.0
>> then?  Perhaps we should put on the wiki for reference.
>>
>> Tom
>>
>>
>> On Tuesday, December 22, 2015 12:12 AM, Reynold Xin <rx...@databricks.com>
>> wrote:
>>
>>
>> FYI I updated the master branch's Spark version to 2.0.0-SNAPSHOT.
>>
>> On Tue, Nov 10, 2015 at 3:10 PM, Reynold Xin <rx...@databricks.com> wrote:
>>
>> I’m starting a new thread since the other one got intermixed with feature
>> requests. Please refrain from making feature request in this thread. Not
>> that we shouldn’t be adding features, but we can always add features in 1.7,
>> 2.1, 2.2, ...
>>
>> First - I want to propose a premise for how to think about Spark 2.0 and
>> major releases in Spark, based on discussion with several members of the
>> community: a major release should be low overhead and minimally disruptive
>> to the Spark community. A major release should not be very different from a
>> minor release and should not be gated based on new features. The main
>> purpose of a major release is an opportunity to fix things that are broken
>> in the current API and remove certain deprecated APIs (examples follow).
>>
>> For this reason, I would *not* propose doing major releases to break
>> substantial API's or perform large re-architecting that prevent users from
>> upgrading. Spark has always had a culture of evolving architecture
>> incrementally and making changes - and I don't think we want to change this
>> model. In fact, we’ve released many architectural changes on the 1.X line.
>>
>> If the community likes the above model, then to me it seems reasonable to
>> do Spark 2.0 either after Spark 1.6 (in lieu of Spark 1.7) or immediately
>> after Spark 1.7. It will be 18 or 21 months since Spark 1.0. A cadence of
>> major releases every 2 years seems doable within the above model.
>>
>> Under this model, here is a list of example things I would propose doing
>> in Spark 2.0, separated into APIs and Operation/Deployment:
>>
>>
>> APIs
>>
>> 1. Remove interfaces, configs, and modules (e.g. Bagel) deprecated in
>> Spark 1.x.
>>
>> 2. Remove Akka from Spark’s API dependency (in streaming), so user
>> applications can use Akka (SPARK-5293). We have gotten a lot of complaints
>> about user applications being unable to use Akka due to Spark’s dependency
>> on Akka.
>>
>> 3. Remove Guava from Spark’s public API (JavaRDD Optional).
>>
>> 4. Better class package structure for low level developer API’s. In
>> particular, we have some DeveloperApi (mostly various listener-related
>> classes) added over the years. Some packages include only one or two public
>> classes but a lot of private classes. A better structure is to have public
>> classes isolated to a few public packages, and these public packages should
>> have minimal private classes for low level developer APIs.
>>
>> 5. Consolidate task metric and accumulator API. Although having some
>> subtle differences, these two are very similar but have completely different
>> code path.
>>
>> 6. Possibly making Catalyst, Dataset, and DataFrame more general by moving
>> them to other package(s). They are already used beyond SQL, e.g. in ML
>> pipelines, and will be used by streaming also.
>>
>>
>> Operation/Deployment
>>
>> 1. Scala 2.11 as the default build. We should still support Scala 2.10,
>> but it has been end-of-life.
>>
>> 2. Remove Hadoop 1 support.
>>
>> 3. Assembly-free distribution of Spark: don’t require building an enormous
>> assembly jar in order to run Spark.
>>
>>
>>
>>
>

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@spark.apache.org
For additional commands, e-mail: dev-help@spark.apache.org


Re: A proposal for Spark 2.0

Posted by Reynold Xin <rx...@databricks.com>.
I started a wiki page:
https://cwiki.apache.org/confluence/display/SPARK/Development+Discussions


On Tue, Dec 22, 2015 at 6:27 AM, Tom Graves <tg...@yahoo.com> wrote:

> Do we have a summary of all the discussions and what is planned for 2.0
> then?  Perhaps we should put on the wiki for reference.
>
> Tom
>
>
> On Tuesday, December 22, 2015 12:12 AM, Reynold Xin <rx...@databricks.com>
> wrote:
>
>
> FYI I updated the master branch's Spark version to 2.0.0-SNAPSHOT.
>
> On Tue, Nov 10, 2015 at 3:10 PM, Reynold Xin <rx...@databricks.com> wrote:
>
> I’m starting a new thread since the other one got intermixed with feature
> requests. Please refrain from making feature request in this thread. Not
> that we shouldn’t be adding features, but we can always add features in
> 1.7, 2.1, 2.2, ...
>
> First - I want to propose a premise for how to think about Spark 2.0 and
> major releases in Spark, based on discussion with several members of the
> community: a major release should be low overhead and minimally disruptive
> to the Spark community. A major release should not be very different from a
> minor release and should not be gated based on new features. The main
> purpose of a major release is an opportunity to fix things that are broken
> in the current API and remove certain deprecated APIs (examples follow).
>
> For this reason, I would *not* propose doing major releases to break
> substantial API's or perform large re-architecting that prevent users from
> upgrading. Spark has always had a culture of evolving architecture
> incrementally and making changes - and I don't think we want to change this
> model. In fact, we’ve released many architectural changes on the 1.X line.
>
> If the community likes the above model, then to me it seems reasonable to
> do Spark 2.0 either after Spark 1.6 (in lieu of Spark 1.7) or immediately
> after Spark 1.7. It will be 18 or 21 months since Spark 1.0. A cadence of
> major releases every 2 years seems doable within the above model.
>
> Under this model, here is a list of example things I would propose doing
> in Spark 2.0, separated into APIs and Operation/Deployment:
>
>
> APIs
>
> 1. Remove interfaces, configs, and modules (e.g. Bagel) deprecated in
> Spark 1.x.
>
> 2. Remove Akka from Spark’s API dependency (in streaming), so user
> applications can use Akka (SPARK-5293). We have gotten a lot of complaints
> about user applications being unable to use Akka due to Spark’s dependency
> on Akka.
>
> 3. Remove Guava from Spark’s public API (JavaRDD Optional).
>
> 4. Better class package structure for low level developer API’s. In
> particular, we have some DeveloperApi (mostly various listener-related
> classes) added over the years. Some packages include only one or two public
> classes but a lot of private classes. A better structure is to have public
> classes isolated to a few public packages, and these public packages should
> have minimal private classes for low level developer APIs.
>
> 5. Consolidate task metric and accumulator API. Although having some
> subtle differences, these two are very similar but have completely
> different code path.
>
> 6. Possibly making Catalyst, Dataset, and DataFrame more general by moving
> them to other package(s). They are already used beyond SQL, e.g. in ML
> pipelines, and will be used by streaming also.
>
>
> Operation/Deployment
>
> 1. Scala 2.11 as the default build. We should still support Scala 2.10,
> but it has been end-of-life.
>
> 2. Remove Hadoop 1 support.
>
> 3. Assembly-free distribution of Spark: don’t require building an enormous
> assembly jar in order to run Spark.
>
>
>
>
>

Re: A proposal for Spark 2.0

Posted by Tom Graves <tg...@yahoo.com.INVALID>.
Do we have a summary of all the discussions and what is planned for 2.0 then?  Perhaps we should put it on the wiki for reference.
Tom 

    On Tuesday, December 22, 2015 12:12 AM, Reynold Xin <rx...@databricks.com> wrote:
 

 FYI I updated the master branch's Spark version to 2.0.0-SNAPSHOT. 
On Tue, Nov 10, 2015 at 3:10 PM, Reynold Xin <rx...@databricks.com> wrote:

I’m starting a new thread since the other one got intermixed with feature requests. Please refrain from making feature request in this thread. Not that we shouldn’t be adding features, but we can always add features in 1.7, 2.1, 2.2, ...
First - I want to propose a premise for how to think about Spark 2.0 and major releases in Spark, based on discussion with several members of the community: a major release should be low overhead and minimally disruptive to the Spark community. A major release should not be very different from a minor release and should not be gated based on new features. The main purpose of a major release is an opportunity to fix things that are broken in the current API and remove certain deprecated APIs (examples follow).
For this reason, I would *not* propose doing major releases to break substantial API's or perform large re-architecting that prevent users from upgrading. Spark has always had a culture of evolving architecture incrementally and making changes - and I don't think we want to change this model. In fact, we’ve released many architectural changes on the 1.X line.
If the community likes the above model, then to me it seems reasonable to do Spark 2.0 either after Spark 1.6 (in lieu of Spark 1.7) or immediately after Spark 1.7. It will be 18 or 21 months since Spark 1.0. A cadence of major releases every 2 years seems doable within the above model.
Under this model, here is a list of example things I would propose doing in Spark 2.0, separated into APIs and Operation/Deployment:

APIs
1. Remove interfaces, configs, and modules (e.g. Bagel) deprecated in Spark 1.x.
2. Remove Akka from Spark’s API dependency (in streaming), so user applications can use Akka (SPARK-5293). We have gotten a lot of complaints about user applications being unable to use Akka due to Spark’s dependency on Akka.
3. Remove Guava from Spark’s public API (JavaRDD Optional).
4. Better class package structure for low level developer API’s. In particular, we have some DeveloperApi (mostly various listener-related classes) added over the years. Some packages include only one or two public classes but a lot of private classes. A better structure is to have public classes isolated to a few public packages, and these public packages should have minimal private classes for low level developer APIs.
5. Consolidate task metric and accumulator API. Although having some subtle differences, these two are very similar but have completely different code path.
6. Possibly making Catalyst, Dataset, and DataFrame more general by moving them to other package(s). They are already used beyond SQL, e.g. in ML pipelines, and will be used by streaming also.

Operation/Deployment
1. Scala 2.11 as the default build. We should still support Scala 2.10, but it has been end-of-life.
2. Remove Hadoop 1 support. 
3. Assembly-free distribution of Spark: don’t require building an enormous assembly jar in order to run Spark.




  

Re: A proposal for Spark 2.0

Posted by Allen Zhang <al...@126.com>.

Thanks for your quick response. OK, I will start a new thread with my thoughts.


Thanks,
Allen





At 2015-12-22 15:19:49, "Reynold Xin" <rx...@databricks.com> wrote:

I'm not sure if we need special API support for GPUs. You can already use GPUs on individual executor nodes to build your own applications. If we want to leverage GPUs out of the box, I don't think the solution is to provide GPU specific APIs. Rather, we should just switch the underlying execution to GPUs when it is more optimal.


Anyway, I don't want to distract this topic, If you want to discuss more about GPUs, please start a new thread.




On Mon, Dec 21, 2015 at 11:18 PM, Allen Zhang <al...@126.com> wrote:

plus dev







On 2015-12-22 15:15:59, "Allen Zhang" <al...@126.com> wrote:

Hi Reynold,


Any new API support for GPU computing in our 2.0 new version ?


-Allen





On 2015-12-22 14:12:50, "Reynold Xin" <rx...@databricks.com> wrote:

FYI I updated the master branch's Spark version to 2.0.0-SNAPSHOT. 


On Tue, Nov 10, 2015 at 3:10 PM, Reynold Xin <rx...@databricks.com> wrote:

I’m starting a new thread since the other one got intermixed with feature requests. Please refrain from making feature request in this thread. Not that we shouldn’t be adding features, but we can always add features in 1.7, 2.1, 2.2, ...


First - I want to propose a premise for how to think about Spark 2.0 and major releases in Spark, based on discussion with several members of the community: a major release should be low overhead and minimally disruptive to the Spark community. A major release should not be very different from a minor release and should not be gated based on new features. The main purpose of a major release is an opportunity to fix things that are broken in the current API and remove certain deprecated APIs (examples follow).


For this reason, I would *not* propose doing major releases to break substantial API's or perform large re-architecting that prevent users from upgrading. Spark has always had a culture of evolving architecture incrementally and making changes - and I don't think we want to change this model. In fact, we’ve released many architectural changes on the 1.X line.


If the community likes the above model, then to me it seems reasonable to do Spark 2.0 either after Spark 1.6 (in lieu of Spark 1.7) or immediately after Spark 1.7. It will be 18 or 21 months since Spark 1.0. A cadence of major releases every 2 years seems doable within the above model.


Under this model, here is a list of example things I would propose doing in Spark 2.0, separated into APIs and Operation/Deployment:




APIs


1. Remove interfaces, configs, and modules (e.g. Bagel) deprecated in Spark 1.x.


2. Remove Akka from Spark’s API dependency (in streaming), so user applications can use Akka (SPARK-5293). We have gotten a lot of complaints about user applications being unable to use Akka due to Spark’s dependency on Akka.


3. Remove Guava from Spark’s public API (JavaRDD Optional).


4. Better class package structure for low level developer API’s. In particular, we have some DeveloperApi (mostly various listener-related classes) added over the years. Some packages include only one or two public classes but a lot of private classes. A better structure is to have public classes isolated to a few public packages, and these public packages should have minimal private classes for low level developer APIs.


5. Consolidate task metric and accumulator API. Although having some subtle differences, these two are very similar but have completely different code path.


6. Possibly making Catalyst, Dataset, and DataFrame more general by moving them to other package(s). They are already used beyond SQL, e.g. in ML pipelines, and will be used by streaming also.




Operation/Deployment


1. Scala 2.11 as the default build. We should still support Scala 2.10, but it has been end-of-life.


2. Remove Hadoop 1 support. 


3. Assembly-free distribution of Spark: don’t require building an enormous assembly jar in order to run Spark.








 





 



Re: A proposal for Spark 2.0

Posted by Reynold Xin <rx...@databricks.com>.
I'm not sure we need special API support for GPUs. You can already use
GPUs on individual executor nodes to build your own applications. If we
want to leverage GPUs out of the box, I don't think the solution is to
provide GPU-specific APIs; rather, we should just switch the underlying
execution to GPUs when it is more optimal.

Anyway, I don't want to distract from this topic. If you want to discuss
GPUs further, please start a new thread.
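
(As a quick aside before moving on, a rough illustration of the "use GPUs
on individual executor nodes" route with today's API -- NativeGpuKernel is
a hypothetical application-provided wrapper, e.g. over JNI or JCuda, and
nothing here is a Spark API:)

    import org.apache.spark.rdd.RDD

    object GpuSketch {
      // Stand-in for an application-provided native wrapper; not part of Spark.
      trait NativeGpuKernel extends Serializable {
        def scale(batch: Array[Float], factor: Float): Array[Float]
      }

      // Batch each partition into one array so the device is invoked once per
      // partition instead of once per record.
      def scaleOnGpu(input: RDD[Float], kernel: NativeGpuKernel, factor: Float): RDD[Float] =
        input.mapPartitions { iter =>
          val batch = iter.toArray
          kernel.scale(batch, factor).iterator
        }
    }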


On Mon, Dec 21, 2015 at 11:18 PM, Allen Zhang <al...@126.com> wrote:

> plus dev
>
>
>
>
>
>
> On 2015-12-22 15:15:59, "Allen Zhang" <al...@126.com> wrote:
>
> Hi Reynold,
>
> Any new API support for GPU computing in our 2.0 new version ?
>
> -Allen
>
>
>
>
> On 2015-12-22 14:12:50, "Reynold Xin" <rx...@databricks.com> wrote:
>
> FYI I updated the master branch's Spark version to 2.0.0-SNAPSHOT.
>
> On Tue, Nov 10, 2015 at 3:10 PM, Reynold Xin <rx...@databricks.com> wrote:
>
>> I’m starting a new thread since the other one got intermixed with feature
>> requests. Please refrain from making feature request in this thread. Not
>> that we shouldn’t be adding features, but we can always add features in
>> 1.7, 2.1, 2.2, ...
>>
>> First - I want to propose a premise for how to think about Spark 2.0 and
>> major releases in Spark, based on discussion with several members of the
>> community: a major release should be low overhead and minimally disruptive
>> to the Spark community. A major release should not be very different from a
>> minor release and should not be gated based on new features. The main
>> purpose of a major release is an opportunity to fix things that are broken
>> in the current API and remove certain deprecated APIs (examples follow).
>>
>> For this reason, I would *not* propose doing major releases to break
>> substantial API's or perform large re-architecting that prevent users from
>> upgrading. Spark has always had a culture of evolving architecture
>> incrementally and making changes - and I don't think we want to change this
>> model. In fact, we’ve released many architectural changes on the 1.X line.
>>
>> If the community likes the above model, then to me it seems reasonable to
>> do Spark 2.0 either after Spark 1.6 (in lieu of Spark 1.7) or immediately
>> after Spark 1.7. It will be 18 or 21 months since Spark 1.0. A cadence of
>> major releases every 2 years seems doable within the above model.
>>
>> Under this model, here is a list of example things I would propose doing
>> in Spark 2.0, separated into APIs and Operation/Deployment:
>>
>>
>> APIs
>>
>> 1. Remove interfaces, configs, and modules (e.g. Bagel) deprecated in
>> Spark 1.x.
>>
>> 2. Remove Akka from Spark’s API dependency (in streaming), so user
>> applications can use Akka (SPARK-5293). We have gotten a lot of complaints
>> about user applications being unable to use Akka due to Spark’s dependency
>> on Akka.
>>
>> 3. Remove Guava from Spark’s public API (JavaRDD Optional).
>>
>> 4. Better class package structure for low level developer API’s. In
>> particular, we have some DeveloperApi (mostly various listener-related
>> classes) added over the years. Some packages include only one or two public
>> classes but a lot of private classes. A better structure is to have public
>> classes isolated to a few public packages, and these public packages should
>> have minimal private classes for low level developer APIs.
>>
>> 5. Consolidate task metric and accumulator API. Although having some
>> subtle differences, these two are very similar but have completely
>> different code path.
>>
>> 6. Possibly making Catalyst, Dataset, and DataFrame more general by
>> moving them to other package(s). They are already used beyond SQL, e.g. in
>> ML pipelines, and will be used by streaming also.
>>
>>
>> Operation/Deployment
>>
>> 1. Scala 2.11 as the default build. We should still support Scala 2.10,
>> but it has been end-of-life.
>>
>> 2. Remove Hadoop 1 support.
>>
>> 3. Assembly-free distribution of Spark: don’t require building an
>> enormous assembly jar in order to run Spark.
>>
>>
>
>
>
>
>
>
>
>

Re: A proposal for Spark 2.0

Posted by Allen Zhang <al...@126.com>.
plus dev






On 2015-12-22 15:15:59, "Allen Zhang" <al...@126.com> wrote:

Hi Reynold,


Is there any new API support for GPU computing in the new 2.0 version?


-Allen





On 2015-12-22 14:12:50, "Reynold Xin" <rx...@databricks.com> wrote:

FYI I updated the master branch's Spark version to 2.0.0-SNAPSHOT. 


On Tue, Nov 10, 2015 at 3:10 PM, Reynold Xin <rx...@databricks.com> wrote:

I’m starting a new thread since the other one got intermixed with feature requests. Please refrain from making feature request in this thread. Not that we shouldn’t be adding features, but we can always add features in 1.7, 2.1, 2.2, ...


First - I want to propose a premise for how to think about Spark 2.0 and major releases in Spark, based on discussion with several members of the community: a major release should be low overhead and minimally disruptive to the Spark community. A major release should not be very different from a minor release and should not be gated based on new features. The main purpose of a major release is an opportunity to fix things that are broken in the current API and remove certain deprecated APIs (examples follow).


For this reason, I would *not* propose doing major releases to break substantial API's or perform large re-architecting that prevent users from upgrading. Spark has always had a culture of evolving architecture incrementally and making changes - and I don't think we want to change this model. In fact, we’ve released many architectural changes on the 1.X line.


If the community likes the above model, then to me it seems reasonable to do Spark 2.0 either after Spark 1.6 (in lieu of Spark 1.7) or immediately after Spark 1.7. It will be 18 or 21 months since Spark 1.0. A cadence of major releases every 2 years seems doable within the above model.


Under this model, here is a list of example things I would propose doing in Spark 2.0, separated into APIs and Operation/Deployment:




APIs


1. Remove interfaces, configs, and modules (e.g. Bagel) deprecated in Spark 1.x.


2. Remove Akka from Spark’s API dependency (in streaming), so user applications can use Akka (SPARK-5293). We have gotten a lot of complaints about user applications being unable to use Akka due to Spark’s dependency on Akka.


3. Remove Guava from Spark’s public API (JavaRDD Optional).


4. Better class package structure for low level developer API’s. In particular, we have some DeveloperApi (mostly various listener-related classes) added over the years. Some packages include only one or two public classes but a lot of private classes. A better structure is to have public classes isolated to a few public packages, and these public packages should have minimal private classes for low level developer APIs.


5. Consolidate task metric and accumulator API. Although having some subtle differences, these two are very similar but have completely different code path.


6. Possibly making Catalyst, Dataset, and DataFrame more general by moving them to other package(s). They are already used beyond SQL, e.g. in ML pipelines, and will be used by streaming also.




Operation/Deployment


1. Scala 2.11 as the default build. We should still support Scala 2.10, but it has been end-of-life.


2. Remove Hadoop 1 support. 


3. Assembly-free distribution of Spark: don’t require building an enormous assembly jar in order to run Spark.








 

Re: A proposal for Spark 2.0

Posted by Reynold Xin <rx...@databricks.com>.
FYI I updated the master branch's Spark version to 2.0.0-SNAPSHOT.

On Tue, Nov 10, 2015 at 3:10 PM, Reynold Xin <rx...@databricks.com> wrote:

> I’m starting a new thread since the other one got intermixed with feature
> requests. Please refrain from making feature request in this thread. Not
> that we shouldn’t be adding features, but we can always add features in
> 1.7, 2.1, 2.2, ...
>
> First - I want to propose a premise for how to think about Spark 2.0 and
> major releases in Spark, based on discussion with several members of the
> community: a major release should be low overhead and minimally disruptive
> to the Spark community. A major release should not be very different from a
> minor release and should not be gated based on new features. The main
> purpose of a major release is an opportunity to fix things that are broken
> in the current API and remove certain deprecated APIs (examples follow).
>
> For this reason, I would *not* propose doing major releases to break
> substantial API's or perform large re-architecting that prevent users from
> upgrading. Spark has always had a culture of evolving architecture
> incrementally and making changes - and I don't think we want to change this
> model. In fact, we’ve released many architectural changes on the 1.X line.
>
> If the community likes the above model, then to me it seems reasonable to
> do Spark 2.0 either after Spark 1.6 (in lieu of Spark 1.7) or immediately
> after Spark 1.7. It will be 18 or 21 months since Spark 1.0. A cadence of
> major releases every 2 years seems doable within the above model.
>
> Under this model, here is a list of example things I would propose doing
> in Spark 2.0, separated into APIs and Operation/Deployment:
>
>
> APIs
>
> 1. Remove interfaces, configs, and modules (e.g. Bagel) deprecated in
> Spark 1.x.
>
> 2. Remove Akka from Spark’s API dependency (in streaming), so user
> applications can use Akka (SPARK-5293). We have gotten a lot of complaints
> about user applications being unable to use Akka due to Spark’s dependency
> on Akka.
>
> 3. Remove Guava from Spark’s public API (JavaRDD Optional).
>
> 4. Better class package structure for low level developer API’s. In
> particular, we have some DeveloperApi (mostly various listener-related
> classes) added over the years. Some packages include only one or two public
> classes but a lot of private classes. A better structure is to have public
> classes isolated to a few public packages, and these public packages should
> have minimal private classes for low level developer APIs.
>
> 5. Consolidate task metric and accumulator API. Although having some
> subtle differences, these two are very similar but have completely
> different code path.
>
> 6. Possibly making Catalyst, Dataset, and DataFrame more general by moving
> them to other package(s). They are already used beyond SQL, e.g. in ML
> pipelines, and will be used by streaming also.
>
>
> Operation/Deployment
>
> 1. Scala 2.11 as the default build. We should still support Scala 2.10,
> but it has been end-of-life.
>
> 2. Remove Hadoop 1 support.
>
> 3. Assembly-free distribution of Spark: don’t require building an enormous
> assembly jar in order to run Spark.
>
>

Re: A proposal for Spark 2.0

Posted by Marcelo Vanzin <va...@cloudera.com>.
On Tue, Nov 10, 2015 at 6:51 PM, Reynold Xin <rx...@databricks.com> wrote:
> I think we are in agreement, although I wouldn't go to the extreme and say
> "a release with no new features might even be best."
>
> Can you elaborate "anticipatory changes"? A concrete example or so would be
> helpful.

I don't know if that's what Mark had in mind, but I'd count the
"remove Guava Optional from Java API" in that category. It would be
nice to have an alternative before that API is removed, although I
have no idea how you'd do it nicely, given that they're all in return
types (so overloading doesn't really work).
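
A small sketch of why overloading falls over here, with hypothetical
method and type names (this is not the actual Java API, and SparkOptional
is just a placeholder for whatever non-Guava type would replace it):

    import com.google.common.base.{Optional => GuavaOptional}

    // Placeholder for a non-Guava Optional the public API could expose instead.
    final case class SparkOptional[T](value: Option[T])

    abstract class JavaApiSketch {
      // 1.x style: Guava Optional baked into the return type.
      def getCheckpointFile(): GuavaOptional[String]

      // Illegal as an overload -- same name and parameter list, only the return
      // type differs, so the compiler rejects it:
      // def getCheckpointFile(): SparkOptional[String]

      // One possible interim alternative: a new method name that callers can
      // migrate to before the Guava-based method is removed.
      def checkpointFileOption(): SparkOptional[String]
    }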

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@spark.apache.org
For additional commands, e-mail: dev-help@spark.apache.org


Re: A proposal for Spark 2.0

Posted by Mark Hamstra <ma...@clearstorydata.com>.
To take a stab at an example of something concrete and anticipatory I can
go back to something I mentioned previously.  It's not really a good
example because I don't mean to imply that I believe that its premises are
true, but try to go with it.... If we were to decide that real-time,
event-based streaming is something that we really think we'll want to do in
the 2.x cycle and that the current API (after having deprecations removed
and clear mistakes/inadequacies remedied) isn't adequate to support that,
would we want to "take our best shot" at defining a new API at the outset
of 2.0?  Another way of looking at it is whether API changes in 2.0 should
be entirely backward-looking, trying to fix problems that we've already
identified or whether there is room for some forward-looking changes that
are intended to open new directions for Spark development.

On Tue, Nov 10, 2015 at 7:04 PM, Mark Hamstra <ma...@clearstorydata.com>
wrote:

> Heh... ok, I was intentionally pushing those bullet points to be extreme
> to find where people would start pushing back, and I'll agree that we do
> probably want some new features in 2.0 -- but I think we've got good
> agreement that new features aren't really the main point of doing a 2.0
> release.
>
> I don't really have a concrete example of an anticipatory change, and
> that's actually kind of the problem with trying to anticipate what we'll
> need in the way of new public API and the like: Until what we already have
> is clearly inadequate, it hard to concretely imagine how things really
> should be.  At this point I don't have anything specific where I can say "I
> really want to do __ with Spark in the future, and I think it should be
> changed in this way in 2.0 to allow me to do that."  I'm just wondering
> whether we want to even entertain those kinds of change requests if people
> have them, or whether we can just delay making those kinds of decisions
> until it is really obvious that what we have does't work and that there is
> clearly something better that should be done.
>
> On Tue, Nov 10, 2015 at 6:51 PM, Reynold Xin <rx...@databricks.com> wrote:
>
>> Mark,
>>
>> I think we are in agreement, although I wouldn't go to the extreme and
>> say "a release with no new features might even be best."
>>
>> Can you elaborate "anticipatory changes"? A concrete example or so would
>> be helpful.
>>
>> On Tue, Nov 10, 2015 at 5:19 PM, Mark Hamstra <ma...@clearstorydata.com>
>> wrote:
>>
>>> I'm liking the way this is shaping up, and I'd summarize it this way
>>> (let me know if I'm misunderstanding or misrepresenting anything):
>>>
>>>    - New features are not at all the focus of Spark 2.0 -- in fact, a
>>>    release with no new features might even be best.
>>>    - Remove deprecated API that we agree really should be deprecated.
>>>    - Fix/change publicly-visible things that anyone who has spent any
>>>    time looking at already knows are mistakes or should be done better, but
>>>    that can't be changed within 1.x.
>>>
>>> Do we want to attempt anticipatory changes at all?  In other words, are
>>> there things we want to do in 2.x for which we already know that we'll want
>>> to make publicly-visible changes or that, if we don't add or change it now,
>>> will fall into the "everybody knows it shouldn't be that way" category when
>>> it comes time to discuss the Spark 3.0 release?  I'd be fine if we don't
>>> try at all to anticipate what is needed -- working from the premise that
>>> being forced into a 3.x release earlier than we expect would be less
>>> painful than trying to back out a mistake made at the outset of 2.0 while
>>> trying to guess what we'll need.
>>>
>>> On Tue, Nov 10, 2015 at 3:10 PM, Reynold Xin <rx...@databricks.com>
>>> wrote:
>>>
>>>> I’m starting a new thread since the other one got intermixed with
>>>> feature requests. Please refrain from making feature request in this
>>>> thread. Not that we shouldn’t be adding features, but we can always add
>>>> features in 1.7, 2.1, 2.2, ...
>>>>
>>>> First - I want to propose a premise for how to think about Spark 2.0
>>>> and major releases in Spark, based on discussion with several members of
>>>> the community: a major release should be low overhead and minimally
>>>> disruptive to the Spark community. A major release should not be very
>>>> different from a minor release and should not be gated based on new
>>>> features. The main purpose of a major release is an opportunity to fix
>>>> things that are broken in the current API and remove certain deprecated
>>>> APIs (examples follow).
>>>>
>>>> For this reason, I would *not* propose doing major releases to break
>>>> substantial API's or perform large re-architecting that prevent users from
>>>> upgrading. Spark has always had a culture of evolving architecture
>>>> incrementally and making changes - and I don't think we want to change this
>>>> model. In fact, we’ve released many architectural changes on the 1.X line.
>>>>
>>>> If the community likes the above model, then to me it seems reasonable
>>>> to do Spark 2.0 either after Spark 1.6 (in lieu of Spark 1.7) or
>>>> immediately after Spark 1.7. It will be 18 or 21 months since Spark 1.0. A
>>>> cadence of major releases every 2 years seems doable within the above model.
>>>>
>>>> Under this model, here is a list of example things I would propose
>>>> doing in Spark 2.0, separated into APIs and Operation/Deployment:
>>>>
>>>>
>>>> APIs
>>>>
>>>> 1. Remove interfaces, configs, and modules (e.g. Bagel) deprecated in
>>>> Spark 1.x.
>>>>
>>>> 2. Remove Akka from Spark’s API dependency (in streaming), so user
>>>> applications can use Akka (SPARK-5293). We have gotten a lot of complaints
>>>> about user applications being unable to use Akka due to Spark’s dependency
>>>> on Akka.
>>>>
>>>> 3. Remove Guava from Spark’s public API (JavaRDD Optional).
>>>>
>>>> 4. Better class package structure for low level developer API’s. In
>>>> particular, we have some DeveloperApi (mostly various listener-related
>>>> classes) added over the years. Some packages include only one or two public
>>>> classes but a lot of private classes. A better structure is to have public
>>>> classes isolated to a few public packages, and these public packages should
>>>> have minimal private classes for low level developer APIs.
>>>>
>>>> 5. Consolidate task metric and accumulator API. Although having some
>>>> subtle differences, these two are very similar but have completely
>>>> different code path.
>>>>
>>>> 6. Possibly making Catalyst, Dataset, and DataFrame more general by
>>>> moving them to other package(s). They are already used beyond SQL, e.g. in
>>>> ML pipelines, and will be used by streaming also.
>>>>
>>>>
>>>> Operation/Deployment
>>>>
>>>> 1. Scala 2.11 as the default build. We should still support Scala 2.10,
>>>> but it has been end-of-life.
>>>>
>>>> 2. Remove Hadoop 1 support.
>>>>
>>>> 3. Assembly-free distribution of Spark: don’t require building an
>>>> enormous assembly jar in order to run Spark.
>>>>
>>>>
>>>
>>
>

Re: A proposal for Spark 2.0

Posted by Mark Hamstra <ma...@clearstorydata.com>.
Heh... ok, I was intentionally pushing those bullet points to be extreme to
find where people would start pushing back, and I'll agree that we do
probably want some new features in 2.0 -- but I think we've got good
agreement that new features aren't really the main point of doing a 2.0
release.

I don't really have a concrete example of an anticipatory change, and
that's actually kind of the problem with trying to anticipate what we'll
need in the way of new public API and the like: Until what we already have
is clearly inadequate, it's hard to concretely imagine how things really
should be.  At this point I don't have anything specific where I can say "I
really want to do __ with Spark in the future, and I think it should be
changed in this way in 2.0 to allow me to do that."  I'm just wondering
whether we want to even entertain those kinds of change requests if people
have them, or whether we can just delay making those kinds of decisions
until it is really obvious that what we have doesn't work and that there is
clearly something better that should be done.

On Tue, Nov 10, 2015 at 6:51 PM, Reynold Xin <rx...@databricks.com> wrote:

> Mark,
>
> I think we are in agreement, although I wouldn't go to the extreme and say
> "a release with no new features might even be best."
>
> Can you elaborate "anticipatory changes"? A concrete example or so would
> be helpful.
>
> On Tue, Nov 10, 2015 at 5:19 PM, Mark Hamstra <ma...@clearstorydata.com>
> wrote:
>
>> I'm liking the way this is shaping up, and I'd summarize it this way (let
>> me know if I'm misunderstanding or misrepresenting anything):
>>
>>    - New features are not at all the focus of Spark 2.0 -- in fact, a
>>    release with no new features might even be best.
>>    - Remove deprecated API that we agree really should be deprecated.
>>    - Fix/change publicly-visible things that anyone who has spent any
>>    time looking at already knows are mistakes or should be done better, but
>>    that can't be changed within 1.x.
>>
>> Do we want to attempt anticipatory changes at all?  In other words, are
>> there things we want to do in 2.x for which we already know that we'll want
>> to make publicly-visible changes or that, if we don't add or change it now,
>> will fall into the "everybody knows it shouldn't be that way" category when
>> it comes time to discuss the Spark 3.0 release?  I'd be fine if we don't
>> try at all to anticipate what is needed -- working from the premise that
>> being forced into a 3.x release earlier than we expect would be less
>> painful than trying to back out a mistake made at the outset of 2.0 while
>> trying to guess what we'll need.
>>
>> On Tue, Nov 10, 2015 at 3:10 PM, Reynold Xin <rx...@databricks.com> wrote:
>>
>>> I’m starting a new thread since the other one got intermixed with
>>> feature requests. Please refrain from making feature request in this
>>> thread. Not that we shouldn’t be adding features, but we can always add
>>> features in 1.7, 2.1, 2.2, ...
>>>
>>> First - I want to propose a premise for how to think about Spark 2.0 and
>>> major releases in Spark, based on discussion with several members of the
>>> community: a major release should be low overhead and minimally disruptive
>>> to the Spark community. A major release should not be very different from a
>>> minor release and should not be gated based on new features. The main
>>> purpose of a major release is an opportunity to fix things that are broken
>>> in the current API and remove certain deprecated APIs (examples follow).
>>>
>>> For this reason, I would *not* propose doing major releases to break
>>> substantial API's or perform large re-architecting that prevent users from
>>> upgrading. Spark has always had a culture of evolving architecture
>>> incrementally and making changes - and I don't think we want to change this
>>> model. In fact, we’ve released many architectural changes on the 1.X line.
>>>
>>> If the community likes the above model, then to me it seems reasonable
>>> to do Spark 2.0 either after Spark 1.6 (in lieu of Spark 1.7) or
>>> immediately after Spark 1.7. It will be 18 or 21 months since Spark 1.0. A
>>> cadence of major releases every 2 years seems doable within the above model.
>>>
>>> Under this model, here is a list of example things I would propose doing
>>> in Spark 2.0, separated into APIs and Operation/Deployment:
>>>
>>>
>>> APIs
>>>
>>> 1. Remove interfaces, configs, and modules (e.g. Bagel) deprecated in
>>> Spark 1.x.
>>>
>>> 2. Remove Akka from Spark’s API dependency (in streaming), so user
>>> applications can use Akka (SPARK-5293). We have gotten a lot of complaints
>>> about user applications being unable to use Akka due to Spark’s dependency
>>> on Akka.
>>>
>>> 3. Remove Guava from Spark’s public API (JavaRDD Optional).
>>>
>>> 4. Better class package structure for low level developer API’s. In
>>> particular, we have some DeveloperApi (mostly various listener-related
>>> classes) added over the years. Some packages include only one or two public
>>> classes but a lot of private classes. A better structure is to have public
>>> classes isolated to a few public packages, and these public packages should
>>> have minimal private classes for low level developer APIs.
>>>
>>> 5. Consolidate task metric and accumulator API. Although having some
>>> subtle differences, these two are very similar but have completely
>>> different code path.
>>>
>>> 6. Possibly making Catalyst, Dataset, and DataFrame more general by
>>> moving them to other package(s). They are already used beyond SQL, e.g. in
>>> ML pipelines, and will be used by streaming also.
>>>
>>>
>>> Operation/Deployment
>>>
>>> 1. Scala 2.11 as the default build. We should still support Scala 2.10,
>>> but it has been end-of-life.
>>>
>>> 2. Remove Hadoop 1 support.
>>>
>>> 3. Assembly-free distribution of Spark: don’t require building an
>>> enormous assembly jar in order to run Spark.
>>>
>>>
>>
>

Re: A proposal for Spark 2.0

Posted by Reynold Xin <rx...@databricks.com>.
Mark,

I think we are in agreement, although I wouldn't go to the extreme and say
"a release with no new features might even be best."

Can you elaborate on "anticipatory changes"? A concrete example or so would be
helpful.

On Tue, Nov 10, 2015 at 5:19 PM, Mark Hamstra <ma...@clearstorydata.com>
wrote:

> I'm liking the way this is shaping up, and I'd summarize it this way (let
> me know if I'm misunderstanding or misrepresenting anything):
>
>    - New features are not at all the focus of Spark 2.0 -- in fact, a
>    release with no new features might even be best.
>    - Remove deprecated API that we agree really should be deprecated.
>    - Fix/change publicly-visible things that anyone who has spent any
>    time looking at already knows are mistakes or should be done better, but
>    that can't be changed within 1.x.
>
> Do we want to attempt anticipatory changes at all?  In other words, are
> there things we want to do in 2.x for which we already know that we'll want
> to make publicly-visible changes or that, if we don't add or change it now,
> will fall into the "everybody knows it shouldn't be that way" category when
> it comes time to discuss the Spark 3.0 release?  I'd be fine if we don't
> try at all to anticipate what is needed -- working from the premise that
> being forced into a 3.x release earlier than we expect would be less
> painful than trying to back out a mistake made at the outset of 2.0 while
> trying to guess what we'll need.
>
> On Tue, Nov 10, 2015 at 3:10 PM, Reynold Xin <rx...@databricks.com> wrote:
>
>> I’m starting a new thread since the other one got intermixed with feature
>> requests. Please refrain from making feature request in this thread. Not
>> that we shouldn’t be adding features, but we can always add features in
>> 1.7, 2.1, 2.2, ...
>>
>> First - I want to propose a premise for how to think about Spark 2.0 and
>> major releases in Spark, based on discussion with several members of the
>> community: a major release should be low overhead and minimally disruptive
>> to the Spark community. A major release should not be very different from a
>> minor release and should not be gated based on new features. The main
>> purpose of a major release is an opportunity to fix things that are broken
>> in the current API and remove certain deprecated APIs (examples follow).
>>
>> For this reason, I would *not* propose doing major releases to break
>> substantial API's or perform large re-architecting that prevent users from
>> upgrading. Spark has always had a culture of evolving architecture
>> incrementally and making changes - and I don't think we want to change this
>> model. In fact, we’ve released many architectural changes on the 1.X line.
>>
>> If the community likes the above model, then to me it seems reasonable to
>> do Spark 2.0 either after Spark 1.6 (in lieu of Spark 1.7) or immediately
>> after Spark 1.7. It will be 18 or 21 months since Spark 1.0. A cadence of
>> major releases every 2 years seems doable within the above model.
>>
>> Under this model, here is a list of example things I would propose doing
>> in Spark 2.0, separated into APIs and Operation/Deployment:
>>
>>
>> APIs
>>
>> 1. Remove interfaces, configs, and modules (e.g. Bagel) deprecated in
>> Spark 1.x.
>>
>> 2. Remove Akka from Spark’s API dependency (in streaming), so user
>> applications can use Akka (SPARK-5293). We have gotten a lot of complaints
>> about user applications being unable to use Akka due to Spark’s dependency
>> on Akka.
>>
>> 3. Remove Guava from Spark’s public API (JavaRDD Optional).
>>
>> 4. Better class package structure for low level developer API’s. In
>> particular, we have some DeveloperApi (mostly various listener-related
>> classes) added over the years. Some packages include only one or two public
>> classes but a lot of private classes. A better structure is to have public
>> classes isolated to a few public packages, and these public packages should
>> have minimal private classes for low level developer APIs.
>>
>> 5. Consolidate task metric and accumulator API. Although having some
>> subtle differences, these two are very similar but have completely
>> different code path.
>>
>> 6. Possibly making Catalyst, Dataset, and DataFrame more general by
>> moving them to other package(s). They are already used beyond SQL, e.g. in
>> ML pipelines, and will be used by streaming also.
>>
>>
>> Operation/Deployment
>>
>> 1. Scala 2.11 as the default build. We should still support Scala 2.10,
>> but it has been end-of-life.
>>
>> 2. Remove Hadoop 1 support.
>>
>> 3. Assembly-free distribution of Spark: don’t require building an
>> enormous assembly jar in order to run Spark.
>>
>>
>

Re: A proposal for Spark 2.0

Posted by Mark Hamstra <ma...@clearstorydata.com>.
I'm liking the way this is shaping up, and I'd summarize it this way (let
me know if I'm misunderstanding or misrepresenting anything):

   - New features are not at all the focus of Spark 2.0 -- in fact, a
   release with no new features might even be best.
   - Remove deprecated API that we agree really should be deprecated.
   - Fix/change publicly-visible things that anyone who has spent any time
   looking at already knows are mistakes or should be done better, but that
   can't be changed within 1.x.

Do we want to attempt anticipatory changes at all?  In other words, are
there things we want to do in 2.x for which we already know that we'll want
to make publicly-visible changes or that, if we don't add or change it now,
will fall into the "everybody knows it shouldn't be that way" category when
it comes time to discuss the Spark 3.0 release?  I'd be fine if we don't
try at all to anticipate what is needed -- working from the premise that
being forced into a 3.x release earlier than we expect would be less
painful than trying to back out a mistake made at the outset of 2.0 while
trying to guess what we'll need.

On Tue, Nov 10, 2015 at 3:10 PM, Reynold Xin <rx...@databricks.com> wrote:

> I’m starting a new thread since the other one got intermixed with feature
> requests. Please refrain from making feature request in this thread. Not
> that we shouldn’t be adding features, but we can always add features in
> 1.7, 2.1, 2.2, ...
>
> First - I want to propose a premise for how to think about Spark 2.0 and
> major releases in Spark, based on discussion with several members of the
> community: a major release should be low overhead and minimally disruptive
> to the Spark community. A major release should not be very different from a
> minor release and should not be gated based on new features. The main
> purpose of a major release is an opportunity to fix things that are broken
> in the current API and remove certain deprecated APIs (examples follow).
>
> For this reason, I would *not* propose doing major releases to break
> substantial API's or perform large re-architecting that prevent users from
> upgrading. Spark has always had a culture of evolving architecture
> incrementally and making changes - and I don't think we want to change this
> model. In fact, we’ve released many architectural changes on the 1.X line.
>
> If the community likes the above model, then to me it seems reasonable to
> do Spark 2.0 either after Spark 1.6 (in lieu of Spark 1.7) or immediately
> after Spark 1.7. It will be 18 or 21 months since Spark 1.0. A cadence of
> major releases every 2 years seems doable within the above model.
>
> Under this model, here is a list of example things I would propose doing
> in Spark 2.0, separated into APIs and Operation/Deployment:
>
>
> APIs
>
> 1. Remove interfaces, configs, and modules (e.g. Bagel) deprecated in
> Spark 1.x.
>
> 2. Remove Akka from Spark’s API dependency (in streaming), so user
> applications can use Akka (SPARK-5293). We have gotten a lot of complaints
> about user applications being unable to use Akka due to Spark’s dependency
> on Akka.
>
> 3. Remove Guava from Spark’s public API (JavaRDD Optional).
>
> 4. Better class package structure for low level developer API’s. In
> particular, we have some DeveloperApi (mostly various listener-related
> classes) added over the years. Some packages include only one or two public
> classes but a lot of private classes. A better structure is to have public
> classes isolated to a few public packages, and these public packages should
> have minimal private classes for low level developer APIs.
>
> 5. Consolidate task metric and accumulator API. Although having some
> subtle differences, these two are very similar but have completely
> different code path.
>
> 6. Possibly making Catalyst, Dataset, and DataFrame more general by moving
> them to other package(s). They are already used beyond SQL, e.g. in ML
> pipelines, and will be used by streaming also.
>
>
> Operation/Deployment
>
> 1. Scala 2.11 as the default build. We should still support Scala 2.10,
> but it has been end-of-life.
>
> 2. Remove Hadoop 1 support.
>
> 3. Assembly-free distribution of Spark: don’t require building an enormous
> assembly jar in order to run Spark.
>
>

Re: A proposal for Spark 2.0

Posted by Jean-Baptiste Onofré <jb...@nanthrax.net>.
Agree, it makes sense.

Regards
JB

On 11/11/2015 01:28 AM, Reynold Xin wrote:
> Echoing Shivaram here. I don't think it makes a lot of sense to add more
> features to the 1.x line. We should still do critical bug fixes though.
>
>
> On Tue, Nov 10, 2015 at 4:23 PM, Shivaram Venkataraman
> <shivaram@eecs.berkeley.edu <ma...@eecs.berkeley.edu>> wrote:
>
>     +1
>
>     On a related note I think making it lightweight will ensure that we
>     stay on the current release schedule and don't unnecessarily delay 2.0
>     to wait for new features / big architectural changes.
>
>     In terms of fixes to 1.x, I think our current policy of back-porting
>     fixes to older releases would still apply. I don't think developing
>     new features on both 1.x and 2.x makes a lot of sense as we would like
>     users to switch to 2.x.
>
>     Shivaram
>
>     On Tue, Nov 10, 2015 at 4:02 PM, Kostas Sakellis
>     <kostas@cloudera.com <ma...@cloudera.com>> wrote:
>      > +1 on a lightweight 2.0
>      >
>      > What is the thinking around the 1.x line after Spark 2.0 is
>     released? If not
>      > terminated, how will we determine what goes into each major
>     version line?
>      > Will 1.x only be for stability fixes?
>      >
>      > Thanks,
>      > Kostas
>      >
>      > On Tue, Nov 10, 2015 at 3:41 PM, Patrick Wendell
>     <pwendell@gmail.com <ma...@gmail.com>> wrote:
>      >>
>      >> I also feel the same as Reynold. I agree we should minimize API
>     breaks and
>      >> focus on fixing things around the edge that were mistakes (e.g.
>     exposing
>      >> Guava and Akka) rather than any overhaul that could fragment the
>     community.
>      >> Ideally a major release is a lightweight process we can do every
>     couple of
>      >> years, with minimal impact for users.
>      >>
>      >> - Patrick
>      >>
>      >> On Tue, Nov 10, 2015 at 3:35 PM, Nicholas Chammas
>      >> <nicholas.chammas@gmail.com <ma...@gmail.com>>
>     wrote:
>      >>>
>      >>> > For this reason, I would *not* propose doing major releases
>     to break
>      >>> > substantial API's or perform large re-architecting that
>     prevent users from
>      >>> > upgrading. Spark has always had a culture of evolving
>     architecture
>      >>> > incrementally and making changes - and I don't think we want
>     to change this
>      >>> > model.
>      >>>
>      >>> +1 for this. The Python community went through a lot of turmoil
>     over the
>      >>> Python 2 -> Python 3 transition because the upgrade process was
>     too painful
>      >>> for too long. The Spark community will benefit greatly from our
>     explicitly
>      >>> looking to avoid a similar situation.
>      >>>
>      >>> > 3. Assembly-free distribution of Spark: don’t require building an
>      >>> > enormous assembly jar in order to run Spark.
>      >>>
>      >>> Could you elaborate a bit on this? I'm not sure what an
>     assembly-free
>      >>> distribution means.
>      >>>
>      >>> Nick
>      >>>
>      >>> On Tue, Nov 10, 2015 at 6:11 PM Reynold Xin
>     <rxin@databricks.com <ma...@databricks.com>> wrote:
>      >>>>
>      >>>> I’m starting a new thread since the other one got intermixed with
>      >>>> feature requests. Please refrain from making feature request
>     in this thread.
>      >>>> Not that we shouldn’t be adding features, but we can always
>     add features in
>      >>>> 1.7, 2.1, 2.2, ...
>      >>>>
>      >>>> First - I want to propose a premise for how to think about
>     Spark 2.0 and
>      >>>> major releases in Spark, based on discussion with several
>     members of the
>      >>>> community: a major release should be low overhead and
>     minimally disruptive
>      >>>> to the Spark community. A major release should not be very
>     different from a
>      >>>> minor release and should not be gated based on new features.
>     The main
>      >>>> purpose of a major release is an opportunity to fix things
>     that are broken
>      >>>> in the current API and remove certain deprecated APIs
>     (examples follow).
>      >>>>
>      >>>> For this reason, I would *not* propose doing major releases to
>     break
>      >>>> substantial API's or perform large re-architecting that
>     prevent users from
>      >>>> upgrading. Spark has always had a culture of evolving architecture
>      >>>> incrementally and making changes - and I don't think we want
>     to change this
>      >>>> model. In fact, we’ve released many architectural changes on
>     the 1.X line.
>      >>>>
>      >>>> If the community likes the above model, then to me it seems
>     reasonable
>      >>>> to do Spark 2.0 either after Spark 1.6 (in lieu of Spark 1.7)
>     or immediately
>      >>>> after Spark 1.7. It will be 18 or 21 months since Spark 1.0. A
>     cadence of
>      >>>> major releases every 2 years seems doable within the above model.
>      >>>>
>      >>>> Under this model, here is a list of example things I would
>     propose doing
>      >>>> in Spark 2.0, separated into APIs and Operation/Deployment:
>      >>>>
>      >>>>
>      >>>> APIs
>      >>>>
>      >>>> 1. Remove interfaces, configs, and modules (e.g. Bagel)
>     deprecated in
>      >>>> Spark 1.x.
>      >>>>
>      >>>> 2. Remove Akka from Spark’s API dependency (in streaming), so user
>      >>>> applications can use Akka (SPARK-5293). We have gotten a lot
>     of complaints
>      >>>> about user applications being unable to use Akka due to
>     Spark’s dependency
>      >>>> on Akka.
>      >>>>
>      >>>> 3. Remove Guava from Spark’s public API (JavaRDD Optional).
>      >>>>
>      >>>> 4. Better class package structure for low level developer
>     API’s. In
>      >>>> particular, we have some DeveloperApi (mostly various
>     listener-related
>      >>>> classes) added over the years. Some packages include only one
>     or two public
>      >>>> classes but a lot of private classes. A better structure is to
>     have public
>      >>>> classes isolated to a few public packages, and these public
>     packages should
>      >>>> have minimal private classes for low level developer APIs.
>      >>>>
>      >>>> 5. Consolidate task metric and accumulator API. Although
>     having some
>      >>>> subtle differences, these two are very similar but have
>     completely different
>      >>>> code path.
>      >>>>
>      >>>> 6. Possibly making Catalyst, Dataset, and DataFrame more
>     general by
>      >>>> moving them to other package(s). They are already used beyond
>     SQL, e.g. in
>      >>>> ML pipelines, and will be used by streaming also.
>      >>>>
>      >>>>
>      >>>> Operation/Deployment
>      >>>>
>      >>>> 1. Scala 2.11 as the default build. We should still support
>     Scala 2.10,
>      >>>> but it has been end-of-life.
>      >>>>
>      >>>> 2. Remove Hadoop 1 support.
>      >>>>
>      >>>> 3. Assembly-free distribution of Spark: don’t require building an
>      >>>> enormous assembly jar in order to run Spark.
>      >>>>
>      >>
>      >
>
>

-- 
Jean-Baptiste Onofré
jbonofre@apache.org
http://blog.nanthrax.net
Talend - http://www.talend.com

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@spark.apache.org
For additional commands, e-mail: dev-help@spark.apache.org


Re: A proposal for Spark 2.0

Posted by Sandy Ryza <sa...@cloudera.com>.
Oh and another question - should Spark 2.0 support Java 7?

On Tue, Nov 10, 2015 at 4:53 PM, Sandy Ryza <sa...@cloudera.com> wrote:

> Another +1 to Reynold's proposal.
>
> Maybe this is obvious, but I'd like to advocate against a blanket removal
> of deprecated / developer APIs.  Many APIs can likely be removed without
> material impact (e.g. the SparkContext constructor that takes preferred
> node location data), while others likely see heavier usage (e.g. I wouldn't
> be surprised if mapPartitionsWithContext was baked into a number of apps)
> and merit a little extra consideration.
>
> Maybe also obvious, but I think a migration guide with API equivalents and
> the like would be incredibly useful in easing the transition.
>
> -Sandy
>
> On Tue, Nov 10, 2015 at 4:28 PM, Reynold Xin <rx...@databricks.com> wrote:
>
>> Echoing Shivaram here. I don't think it makes a lot of sense to add more
>> features to the 1.x line. We should still do critical bug fixes though.
>>
>>
>> On Tue, Nov 10, 2015 at 4:23 PM, Shivaram Venkataraman <
>> shivaram@eecs.berkeley.edu> wrote:
>>
>>> +1
>>>
>>> On a related note I think making it lightweight will ensure that we
>>> stay on the current release schedule and don't unnecessarily delay 2.0
>>> to wait for new features / big architectural changes.
>>>
>>> In terms of fixes to 1.x, I think our current policy of back-porting
>>> fixes to older releases would still apply. I don't think developing
>>> new features on both 1.x and 2.x makes a lot of sense as we would like
>>> users to switch to 2.x.
>>>
>>> Shivaram
>>>
>>> On Tue, Nov 10, 2015 at 4:02 PM, Kostas Sakellis <ko...@cloudera.com>
>>> wrote:
>>> > +1 on a lightweight 2.0
>>> >
>>> > What is the thinking around the 1.x line after Spark 2.0 is released?
>>> If not
>>> > terminated, how will we determine what goes into each major version
>>> line?
>>> > Will 1.x only be for stability fixes?
>>> >
>>> > Thanks,
>>> > Kostas
>>> >
>>> > On Tue, Nov 10, 2015 at 3:41 PM, Patrick Wendell <pw...@gmail.com>
>>> wrote:
>>> >>
>>> >> I also feel the same as Reynold. I agree we should minimize API
>>> breaks and
>>> >> focus on fixing things around the edge that were mistakes (e.g.
>>> exposing
>>> >> Guava and Akka) rather than any overhaul that could fragment the
>>> community.
>>> >> Ideally a major release is a lightweight process we can do every
>>> couple of
>>> >> years, with minimal impact for users.
>>> >>
>>> >> - Patrick
>>> >>
>>> >> On Tue, Nov 10, 2015 at 3:35 PM, Nicholas Chammas
>>> >> <ni...@gmail.com> wrote:
>>> >>>
>>> >>> > For this reason, I would *not* propose doing major releases to
>>> break
>>> >>> > substantial API's or perform large re-architecting that prevent
>>> users from
>>> >>> > upgrading. Spark has always had a culture of evolving architecture
>>> >>> > incrementally and making changes - and I don't think we want to
>>> change this
>>> >>> > model.
>>> >>>
>>> >>> +1 for this. The Python community went through a lot of turmoil over
>>> the
>>> >>> Python 2 -> Python 3 transition because the upgrade process was too
>>> painful
>>> >>> for too long. The Spark community will benefit greatly from our
>>> explicitly
>>> >>> looking to avoid a similar situation.
>>> >>>
>>> >>> > 3. Assembly-free distribution of Spark: don’t require building an
>>> >>> > enormous assembly jar in order to run Spark.
>>> >>>
>>> >>> Could you elaborate a bit on this? I'm not sure what an assembly-free
>>> >>> distribution means.
>>> >>>
>>> >>> Nick
>>> >>>
>>> >>> On Tue, Nov 10, 2015 at 6:11 PM Reynold Xin <rx...@databricks.com>
>>> wrote:
>>> >>>>
>>> >>>> I’m starting a new thread since the other one got intermixed with
>>> >>>> feature requests. Please refrain from making feature request in
>>> this thread.
>>> >>>> Not that we shouldn’t be adding features, but we can always add
>>> features in
>>> >>>> 1.7, 2.1, 2.2, ...
>>> >>>>
>>> >>>> First - I want to propose a premise for how to think about Spark
>>> 2.0 and
>>> >>>> major releases in Spark, based on discussion with several members
>>> of the
>>> >>>> community: a major release should be low overhead and minimally
>>> disruptive
>>> >>>> to the Spark community. A major release should not be very
>>> different from a
>>> >>>> minor release and should not be gated based on new features. The
>>> main
>>> >>>> purpose of a major release is an opportunity to fix things that are
>>> broken
>>> >>>> in the current API and remove certain deprecated APIs (examples
>>> follow).
>>> >>>>
>>> >>>> For this reason, I would *not* propose doing major releases to break
>>> >>>> substantial API's or perform large re-architecting that prevent
>>> users from
>>> >>>> upgrading. Spark has always had a culture of evolving architecture
>>> >>>> incrementally and making changes - and I don't think we want to
>>> change this
>>> >>>> model. In fact, we’ve released many architectural changes on the
>>> 1.X line.
>>> >>>>
>>> >>>> If the community likes the above model, then to me it seems
>>> reasonable
>>> >>>> to do Spark 2.0 either after Spark 1.6 (in lieu of Spark 1.7) or
>>> immediately
>>> >>>> after Spark 1.7. It will be 18 or 21 months since Spark 1.0. A
>>> cadence of
>>> >>>> major releases every 2 years seems doable within the above model.
>>> >>>>
>>> >>>> Under this model, here is a list of example things I would propose
>>> doing
>>> >>>> in Spark 2.0, separated into APIs and Operation/Deployment:
>>> >>>>
>>> >>>>
>>> >>>> APIs
>>> >>>>
>>> >>>> 1. Remove interfaces, configs, and modules (e.g. Bagel) deprecated
>>> in
>>> >>>> Spark 1.x.
>>> >>>>
>>> >>>> 2. Remove Akka from Spark’s API dependency (in streaming), so user
>>> >>>> applications can use Akka (SPARK-5293). We have gotten a lot of
>>> complaints
>>> >>>> about user applications being unable to use Akka due to Spark’s
>>> dependency
>>> >>>> on Akka.
>>> >>>>
>>> >>>> 3. Remove Guava from Spark’s public API (JavaRDD Optional).
>>> >>>>
>>> >>>> 4. Better class package structure for low level developer API’s. In
>>> >>>> particular, we have some DeveloperApi (mostly various
>>> listener-related
>>> >>>> classes) added over the years. Some packages include only one or
>>> two public
>>> >>>> classes but a lot of private classes. A better structure is to have
>>> public
>>> >>>> classes isolated to a few public packages, and these public
>>> packages should
>>> >>>> have minimal private classes for low level developer APIs.
>>> >>>>
>>> >>>> 5. Consolidate task metric and accumulator API. Although having some
>>> >>>> subtle differences, these two are very similar but have completely
>>> different
>>> >>>> code path.
>>> >>>>
>>> >>>> 6. Possibly making Catalyst, Dataset, and DataFrame more general by
>>> >>>> moving them to other package(s). They are already used beyond SQL,
>>> e.g. in
>>> >>>> ML pipelines, and will be used by streaming also.
>>> >>>>
>>> >>>>
>>> >>>> Operation/Deployment
>>> >>>>
>>> >>>> 1. Scala 2.11 as the default build. We should still support Scala
>>> 2.10,
>>> >>>> but it has been end-of-life.
>>> >>>>
>>> >>>> 2. Remove Hadoop 1 support.
>>> >>>>
>>> >>>> 3. Assembly-free distribution of Spark: don’t require building an
>>> >>>> enormous assembly jar in order to run Spark.
>>> >>>>
>>> >>
>>> >
>>>
>>
>>
>

Re: A proposal for Spark 2.0

Posted by Sudhir Menon <sm...@pivotal.io>.
Agree. If it is deprecated, get rid of it in 2.0.
If the deprecation was a mistake, let's fix that.

Suds
Sent from my iPhone

On Nov 10, 2015, at 5:04 PM, Reynold Xin <rx...@databricks.com> wrote:

Maybe a better idea is to un-deprecate an API if it is too important to be
removed.

I don't think we can drop Java 7 support. It's way too soon.



On Tue, Nov 10, 2015 at 4:59 PM, Mark Hamstra <ma...@clearstorydata.com>
wrote:

> Really, Sandy?  "Extra consideration" even for already-deprecated API?  If
> we're not going to remove these with a major version change, then just when
> will we remove them?
>
> On Tue, Nov 10, 2015 at 4:53 PM, Sandy Ryza <sa...@cloudera.com>
> wrote:
>
>> Another +1 to Reynold's proposal.
>>
>> Maybe this is obvious, but I'd like to advocate against a blanket removal
>> of deprecated / developer APIs.  Many APIs can likely be removed without
>> material impact (e.g. the SparkContext constructor that takes preferred
>> node location data), while others likely see heavier usage (e.g. I wouldn't
>> be surprised if mapPartitionsWithContext was baked into a number of apps)
>> and merit a little extra consideration.
>>
>> Maybe also obvious, but I think a migration guide with API equivalents and
>> the like would be incredibly useful in easing the transition.
>>
>> -Sandy
>>
>> On Tue, Nov 10, 2015 at 4:28 PM, Reynold Xin <rx...@databricks.com> wrote:
>>
>>> Echoing Shivaram here. I don't think it makes a lot of sense to add more
>>> features to the 1.x line. We should still do critical bug fixes though.
>>>
>>>
>>> On Tue, Nov 10, 2015 at 4:23 PM, Shivaram Venkataraman <
>>> shivaram@eecs.berkeley.edu> wrote:
>>>
>>>> +1
>>>>
>>>> On a related note I think making it lightweight will ensure that we
>>>> stay on the current release schedule and don't unnecessarily delay 2.0
>>>> to wait for new features / big architectural changes.
>>>>
>>>> In terms of fixes to 1.x, I think our current policy of back-porting
>>>> fixes to older releases would still apply. I don't think developing
>>>> new features on both 1.x and 2.x makes a lot of sense as we would like
>>>> users to switch to 2.x.
>>>>
>>>> Shivaram
>>>>
>>>> On Tue, Nov 10, 2015 at 4:02 PM, Kostas Sakellis <ko...@cloudera.com>
>>>> wrote:
>>>> > +1 on a lightweight 2.0
>>>> >
>>>> > What is the thinking around the 1.x line after Spark 2.0 is released?
>>>> If not
>>>> > terminated, how will we determine what goes into each major version
>>>> line?
>>>> > Will 1.x only be for stability fixes?
>>>> >
>>>> > Thanks,
>>>> > Kostas
>>>> >
>>>> > On Tue, Nov 10, 2015 at 3:41 PM, Patrick Wendell <pw...@gmail.com>
>>>> wrote:
>>>> >>
>>>> >> I also feel the same as Reynold. I agree we should minimize API
>>>> breaks and
>>>> >> focus on fixing things around the edge that were mistakes (e.g.
>>>> exposing
>>>> >> Guava and Akka) rather than any overhaul that could fragment the
>>>> community.
>>>> >> Ideally a major release is a lightweight process we can do every
>>>> couple of
>>>> >> years, with minimal impact for users.
>>>> >>
>>>> >> - Patrick
>>>> >>
>>>> >> On Tue, Nov 10, 2015 at 3:35 PM, Nicholas Chammas
>>>> >> <ni...@gmail.com> wrote:
>>>> >>>
>>>> >>> > For this reason, I would *not* propose doing major releases to
>>>> break
>>>> >>> > substantial API's or perform large re-architecting that prevent
>>>> users from
>>>> >>> > upgrading. Spark has always had a culture of evolving architecture
>>>> >>> > incrementally and making changes - and I don't think we want to
>>>> change this
>>>> >>> > model.
>>>> >>>
>>>> >>> +1 for this. The Python community went through a lot of turmoil
>>>> over the
>>>> >>> Python 2 -> Python 3 transition because the upgrade process was too
>>>> painful
>>>> >>> for too long. The Spark community will benefit greatly from our
>>>> explicitly
>>>> >>> looking to avoid a similar situation.
>>>> >>>
>>>> >>> > 3. Assembly-free distribution of Spark: don’t require building an
>>>> >>> > enormous assembly jar in order to run Spark.
>>>> >>>
>>>> >>> Could you elaborate a bit on this? I'm not sure what an
>>>> assembly-free
>>>> >>> distribution means.
>>>> >>>
>>>> >>> Nick
>>>> >>>
>>>> >>> On Tue, Nov 10, 2015 at 6:11 PM Reynold Xin <rx...@databricks.com>
>>>> wrote:
>>>> >>>>
>>>> >>>> I’m starting a new thread since the other one got intermixed with
>>>> >>>> feature requests. Please refrain from making feature request in
>>>> this thread.
>>>> >>>> Not that we shouldn’t be adding features, but we can always add
>>>> features in
>>>> >>>> 1.7, 2.1, 2.2, ...
>>>> >>>>
>>>> >>>> First - I want to propose a premise for how to think about Spark
>>>> 2.0 and
>>>> >>>> major releases in Spark, based on discussion with several members
>>>> of the
>>>> >>>> community: a major release should be low overhead and minimally
>>>> disruptive
>>>> >>>> to the Spark community. A major release should not be very
>>>> different from a
>>>> >>>> minor release and should not be gated based on new features. The
>>>> main
>>>> >>>> purpose of a major release is an opportunity to fix things that
>>>> are broken
>>>> >>>> in the current API and remove certain deprecated APIs (examples
>>>> follow).
>>>> >>>>
>>>> >>>> For this reason, I would *not* propose doing major releases to
>>>> break
>>>> >>>> substantial API's or perform large re-architecting that prevent
>>>> users from
>>>> >>>> upgrading. Spark has always had a culture of evolving architecture
>>>> >>>> incrementally and making changes - and I don't think we want to
>>>> change this
>>>> >>>> model. In fact, we’ve released many architectural changes on the
>>>> 1.X line.
>>>> >>>>
>>>> >>>> If the community likes the above model, then to me it seems
>>>> reasonable
>>>> >>>> to do Spark 2.0 either after Spark 1.6 (in lieu of Spark 1.7) or
>>>> immediately
>>>> >>>> after Spark 1.7. It will be 18 or 21 months since Spark 1.0. A
>>>> cadence of
>>>> >>>> major releases every 2 years seems doable within the above model.
>>>> >>>>
>>>> >>>> Under this model, here is a list of example things I would propose
>>>> doing
>>>> >>>> in Spark 2.0, separated into APIs and Operation/Deployment:
>>>> >>>>
>>>> >>>>
>>>> >>>> APIs
>>>> >>>>
>>>> >>>> 1. Remove interfaces, configs, and modules (e.g. Bagel) deprecated
>>>> in
>>>> >>>> Spark 1.x.
>>>> >>>>
>>>> >>>> 2. Remove Akka from Spark’s API dependency (in streaming), so user
>>>> >>>> applications can use Akka (SPARK-5293). We have gotten a lot of
>>>> complaints
>>>> >>>> about user applications being unable to use Akka due to Spark’s
>>>> dependency
>>>> >>>> on Akka.
>>>> >>>>
>>>> >>>> 3. Remove Guava from Spark’s public API (JavaRDD Optional).
>>>> >>>>
>>>> >>>> 4. Better class package structure for low level developer API’s. In
>>>> >>>> particular, we have some DeveloperApi (mostly various
>>>> listener-related
>>>> >>>> classes) added over the years. Some packages include only one or
>>>> two public
>>>> >>>> classes but a lot of private classes. A better structure is to
>>>> have public
>>>> >>>> classes isolated to a few public packages, and these public
>>>> packages should
>>>> >>>> have minimal private classes for low level developer APIs.
>>>> >>>>
>>>> >>>> 5. Consolidate task metric and accumulator API. Although having
>>>> some
>>>> >>>> subtle differences, these two are very similar but have completely
>>>> different
>>>> >>>> code path.
>>>> >>>>
>>>> >>>> 6. Possibly making Catalyst, Dataset, and DataFrame more general by
>>>> >>>> moving them to other package(s). They are already used beyond SQL,
>>>> e.g. in
>>>> >>>> ML pipelines, and will be used by streaming also.
>>>> >>>>
>>>> >>>>
>>>> >>>> Operation/Deployment
>>>> >>>>
>>>> >>>> 1. Scala 2.11 as the default build. We should still support Scala
>>>> 2.10,
>>>> >>>> but it has been end-of-life.
>>>> >>>>
>>>> >>>> 2. Remove Hadoop 1 support.
>>>> >>>>
>>>> >>>> 3. Assembly-free distribution of Spark: don’t require building an
>>>> >>>> enormous assembly jar in order to run Spark.
>>>> >>>>
>>>> >>
>>>> >
>>>>
>>>
>>>
>>
>

Re: A proposal for Spark 2.0

Posted by Reynold Xin <rx...@databricks.com>.
Maybe a better idea is to un-deprecate an API if it is too important to be
removed.

I don't think we can drop Java 7 support. It's way too soon.
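
(To make the "un-deprecate" mechanics concrete, here is a minimal Scala
sketch with a made-up method -- not a real Spark API -- just the standard
@deprecated annotation being dropped between releases:)

    object DeprecationSketch {
      // 1.x: marked deprecated, so every caller gets a compiler warning.
      @deprecated("use newThing instead", "1.2.0")
      def oldThing(n: Int): Int = n * 2

      // 2.0, if oldThing turns out to be too important to remove:
      // un-deprecating just means dropping the annotation (and the warning)
      // while keeping the signature and behavior unchanged.
      def oldThingUndeprecated(n: Int): Int = n * 2
    }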



On Tue, Nov 10, 2015 at 4:59 PM, Mark Hamstra <ma...@clearstorydata.com>
wrote:

> Really, Sandy?  "Extra consideration" even for already-deprecated API?  If
> we're not going to remove these with a major version change, then just when
> will we remove them?
>
> On Tue, Nov 10, 2015 at 4:53 PM, Sandy Ryza <sa...@cloudera.com>
> wrote:
>
>> Another +1 to Reynold's proposal.
>>
>> Maybe this is obvious, but I'd like to advocate against a blanket removal
>> of deprecated / developer APIs.  Many APIs can likely be removed without
>> material impact (e.g. the SparkContext constructor that takes preferred
>> node location data), while others likely see heavier usage (e.g. I wouldn't
>> be surprised if mapPartitionsWithContext was baked into a number of apps)
>> and merit a little extra consideration.
>>
>> Maybe also obvious, but I think a migration guide with API equivalents and
>> the like would be incredibly useful in easing the transition.
>>
>> -Sandy
>>
>> On Tue, Nov 10, 2015 at 4:28 PM, Reynold Xin <rx...@databricks.com> wrote:
>>
>>> Echoing Shivaram here. I don't think it makes a lot of sense to add more
>>> features to the 1.x line. We should still do critical bug fixes though.
>>>
>>>
>>> On Tue, Nov 10, 2015 at 4:23 PM, Shivaram Venkataraman <
>>> shivaram@eecs.berkeley.edu> wrote:
>>>
>>>> +1
>>>>
>>>> On a related note I think making it lightweight will ensure that we
>>>> stay on the current release schedule and don't unnecessarily delay 2.0
>>>> to wait for new features / big architectural changes.
>>>>
>>>> In terms of fixes to 1.x, I think our current policy of back-porting
>>>> fixes to older releases would still apply. I don't think developing
>>>> new features on both 1.x and 2.x makes a lot of sense as we would like
>>>> users to switch to 2.x.
>>>>
>>>> Shivaram
>>>>
>>>> On Tue, Nov 10, 2015 at 4:02 PM, Kostas Sakellis <ko...@cloudera.com>
>>>> wrote:
>>>> > +1 on a lightweight 2.0
>>>> >
>>>> > What is the thinking around the 1.x line after Spark 2.0 is released?
>>>> If not
>>>> > terminated, how will we determine what goes into each major version
>>>> line?
>>>> > Will 1.x only be for stability fixes?
>>>> >
>>>> > Thanks,
>>>> > Kostas
>>>> >
>>>> > On Tue, Nov 10, 2015 at 3:41 PM, Patrick Wendell <pw...@gmail.com>
>>>> wrote:
>>>> >>
>>>> >> I also feel the same as Reynold. I agree we should minimize API
>>>> breaks and
>>>> >> focus on fixing things around the edge that were mistakes (e.g.
>>>> exposing
>>>> >> Guava and Akka) rather than any overhaul that could fragment the
>>>> community.
>>>> >> Ideally a major release is a lightweight process we can do every
>>>> couple of
>>>> >> years, with minimal impact for users.
>>>> >>
>>>> >> - Patrick
>>>> >>
>>>> >> On Tue, Nov 10, 2015 at 3:35 PM, Nicholas Chammas
>>>> >> <ni...@gmail.com> wrote:
>>>> >>>
>>>> >>> > For this reason, I would *not* propose doing major releases to
>>>> break
>>>> >>> > substantial API's or perform large re-architecting that prevent
>>>> users from
>>>> >>> > upgrading. Spark has always had a culture of evolving architecture
>>>> >>> > incrementally and making changes - and I don't think we want to
>>>> change this
>>>> >>> > model.
>>>> >>>
>>>> >>> +1 for this. The Python community went through a lot of turmoil
>>>> over the
>>>> >>> Python 2 -> Python 3 transition because the upgrade process was too
>>>> painful
>>>> >>> for too long. The Spark community will benefit greatly from our
>>>> explicitly
>>>> >>> looking to avoid a similar situation.
>>>> >>>
>>>> >>> > 3. Assembly-free distribution of Spark: don’t require building an
>>>> >>> > enormous assembly jar in order to run Spark.
>>>> >>>
>>>> >>> Could you elaborate a bit on this? I'm not sure what an
>>>> assembly-free
>>>> >>> distribution means.
>>>> >>>
>>>> >>> Nick
>>>> >>>
>>>> >>> On Tue, Nov 10, 2015 at 6:11 PM Reynold Xin <rx...@databricks.com>
>>>> wrote:
>>>> >>>>
>>>> >>>> I’m starting a new thread since the other one got intermixed with
>>>> >>>> feature requests. Please refrain from making feature request in
>>>> this thread.
>>>> >>>> Not that we shouldn’t be adding features, but we can always add
>>>> features in
>>>> >>>> 1.7, 2.1, 2.2, ...
>>>> >>>>
>>>> >>>> First - I want to propose a premise for how to think about Spark
>>>> 2.0 and
>>>> >>>> major releases in Spark, based on discussion with several members
>>>> of the
>>>> >>>> community: a major release should be low overhead and minimally
>>>> disruptive
>>>> >>>> to the Spark community. A major release should not be very
>>>> different from a
>>>> >>>> minor release and should not be gated based on new features. The
>>>> main
>>>> >>>> purpose of a major release is an opportunity to fix things that
>>>> are broken
>>>> >>>> in the current API and remove certain deprecated APIs (examples
>>>> follow).
>>>> >>>>
>>>> >>>> For this reason, I would *not* propose doing major releases to
>>>> break
>>>> >>>> substantial API's or perform large re-architecting that prevent
>>>> users from
>>>> >>>> upgrading. Spark has always had a culture of evolving architecture
>>>> >>>> incrementally and making changes - and I don't think we want to
>>>> change this
>>>> >>>> model. In fact, we’ve released many architectural changes on the
>>>> 1.X line.
>>>> >>>>
>>>> >>>> If the community likes the above model, then to me it seems
>>>> reasonable
>>>> >>>> to do Spark 2.0 either after Spark 1.6 (in lieu of Spark 1.7) or
>>>> immediately
>>>> >>>> after Spark 1.7. It will be 18 or 21 months since Spark 1.0. A
>>>> cadence of
>>>> >>>> major releases every 2 years seems doable within the above model.
>>>> >>>>
>>>> >>>> Under this model, here is a list of example things I would propose
>>>> doing
>>>> >>>> in Spark 2.0, separated into APIs and Operation/Deployment:
>>>> >>>>
>>>> >>>>
>>>> >>>> APIs
>>>> >>>>
>>>> >>>> 1. Remove interfaces, configs, and modules (e.g. Bagel) deprecated
>>>> in
>>>> >>>> Spark 1.x.
>>>> >>>>
>>>> >>>> 2. Remove Akka from Spark’s API dependency (in streaming), so user
>>>> >>>> applications can use Akka (SPARK-5293). We have gotten a lot of
>>>> complaints
>>>> >>>> about user applications being unable to use Akka due to Spark’s
>>>> dependency
>>>> >>>> on Akka.
>>>> >>>>
>>>> >>>> 3. Remove Guava from Spark’s public API (JavaRDD Optional).
>>>> >>>>
>>>> >>>> 4. Better class package structure for low level developer API’s. In
>>>> >>>> particular, we have some DeveloperApi (mostly various
>>>> listener-related
>>>> >>>> classes) added over the years. Some packages include only one or
>>>> two public
>>>> >>>> classes but a lot of private classes. A better structure is to
>>>> have public
>>>> >>>> classes isolated to a few public packages, and these public
>>>> packages should
>>>> >>>> have minimal private classes for low level developer APIs.
>>>> >>>>
>>>> >>>> 5. Consolidate task metric and accumulator API. Although having
>>>> some
>>>> >>>> subtle differences, these two are very similar but have completely
>>>> different
>>>> >>>> code path.
>>>> >>>>
>>>> >>>> 6. Possibly making Catalyst, Dataset, and DataFrame more general by
>>>> >>>> moving them to other package(s). They are already used beyond SQL,
>>>> e.g. in
>>>> >>>> ML pipelines, and will be used by streaming also.
>>>> >>>>
>>>> >>>>
>>>> >>>> Operation/Deployment
>>>> >>>>
>>>> >>>> 1. Scala 2.11 as the default build. We should still support Scala
>>>> 2.10,
>>>> >>>> but it has been end-of-life.
>>>> >>>>
>>>> >>>> 2. Remove Hadoop 1 support.
>>>> >>>>
>>>> >>>> 3. Assembly-free distribution of Spark: don’t require building an
>>>> >>>> enormous assembly jar in order to run Spark.
>>>> >>>>
>>>> >>
>>>> >
>>>>
>>>
>>>
>>
>

Re: A proposal for Spark 2.0

Posted by Mark Hamstra <ma...@clearstorydata.com>.
Really, Sandy?  "Extra consideration" even for already-deprecated API?  If
we're not going to remove these with a major version change, then just when
will we remove them?

On Tue, Nov 10, 2015 at 4:53 PM, Sandy Ryza <sa...@cloudera.com> wrote:

> Another +1 to Reynold's proposal.
>
> Maybe this is obvious, but I'd like to advocate against a blanket removal
> of deprecated / developer APIs.  Many APIs can likely be removed without
> material impact (e.g. the SparkContext constructor that takes preferred
> node location data), while others likely see heavier usage (e.g. I wouldn't
> be surprised if mapPartitionsWithContext was baked into a number of apps)
> and merit a little extra consideration.
>
> Maybe also obvious, but I think a migration guide with API equivalents and
> the like would be incredibly useful in easing the transition.
>
> -Sandy
>
> On Tue, Nov 10, 2015 at 4:28 PM, Reynold Xin <rx...@databricks.com> wrote:
>
>> Echoing Shivaram here. I don't think it makes a lot of sense to add more
>> features to the 1.x line. We should still do critical bug fixes though.
>>
>>
>> On Tue, Nov 10, 2015 at 4:23 PM, Shivaram Venkataraman <
>> shivaram@eecs.berkeley.edu> wrote:
>>
>>> +1
>>>
>>> On a related note I think making it lightweight will ensure that we
>>> stay on the current release schedule and don't unnecessarily delay 2.0
>>> to wait for new features / big architectural changes.
>>>
>>> In terms of fixes to 1.x, I think our current policy of back-porting
>>> fixes to older releases would still apply. I don't think developing
>>> new features on both 1.x and 2.x makes a lot of sense as we would like
>>> users to switch to 2.x.
>>>
>>> Shivaram
>>>
>>> On Tue, Nov 10, 2015 at 4:02 PM, Kostas Sakellis <ko...@cloudera.com>
>>> wrote:
>>> > +1 on a lightweight 2.0
>>> >
>>> > What is the thinking around the 1.x line after Spark 2.0 is released?
>>> If not
>>> > terminated, how will we determine what goes into each major version
>>> line?
>>> > Will 1.x only be for stability fixes?
>>> >
>>> > Thanks,
>>> > Kostas
>>> >
>>> > On Tue, Nov 10, 2015 at 3:41 PM, Patrick Wendell <pw...@gmail.com>
>>> wrote:
>>> >>
>>> >> I also feel the same as Reynold. I agree we should minimize API
>>> breaks and
>>> >> focus on fixing things around the edge that were mistakes (e.g.
>>> exposing
>>> >> Guava and Akka) rather than any overhaul that could fragment the
>>> community.
>>> >> Ideally a major release is a lightweight process we can do every
>>> couple of
>>> >> years, with minimal impact for users.
>>> >>
>>> >> - Patrick
>>> >>
>>> >> On Tue, Nov 10, 2015 at 3:35 PM, Nicholas Chammas
>>> >> <ni...@gmail.com> wrote:
>>> >>>
>>> >>> > For this reason, I would *not* propose doing major releases to
>>> break
>>> >>> > substantial API's or perform large re-architecting that prevent
>>> users from
>>> >>> > upgrading. Spark has always had a culture of evolving architecture
>>> >>> > incrementally and making changes - and I don't think we want to
>>> change this
>>> >>> > model.
>>> >>>
>>> >>> +1 for this. The Python community went through a lot of turmoil over
>>> the
>>> >>> Python 2 -> Python 3 transition because the upgrade process was too
>>> painful
>>> >>> for too long. The Spark community will benefit greatly from our
>>> explicitly
>>> >>> looking to avoid a similar situation.
>>> >>>
>>> >>> > 3. Assembly-free distribution of Spark: don’t require building an
>>> >>> > enormous assembly jar in order to run Spark.
>>> >>>
>>> >>> Could you elaborate a bit on this? I'm not sure what an assembly-free
>>> >>> distribution means.
>>> >>>
>>> >>> Nick
>>> >>>
>>> >>> On Tue, Nov 10, 2015 at 6:11 PM Reynold Xin <rx...@databricks.com>
>>> wrote:
>>> >>>>
>>> >>>> I’m starting a new thread since the other one got intermixed with
>>> >>>> feature requests. Please refrain from making feature request in
>>> this thread.
>>> >>>> Not that we shouldn’t be adding features, but we can always add
>>> features in
>>> >>>> 1.7, 2.1, 2.2, ...
>>> >>>>
>>> >>>> First - I want to propose a premise for how to think about Spark
>>> 2.0 and
>>> >>>> major releases in Spark, based on discussion with several members
>>> of the
>>> >>>> community: a major release should be low overhead and minimally
>>> disruptive
>>> >>>> to the Spark community. A major release should not be very
>>> different from a
>>> >>>> minor release and should not be gated based on new features. The
>>> main
>>> >>>> purpose of a major release is an opportunity to fix things that are
>>> broken
>>> >>>> in the current API and remove certain deprecated APIs (examples
>>> follow).
>>> >>>>
>>> >>>> For this reason, I would *not* propose doing major releases to break
>>> >>>> substantial API's or perform large re-architecting that prevent
>>> users from
>>> >>>> upgrading. Spark has always had a culture of evolving architecture
>>> >>>> incrementally and making changes - and I don't think we want to
>>> change this
>>> >>>> model. In fact, we’ve released many architectural changes on the
>>> 1.X line.
>>> >>>>
>>> >>>> If the community likes the above model, then to me it seems
>>> reasonable
>>> >>>> to do Spark 2.0 either after Spark 1.6 (in lieu of Spark 1.7) or
>>> immediately
>>> >>>> after Spark 1.7. It will be 18 or 21 months since Spark 1.0. A
>>> cadence of
>>> >>>> major releases every 2 years seems doable within the above model.
>>> >>>>
>>> >>>> Under this model, here is a list of example things I would propose
>>> doing
>>> >>>> in Spark 2.0, separated into APIs and Operation/Deployment:
>>> >>>>
>>> >>>>
>>> >>>> APIs
>>> >>>>
>>> >>>> 1. Remove interfaces, configs, and modules (e.g. Bagel) deprecated
>>> in
>>> >>>> Spark 1.x.
>>> >>>>
>>> >>>> 2. Remove Akka from Spark’s API dependency (in streaming), so user
>>> >>>> applications can use Akka (SPARK-5293). We have gotten a lot of
>>> complaints
>>> >>>> about user applications being unable to use Akka due to Spark’s
>>> dependency
>>> >>>> on Akka.
>>> >>>>
>>> >>>> 3. Remove Guava from Spark’s public API (JavaRDD Optional).
>>> >>>>
>>> >>>> 4. Better class package structure for low level developer API’s. In
>>> >>>> particular, we have some DeveloperApi (mostly various
>>> listener-related
>>> >>>> classes) added over the years. Some packages include only one or
>>> two public
>>> >>>> classes but a lot of private classes. A better structure is to have
>>> public
>>> >>>> classes isolated to a few public packages, and these public
>>> packages should
>>> >>>> have minimal private classes for low level developer APIs.
>>> >>>>
>>> >>>> 5. Consolidate task metric and accumulator API. Although having some
>>> >>>> subtle differences, these two are very similar but have completely
>>> different
>>> >>>> code path.
>>> >>>>
>>> >>>> 6. Possibly making Catalyst, Dataset, and DataFrame more general by
>>> >>>> moving them to other package(s). They are already used beyond SQL,
>>> e.g. in
>>> >>>> ML pipelines, and will be used by streaming also.
>>> >>>>
>>> >>>>
>>> >>>> Operation/Deployment
>>> >>>>
>>> >>>> 1. Scala 2.11 as the default build. We should still support Scala
>>> 2.10,
>>> >>>> but it has been end-of-life.
>>> >>>>
>>> >>>> 2. Remove Hadoop 1 support.
>>> >>>>
>>> >>>> 3. Assembly-free distribution of Spark: don’t require building an
>>> >>>> enormous assembly jar in order to run Spark.
>>> >>>>
>>> >>
>>> >
>>>
>>
>>
>

Re: A proposal for Spark 2.0

Posted by Sandy Ryza <sa...@cloudera.com>.
Another +1 to Reynold's proposal.

Maybe this is obvious, but I'd like to advocate against a blanket removal
of deprecated / developer APIs.  Many APIs can likely be removed without
material impact (e.g. the SparkContext constructor that takes preferred
node location data), while others likely see heavier usage (e.g. I wouldn't
be surprised if mapPartitionsWithContext was baked into a number of apps)
and merit a little extra consideration.
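
(For concreteness, a rough Scala sketch of the kind of usage I mean for
mapPartitionsWithContext -- illustrative only, written from memory against
the 1.x API, so treat the exact signatures here as an assumption. The
non-deprecated equivalent just pulls the context from TaskContext.get()
inside a plain mapPartitions:)

    import org.apache.spark.{SparkConf, SparkContext, TaskContext}

    object MapPartitionsWithContextSketch {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(
          new SparkConf().setAppName("sketch").setMaster("local[2]"))
        val rdd = sc.parallelize(1 to 100, numSlices = 4)

        // Deprecated style: the TaskContext is handed to the closure directly.
        val tagged = rdd.mapPartitionsWithContext { (ctx, iter) =>
          iter.map(x => (ctx.partitionId(), x))
        }

        // Non-deprecated equivalent: look the context up inside the closure.
        val taggedViaGet = rdd.mapPartitions { iter =>
          val ctx = TaskContext.get()
          iter.map(x => (ctx.partitionId(), x))
        }

        // Both produce the same (partitionId, value) pairs.
        println(tagged.count() == taggedViaGet.count())
        sc.stop()
      }
    }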

Maybe also obvious, but I think a migration guide with API equivalents and
the like would be incredibly useful in easing the transition.

-Sandy

On Tue, Nov 10, 2015 at 4:28 PM, Reynold Xin <rx...@databricks.com> wrote:

> Echoing Shivaram here. I don't think it makes a lot of sense to add more
> features to the 1.x line. We should still do critical bug fixes though.
>
>
> On Tue, Nov 10, 2015 at 4:23 PM, Shivaram Venkataraman <
> shivaram@eecs.berkeley.edu> wrote:
>
>> +1
>>
>> On a related note I think making it lightweight will ensure that we
>> stay on the current release schedule and don't unnecessarily delay 2.0
>> to wait for new features / big architectural changes.
>>
>> In terms of fixes to 1.x, I think our current policy of back-porting
>> fixes to older releases would still apply. I don't think developing
>> new features on both 1.x and 2.x makes a lot of sense as we would like
>> users to switch to 2.x.
>>
>> Shivaram
>>
>> On Tue, Nov 10, 2015 at 4:02 PM, Kostas Sakellis <ko...@cloudera.com>
>> wrote:
>> > +1 on a lightweight 2.0
>> >
>> > What is the thinking around the 1.x line after Spark 2.0 is released?
>> If not
>> > terminated, how will we determine what goes into each major version
>> line?
>> > Will 1.x only be for stability fixes?
>> >
>> > Thanks,
>> > Kostas
>> >
>> > On Tue, Nov 10, 2015 at 3:41 PM, Patrick Wendell <pw...@gmail.com>
>> wrote:
>> >>
>> >> I also feel the same as Reynold. I agree we should minimize API breaks
>> and
>> >> focus on fixing things around the edge that were mistakes (e.g.
>> exposing
>> >> Guava and Akka) rather than any overhaul that could fragment the
>> community.
>> >> Ideally a major release is a lightweight process we can do every
>> couple of
>> >> years, with minimal impact for users.
>> >>
>> >> - Patrick
>> >>
>> >> On Tue, Nov 10, 2015 at 3:35 PM, Nicholas Chammas
>> >> <ni...@gmail.com> wrote:
>> >>>
>> >>> > For this reason, I would *not* propose doing major releases to break
>> >>> > substantial API's or perform large re-architecting that prevent
>> users from
>> >>> > upgrading. Spark has always had a culture of evolving architecture
>> >>> > incrementally and making changes - and I don't think we want to
>> change this
>> >>> > model.
>> >>>
>> >>> +1 for this. The Python community went through a lot of turmoil over
>> the
>> >>> Python 2 -> Python 3 transition because the upgrade process was too
>> painful
>> >>> for too long. The Spark community will benefit greatly from our
>> explicitly
>> >>> looking to avoid a similar situation.
>> >>>
>> >>> > 3. Assembly-free distribution of Spark: don’t require building an
>> >>> > enormous assembly jar in order to run Spark.
>> >>>
>> >>> Could you elaborate a bit on this? I'm not sure what an assembly-free
>> >>> distribution means.
>> >>>
>> >>> Nick
>> >>>
>> >>> On Tue, Nov 10, 2015 at 6:11 PM Reynold Xin <rx...@databricks.com>
>> wrote:
>> >>>>
>> >>>> I’m starting a new thread since the other one got intermixed with
>> >>>> feature requests. Please refrain from making feature request in this
>> thread.
>> >>>> Not that we shouldn’t be adding features, but we can always add
>> features in
>> >>>> 1.7, 2.1, 2.2, ...
>> >>>>
>> >>>> First - I want to propose a premise for how to think about Spark 2.0
>> and
>> >>>> major releases in Spark, based on discussion with several members of
>> the
>> >>>> community: a major release should be low overhead and minimally
>> disruptive
>> >>>> to the Spark community. A major release should not be very different
>> from a
>> >>>> minor release and should not be gated based on new features. The main
>> >>>> purpose of a major release is an opportunity to fix things that are
>> broken
>> >>>> in the current API and remove certain deprecated APIs (examples
>> follow).
>> >>>>
>> >>>> For this reason, I would *not* propose doing major releases to break
>> >>>> substantial API's or perform large re-architecting that prevent
>> users from
>> >>>> upgrading. Spark has always had a culture of evolving architecture
>> >>>> incrementally and making changes - and I don't think we want to
>> change this
>> >>>> model. In fact, we’ve released many architectural changes on the 1.X
>> line.
>> >>>>
>> >>>> If the community likes the above model, then to me it seems
>> reasonable
>> >>>> to do Spark 2.0 either after Spark 1.6 (in lieu of Spark 1.7) or
>> immediately
>> >>>> after Spark 1.7. It will be 18 or 21 months since Spark 1.0. A
>> cadence of
>> >>>> major releases every 2 years seems doable within the above model.
>> >>>>
>> >>>> Under this model, here is a list of example things I would propose
>> doing
>> >>>> in Spark 2.0, separated into APIs and Operation/Deployment:
>> >>>>
>> >>>>
>> >>>> APIs
>> >>>>
>> >>>> 1. Remove interfaces, configs, and modules (e.g. Bagel) deprecated in
>> >>>> Spark 1.x.
>> >>>>
>> >>>> 2. Remove Akka from Spark’s API dependency (in streaming), so user
>> >>>> applications can use Akka (SPARK-5293). We have gotten a lot of
>> complaints
>> >>>> about user applications being unable to use Akka due to Spark’s
>> dependency
>> >>>> on Akka.
>> >>>>
>> >>>> 3. Remove Guava from Spark’s public API (JavaRDD Optional).
>> >>>>
>> >>>> 4. Better class package structure for low level developer API’s. In
>> >>>> particular, we have some DeveloperApi (mostly various
>> listener-related
>> >>>> classes) added over the years. Some packages include only one or two
>> public
>> >>>> classes but a lot of private classes. A better structure is to have
>> public
>> >>>> classes isolated to a few public packages, and these public packages
>> should
>> >>>> have minimal private classes for low level developer APIs.
>> >>>>
>> >>>> 5. Consolidate task metric and accumulator API. Although having some
>> >>>> subtle differences, these two are very similar but have completely
>> different
>> >>>> code path.
>> >>>>
>> >>>> 6. Possibly making Catalyst, Dataset, and DataFrame more general by
>> >>>> moving them to other package(s). They are already used beyond SQL,
>> e.g. in
>> >>>> ML pipelines, and will be used by streaming also.
>> >>>>
>> >>>>
>> >>>> Operation/Deployment
>> >>>>
>> >>>> 1. Scala 2.11 as the default build. We should still support Scala
>> 2.10,
>> >>>> but it has been end-of-life.
>> >>>>
>> >>>> 2. Remove Hadoop 1 support.
>> >>>>
>> >>>> 3. Assembly-free distribution of Spark: don’t require building an
>> >>>> enormous assembly jar in order to run Spark.
>> >>>>
>> >>
>> >
>>
>
>

Re: A proposal for Spark 2.0

Posted by Reynold Xin <rx...@databricks.com>.
Echoing Shivaram here. I don't think it makes a lot of sense to add more
features to the 1.x line. We should still do critical bug fixes though.


On Tue, Nov 10, 2015 at 4:23 PM, Shivaram Venkataraman <
shivaram@eecs.berkeley.edu> wrote:

> +1
>
> On a related note I think making it lightweight will ensure that we
> stay on the current release schedule and don't unnecessarily delay 2.0
> to wait for new features / big architectural changes.
>
> In terms of fixes to 1.x, I think our current policy of back-porting
> fixes to older releases would still apply. I don't think developing
> new features on both 1.x and 2.x makes a lot of sense as we would like
> users to switch to 2.x.
>
> Shivaram
>
> On Tue, Nov 10, 2015 at 4:02 PM, Kostas Sakellis <ko...@cloudera.com>
> wrote:
> > +1 on a lightweight 2.0
> >
> > What is the thinking around the 1.x line after Spark 2.0 is released? If
> not
> > terminated, how will we determine what goes into each major version line?
> > Will 1.x only be for stability fixes?
> >
> > Thanks,
> > Kostas
> >
> > On Tue, Nov 10, 2015 at 3:41 PM, Patrick Wendell <pw...@gmail.com>
> wrote:
> >>
> >> I also feel the same as Reynold. I agree we should minimize API breaks
> and
> >> focus on fixing things around the edge that were mistakes (e.g. exposing
> >> Guava and Akka) rather than any overhaul that could fragment the
> community.
> >> Ideally a major release is a lightweight process we can do every couple
> of
> >> years, with minimal impact for users.
> >>
> >> - Patrick
> >>
> >> On Tue, Nov 10, 2015 at 3:35 PM, Nicholas Chammas
> >> <ni...@gmail.com> wrote:
> >>>
> >>> > For this reason, I would *not* propose doing major releases to break
> >>> > substantial API's or perform large re-architecting that prevent
> users from
> >>> > upgrading. Spark has always had a culture of evolving architecture
> >>> > incrementally and making changes - and I don't think we want to
> change this
> >>> > model.
> >>>
> >>> +1 for this. The Python community went through a lot of turmoil over
> the
> >>> Python 2 -> Python 3 transition because the upgrade process was too
> painful
> >>> for too long. The Spark community will benefit greatly from our
> explicitly
> >>> looking to avoid a similar situation.
> >>>
> >>> > 3. Assembly-free distribution of Spark: don’t require building an
> >>> > enormous assembly jar in order to run Spark.
> >>>
> >>> Could you elaborate a bit on this? I'm not sure what an assembly-free
> >>> distribution means.
> >>>
> >>> Nick
> >>>
> >>> On Tue, Nov 10, 2015 at 6:11 PM Reynold Xin <rx...@databricks.com>
> wrote:
> >>>>
> >>>> I’m starting a new thread since the other one got intermixed with
> >>>> feature requests. Please refrain from making feature request in this
> thread.
> >>>> Not that we shouldn’t be adding features, but we can always add
> features in
> >>>> 1.7, 2.1, 2.2, ...
> >>>>
> >>>> First - I want to propose a premise for how to think about Spark 2.0
> and
> >>>> major releases in Spark, based on discussion with several members of
> the
> >>>> community: a major release should be low overhead and minimally
> disruptive
> >>>> to the Spark community. A major release should not be very different
> from a
> >>>> minor release and should not be gated based on new features. The main
> >>>> purpose of a major release is an opportunity to fix things that are
> broken
> >>>> in the current API and remove certain deprecated APIs (examples
> follow).
> >>>>
> >>>> For this reason, I would *not* propose doing major releases to break
> >>>> substantial API's or perform large re-architecting that prevent users
> from
> >>>> upgrading. Spark has always had a culture of evolving architecture
> >>>> incrementally and making changes - and I don't think we want to
> change this
> >>>> model. In fact, we’ve released many architectural changes on the 1.X
> line.
> >>>>
> >>>> If the community likes the above model, then to me it seems reasonable
> >>>> to do Spark 2.0 either after Spark 1.6 (in lieu of Spark 1.7) or
> immediately
> >>>> after Spark 1.7. It will be 18 or 21 months since Spark 1.0. A
> cadence of
> >>>> major releases every 2 years seems doable within the above model.
> >>>>
> >>>> Under this model, here is a list of example things I would propose
> doing
> >>>> in Spark 2.0, separated into APIs and Operation/Deployment:
> >>>>
> >>>>
> >>>> APIs
> >>>>
> >>>> 1. Remove interfaces, configs, and modules (e.g. Bagel) deprecated in
> >>>> Spark 1.x.
> >>>>
> >>>> 2. Remove Akka from Spark’s API dependency (in streaming), so user
> >>>> applications can use Akka (SPARK-5293). We have gotten a lot of
> complaints
> >>>> about user applications being unable to use Akka due to Spark’s
> dependency
> >>>> on Akka.
> >>>>
> >>>> 3. Remove Guava from Spark’s public API (JavaRDD Optional).
> >>>>
> >>>> 4. Better class package structure for low level developer API’s. In
> >>>> particular, we have some DeveloperApi (mostly various listener-related
> >>>> classes) added over the years. Some packages include only one or two
> public
> >>>> classes but a lot of private classes. A better structure is to have
> public
> >>>> classes isolated to a few public packages, and these public packages
> should
> >>>> have minimal private classes for low level developer APIs.
> >>>>
> >>>> 5. Consolidate task metric and accumulator API. Although having some
> >>>> subtle differences, these two are very similar but have completely
> different
> >>>> code path.
> >>>>
> >>>> 6. Possibly making Catalyst, Dataset, and DataFrame more general by
> >>>> moving them to other package(s). They are already used beyond SQL,
> e.g. in
> >>>> ML pipelines, and will be used by streaming also.
> >>>>
> >>>>
> >>>> Operation/Deployment
> >>>>
> >>>> 1. Scala 2.11 as the default build. We should still support Scala
> 2.10,
> >>>> but it has been end-of-life.
> >>>>
> >>>> 2. Remove Hadoop 1 support.
> >>>>
> >>>> 3. Assembly-free distribution of Spark: don’t require building an
> >>>> enormous assembly jar in order to run Spark.
> >>>>
> >>
> >
>

Re: A proposal for Spark 2.0

Posted by Shivaram Venkataraman <sh...@eecs.berkeley.edu>.
+1

On a related note I think making it lightweight will ensure that we
stay on the current release schedule and don't unnecessarily delay 2.0
to wait for new features / big architectural changes.

In terms of fixes to 1.x, I think our current policy of back-porting
fixes to older releases would still apply. I don't think developing
new features on both 1.x and 2.x makes a lot of sense as we would like
users to switch to 2.x.

Shivaram

On Tue, Nov 10, 2015 at 4:02 PM, Kostas Sakellis <ko...@cloudera.com> wrote:
> +1 on a lightweight 2.0
>
> What is the thinking around the 1.x line after Spark 2.0 is released? If not
> terminated, how will we determine what goes into each major version line?
> Will 1.x only be for stability fixes?
>
> Thanks,
> Kostas
>
> On Tue, Nov 10, 2015 at 3:41 PM, Patrick Wendell <pw...@gmail.com> wrote:
>>
>> I also feel the same as Reynold. I agree we should minimize API breaks and
>> focus on fixing things around the edge that were mistakes (e.g. exposing
>> Guava and Akka) rather than any overhaul that could fragment the community.
>> Ideally a major release is a lightweight process we can do every couple of
>> years, with minimal impact for users.
>>
>> - Patrick
>>
>> On Tue, Nov 10, 2015 at 3:35 PM, Nicholas Chammas
>> <ni...@gmail.com> wrote:
>>>
>>> > For this reason, I would *not* propose doing major releases to break
>>> > substantial API's or perform large re-architecting that prevent users from
>>> > upgrading. Spark has always had a culture of evolving architecture
>>> > incrementally and making changes - and I don't think we want to change this
>>> > model.
>>>
>>> +1 for this. The Python community went through a lot of turmoil over the
>>> Python 2 -> Python 3 transition because the upgrade process was too painful
>>> for too long. The Spark community will benefit greatly from our explicitly
>>> looking to avoid a similar situation.
>>>
>>> > 3. Assembly-free distribution of Spark: don’t require building an
>>> > enormous assembly jar in order to run Spark.
>>>
>>> Could you elaborate a bit on this? I'm not sure what an assembly-free
>>> distribution means.
>>>
>>> Nick
>>>
>>> On Tue, Nov 10, 2015 at 6:11 PM Reynold Xin <rx...@databricks.com> wrote:
>>>>
>>>> I’m starting a new thread since the other one got intermixed with
>>>> feature requests. Please refrain from making feature request in this thread.
>>>> Not that we shouldn’t be adding features, but we can always add features in
>>>> 1.7, 2.1, 2.2, ...
>>>>
>>>> First - I want to propose a premise for how to think about Spark 2.0 and
>>>> major releases in Spark, based on discussion with several members of the
>>>> community: a major release should be low overhead and minimally disruptive
>>>> to the Spark community. A major release should not be very different from a
>>>> minor release and should not be gated based on new features. The main
>>>> purpose of a major release is an opportunity to fix things that are broken
>>>> in the current API and remove certain deprecated APIs (examples follow).
>>>>
>>>> For this reason, I would *not* propose doing major releases to break
>>>> substantial API's or perform large re-architecting that prevent users from
>>>> upgrading. Spark has always had a culture of evolving architecture
>>>> incrementally and making changes - and I don't think we want to change this
>>>> model. In fact, we’ve released many architectural changes on the 1.X line.
>>>>
>>>> If the community likes the above model, then to me it seems reasonable
>>>> to do Spark 2.0 either after Spark 1.6 (in lieu of Spark 1.7) or immediately
>>>> after Spark 1.7. It will be 18 or 21 months since Spark 1.0. A cadence of
>>>> major releases every 2 years seems doable within the above model.
>>>>
>>>> Under this model, here is a list of example things I would propose doing
>>>> in Spark 2.0, separated into APIs and Operation/Deployment:
>>>>
>>>>
>>>> APIs
>>>>
>>>> 1. Remove interfaces, configs, and modules (e.g. Bagel) deprecated in
>>>> Spark 1.x.
>>>>
>>>> 2. Remove Akka from Spark’s API dependency (in streaming), so user
>>>> applications can use Akka (SPARK-5293). We have gotten a lot of complaints
>>>> about user applications being unable to use Akka due to Spark’s dependency
>>>> on Akka.
>>>>
>>>> 3. Remove Guava from Spark’s public API (JavaRDD Optional).
>>>>
>>>> 4. Better class package structure for low level developer API’s. In
>>>> particular, we have some DeveloperApi (mostly various listener-related
>>>> classes) added over the years. Some packages include only one or two public
>>>> classes but a lot of private classes. A better structure is to have public
>>>> classes isolated to a few public packages, and these public packages should
>>>> have minimal private classes for low level developer APIs.
>>>>
>>>> 5. Consolidate task metric and accumulator API. Although having some
>>>> subtle differences, these two are very similar but have completely different
>>>> code path.
>>>>
>>>> 6. Possibly making Catalyst, Dataset, and DataFrame more general by
>>>> moving them to other package(s). They are already used beyond SQL, e.g. in
>>>> ML pipelines, and will be used by streaming also.
>>>>
>>>>
>>>> Operation/Deployment
>>>>
>>>> 1. Scala 2.11 as the default build. We should still support Scala 2.10,
>>>> but it has been end-of-life.
>>>>
>>>> 2. Remove Hadoop 1 support.
>>>>
>>>> 3. Assembly-free distribution of Spark: don’t require building an
>>>> enormous assembly jar in order to run Spark.
>>>>
>>
>



Re: A proposal for Spark 2.0

Posted by Kostas Sakellis <ko...@cloudera.com>.
+1 on a lightweight 2.0

What is the thinking around the 1.x line after Spark 2.0 is released? If
not terminated, how will we determine what goes into each major version
line? Will 1.x only be for stability fixes?

Thanks,
Kostas

On Tue, Nov 10, 2015 at 3:41 PM, Patrick Wendell <pw...@gmail.com> wrote:

> I also feel the same as Reynold. I agree we should minimize API breaks and
> focus on fixing things around the edge that were mistakes (e.g. exposing
> Guava and Akka) rather than any overhaul that could fragment the community.
> Ideally a major release is a lightweight process we can do every couple of
> years, with minimal impact for users.
>
> - Patrick
>
> On Tue, Nov 10, 2015 at 3:35 PM, Nicholas Chammas <
> nicholas.chammas@gmail.com> wrote:
>
>> > For this reason, I would *not* propose doing major releases to break
>> substantial API's or perform large re-architecting that prevent users from
>> upgrading. Spark has always had a culture of evolving architecture
>> incrementally and making changes - and I don't think we want to change this
>> model.
>>
>> +1 for this. The Python community went through a lot of turmoil over the
>> Python 2 -> Python 3 transition because the upgrade process was too painful
>> for too long. The Spark community will benefit greatly from our explicitly
>> looking to avoid a similar situation.
>>
>> > 3. Assembly-free distribution of Spark: don’t require building an
>> enormous assembly jar in order to run Spark.
>>
>> Could you elaborate a bit on this? I'm not sure what an assembly-free
>> distribution means.
>>
>> Nick
>>
>> On Tue, Nov 10, 2015 at 6:11 PM Reynold Xin <rx...@databricks.com> wrote:
>>
>>> I’m starting a new thread since the other one got intermixed with
>>> feature requests. Please refrain from making feature request in this
>>> thread. Not that we shouldn’t be adding features, but we can always add
>>> features in 1.7, 2.1, 2.2, ...
>>>
>>> First - I want to propose a premise for how to think about Spark 2.0 and
>>> major releases in Spark, based on discussion with several members of the
>>> community: a major release should be low overhead and minimally disruptive
>>> to the Spark community. A major release should not be very different from a
>>> minor release and should not be gated based on new features. The main
>>> purpose of a major release is an opportunity to fix things that are broken
>>> in the current API and remove certain deprecated APIs (examples follow).
>>>
>>> For this reason, I would *not* propose doing major releases to break
>>> substantial API's or perform large re-architecting that prevent users from
>>> upgrading. Spark has always had a culture of evolving architecture
>>> incrementally and making changes - and I don't think we want to change this
>>> model. In fact, we’ve released many architectural changes on the 1.X line.
>>>
>>> If the community likes the above model, then to me it seems reasonable
>>> to do Spark 2.0 either after Spark 1.6 (in lieu of Spark 1.7) or
>>> immediately after Spark 1.7. It will be 18 or 21 months since Spark 1.0. A
>>> cadence of major releases every 2 years seems doable within the above model.
>>>
>>> Under this model, here is a list of example things I would propose doing
>>> in Spark 2.0, separated into APIs and Operation/Deployment:
>>>
>>>
>>> APIs
>>>
>>> 1. Remove interfaces, configs, and modules (e.g. Bagel) deprecated in
>>> Spark 1.x.
>>>
>>> 2. Remove Akka from Spark’s API dependency (in streaming), so user
>>> applications can use Akka (SPARK-5293). We have gotten a lot of complaints
>>> about user applications being unable to use Akka due to Spark’s dependency
>>> on Akka.
>>>
>>> 3. Remove Guava from Spark’s public API (JavaRDD Optional).
>>>
>>> 4. Better class package structure for low level developer API’s. In
>>> particular, we have some DeveloperApi (mostly various listener-related
>>> classes) added over the years. Some packages include only one or two public
>>> classes but a lot of private classes. A better structure is to have public
>>> classes isolated to a few public packages, and these public packages should
>>> have minimal private classes for low level developer APIs.
>>>
>>> 5. Consolidate task metric and accumulator API. Although having some
>>> subtle differences, these two are very similar but have completely
>>> different code path.
>>>
>>> 6. Possibly making Catalyst, Dataset, and DataFrame more general by
>>> moving them to other package(s). They are already used beyond SQL, e.g. in
>>> ML pipelines, and will be used by streaming also.
>>>
>>>
>>> Operation/Deployment
>>>
>>> 1. Scala 2.11 as the default build. We should still support Scala 2.10,
>>> but it has been end-of-life.
>>>
>>> 2. Remove Hadoop 1 support.
>>>
>>> 3. Assembly-free distribution of Spark: don’t require building an
>>> enormous assembly jar in order to run Spark.
>>>
>>>
>

Re: A proposal for Spark 2.0

Posted by Patrick Wendell <pw...@gmail.com>.
I also feel the same as Reynold. I agree we should minimize API breaks and
focus on fixing things around the edge that were mistakes (e.g. exposing
Guava and Akka) rather than any overhaul that could fragment the community.
Ideally a major release is a lightweight process we can do every couple of
years, with minimal impact for users.

- Patrick

On Tue, Nov 10, 2015 at 3:35 PM, Nicholas Chammas <
nicholas.chammas@gmail.com> wrote:

> > For this reason, I would *not* propose doing major releases to break
> substantial API's or perform large re-architecting that prevent users from
> upgrading. Spark has always had a culture of evolving architecture
> incrementally and making changes - and I don't think we want to change this
> model.
>
> +1 for this. The Python community went through a lot of turmoil over the
> Python 2 -> Python 3 transition because the upgrade process was too painful
> for too long. The Spark community will benefit greatly from our explicitly
> looking to avoid a similar situation.
>
> > 3. Assembly-free distribution of Spark: don’t require building an
> enormous assembly jar in order to run Spark.
>
> Could you elaborate a bit on this? I'm not sure what an assembly-free
> distribution means.
>
> Nick
>
> On Tue, Nov 10, 2015 at 6:11 PM Reynold Xin <rx...@databricks.com> wrote:
>
>> I’m starting a new thread since the other one got intermixed with feature
>> requests. Please refrain from making feature request in this thread. Not
>> that we shouldn’t be adding features, but we can always add features in
>> 1.7, 2.1, 2.2, ...
>>
>> First - I want to propose a premise for how to think about Spark 2.0 and
>> major releases in Spark, based on discussion with several members of the
>> community: a major release should be low overhead and minimally disruptive
>> to the Spark community. A major release should not be very different from a
>> minor release and should not be gated based on new features. The main
>> purpose of a major release is an opportunity to fix things that are broken
>> in the current API and remove certain deprecated APIs (examples follow).
>>
>> For this reason, I would *not* propose doing major releases to break
>> substantial API's or perform large re-architecting that prevent users from
>> upgrading. Spark has always had a culture of evolving architecture
>> incrementally and making changes - and I don't think we want to change this
>> model. In fact, we’ve released many architectural changes on the 1.X line.
>>
>> If the community likes the above model, then to me it seems reasonable to
>> do Spark 2.0 either after Spark 1.6 (in lieu of Spark 1.7) or immediately
>> after Spark 1.7. It will be 18 or 21 months since Spark 1.0. A cadence of
>> major releases every 2 years seems doable within the above model.
>>
>> Under this model, here is a list of example things I would propose doing
>> in Spark 2.0, separated into APIs and Operation/Deployment:
>>
>>
>> APIs
>>
>> 1. Remove interfaces, configs, and modules (e.g. Bagel) deprecated in
>> Spark 1.x.
>>
>> 2. Remove Akka from Spark’s API dependency (in streaming), so user
>> applications can use Akka (SPARK-5293). We have gotten a lot of complaints
>> about user applications being unable to use Akka due to Spark’s dependency
>> on Akka.
>>
>> 3. Remove Guava from Spark’s public API (JavaRDD Optional).
>>
>> 4. Better class package structure for low level developer API’s. In
>> particular, we have some DeveloperApi (mostly various listener-related
>> classes) added over the years. Some packages include only one or two public
>> classes but a lot of private classes. A better structure is to have public
>> classes isolated to a few public packages, and these public packages should
>> have minimal private classes for low level developer APIs.
>>
>> 5. Consolidate task metric and accumulator API. Although having some
>> subtle differences, these two are very similar but have completely
>> different code path.
>>
>> 6. Possibly making Catalyst, Dataset, and DataFrame more general by
>> moving them to other package(s). They are already used beyond SQL, e.g. in
>> ML pipelines, and will be used by streaming also.
>>
>>
>> Operation/Deployment
>>
>> 1. Scala 2.11 as the default build. We should still support Scala 2.10,
>> but it has been end-of-life.
>>
>> 2. Remove Hadoop 1 support.
>>
>> 3. Assembly-free distribution of Spark: don’t require building an
>> enormous assembly jar in order to run Spark.
>>
>>

Re: A proposal for Spark 2.0

Posted by Josh Rosen <jo...@databricks.com>.
There's a proposal / discussion of assembly-less distributions at
https://github.com/vanzin/spark/pull/2/files and
https://issues.apache.org/jira/browse/SPARK-11157.

On Tue, Nov 10, 2015 at 3:53 PM, Reynold Xin <rx...@databricks.com> wrote:

>
> On Tue, Nov 10, 2015 at 3:35 PM, Nicholas Chammas <
> nicholas.chammas@gmail.com> wrote:
>
>>
>> > 3. Assembly-free distribution of Spark: don’t require building an
>> enormous assembly jar in order to run Spark.
>>
>> Could you elaborate a bit on this? I'm not sure what an assembly-free
>> distribution means.
>>
>>
> Right now we ship Spark using a single assembly jar, which causes a few
> different problems:
>
> - the total number of classes is limited on some configurations
>
> - dependency swapping is harder
>
>
> The proposal is to just avoid a single fat jar.
>
>
>

Re: A proposal for Spark 2.0

Posted by Jean-Baptiste Onofré <jb...@nanthrax.net>.
Hi,

I fully agree with that. Actually, I'm working on a PR to add "client"
and "exploded" profiles to the Maven build.

The client profile creates a spark-client-assembly jar that is much more
lightweight than the spark-assembly. In our case, we build jobs that don't
require the whole Spark server side. With the current assembly, the minimal
size of the generated jar is about 120MB, which is painful at spark-submit
submission time. That's why I started removing unnecessary dependencies
from the spark-assembly.
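
As a rough illustration of the idea (written in sbt syntax rather than the
actual Maven profile, and with purely hypothetical exclusions), a "client"
artifact would pull in only what a job needs to talk to a cluster, leaving
out server-side dependencies that the cluster already provides:

    // Hypothetical sketch only -- the real change is a Maven profile, and
    // the excluded artifacts below are just examples of dependencies a
    // client-side jar would not need to bundle.
    lazy val sparkClient = (project in file("client"))
      .settings(
        name := "spark-client-assembly",
        libraryDependencies += ("org.apache.spark" %% "spark-core" % "1.5.2")
          .exclude("org.apache.hadoop", "hadoop-client")
          .exclude("org.apache.curator", "curator-recipes")
      )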

On the other hand, I'm also working on the "exploded" mode: instead of a
fat, monolithic spark-assembly jar file, an exploded layout lets users
view and change the dependencies.

For the client profile, I already have something ready and will propose
the PR very soon (hopefully by the end of this week). For the exploded
profile, I need more time.

My $0.02

Regards
JB

On 11/11/2015 12:53 AM, Reynold Xin wrote:
>
> On Tue, Nov 10, 2015 at 3:35 PM, Nicholas Chammas
> <nicholas.chammas@gmail.com <ma...@gmail.com>> wrote:
>
>
>     > 3. Assembly-free distribution of Spark: don’t require building an enormous assembly jar in order to run Spark.
>
>     Could you elaborate a bit on this? I'm not sure what an
>     assembly-free distribution means.
>
>
> Right now we ship Spark using a single assembly jar, which causes a few
> different problems:
>
> - the total number of classes is limited on some configurations
>
> - dependency swapping is harder
>
>
> The proposal is to just avoid a single fat jar.
>
>

-- 
Jean-Baptiste Onofré
jbonofre@apache.org
http://blog.nanthrax.net
Talend - http://www.talend.com



Re: A proposal for Spark 2.0

Posted by Reynold Xin <rx...@databricks.com>.
On Tue, Nov 10, 2015 at 3:35 PM, Nicholas Chammas <
nicholas.chammas@gmail.com> wrote:

>
> > 3. Assembly-free distribution of Spark: don’t require building an
> enormous assembly jar in order to run Spark.
>
> Could you elaborate a bit on this? I'm not sure what an assembly-free
> distribution means.
>
>
Right now we ship Spark using a single assembly jar, which causes a few
different problems:

- the total number of classes is limited on some configurations

- dependency swapping is harder


The proposal is to just avoid a single fat jar.
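
To make that concrete, here is a minimal sketch (purely illustrative, not
the actual launcher code), assuming the distribution ships individual jars
under a jars/ directory instead of one spark-assembly-*.jar:

    import java.io.File

    // Hypothetical helper: collect every jar under a jars/ directory into
    // a classpath string, rather than referencing a single fat assembly.
    def classpathFrom(jarsDir: String): String = {
      val jars = Option(new File(jarsDir).listFiles()).getOrElse(Array.empty[File])
      jars.filter(_.getName.endsWith(".jar"))
        .map(_.getAbsolutePath)
        .mkString(File.pathSeparator)
    }

    // The resulting string is passed via -cp in place of the single
    // assembly jar path.

Swapping or upgrading one dependency then becomes replacing one jar in that
directory, and no single file has to hold every class.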

Re: A proposal for Spark 2.0

Posted by Nicholas Chammas <ni...@gmail.com>.
> For this reason, I would *not* propose doing major releases to break
substantial API's or perform large re-architecting that prevent users from
upgrading. Spark has always had a culture of evolving architecture
incrementally and making changes - and I don't think we want to change this
model.

+1 for this. The Python community went through a lot of turmoil over the
Python 2 -> Python 3 transition because the upgrade process was too painful
for too long. The Spark community will benefit greatly from our explicitly
looking to avoid a similar situation.

> 3. Assembly-free distribution of Spark: don’t require building an
enormous assembly jar in order to run Spark.

Could you elaborate a bit on this? I'm not sure what an assembly-free
distribution means.

Nick

On Tue, Nov 10, 2015 at 6:11 PM Reynold Xin <rx...@databricks.com> wrote:

> I’m starting a new thread since the other one got intermixed with feature
> requests. Please refrain from making feature request in this thread. Not
> that we shouldn’t be adding features, but we can always add features in
> 1.7, 2.1, 2.2, ...
>
> First - I want to propose a premise for how to think about Spark 2.0 and
> major releases in Spark, based on discussion with several members of the
> community: a major release should be low overhead and minimally disruptive
> to the Spark community. A major release should not be very different from a
> minor release and should not be gated based on new features. The main
> purpose of a major release is an opportunity to fix things that are broken
> in the current API and remove certain deprecated APIs (examples follow).
>
> For this reason, I would *not* propose doing major releases to break
> substantial API's or perform large re-architecting that prevent users from
> upgrading. Spark has always had a culture of evolving architecture
> incrementally and making changes - and I don't think we want to change this
> model. In fact, we’ve released many architectural changes on the 1.X line.
>
> If the community likes the above model, then to me it seems reasonable to
> do Spark 2.0 either after Spark 1.6 (in lieu of Spark 1.7) or immediately
> after Spark 1.7. It will be 18 or 21 months since Spark 1.0. A cadence of
> major releases every 2 years seems doable within the above model.
>
> Under this model, here is a list of example things I would propose doing
> in Spark 2.0, separated into APIs and Operation/Deployment:
>
>
> APIs
>
> 1. Remove interfaces, configs, and modules (e.g. Bagel) deprecated in
> Spark 1.x.
>
> 2. Remove Akka from Spark’s API dependency (in streaming), so user
> applications can use Akka (SPARK-5293). We have gotten a lot of complaints
> about user applications being unable to use Akka due to Spark’s dependency
> on Akka.
>
> 3. Remove Guava from Spark’s public API (JavaRDD Optional).
>
> 4. Better class package structure for low level developer API’s. In
> particular, we have some DeveloperApi (mostly various listener-related
> classes) added over the years. Some packages include only one or two public
> classes but a lot of private classes. A better structure is to have public
> classes isolated to a few public packages, and these public packages should
> have minimal private classes for low level developer APIs.
>
> 5. Consolidate task metric and accumulator API. Although having some
> subtle differences, these two are very similar but have completely
> different code path.
>
> 6. Possibly making Catalyst, Dataset, and DataFrame more general by moving
> them to other package(s). They are already used beyond SQL, e.g. in ML
> pipelines, and will be used by streaming also.
>
>
> Operation/Deployment
>
> 1. Scala 2.11 as the default build. We should still support Scala 2.10,
> but it has been end-of-life.
>
> 2. Remove Hadoop 1 support.
>
> 3. Assembly-free distribution of Spark: don’t require building an enormous
> assembly jar in order to run Spark.
>
>

Re: A proposal for Spark 2.0

Posted by hitoshi <oz...@worksap.co.jp>.
Resending my earlier message because it wasn't accepted.

I would like to add a proposal to upgrade dependency jars when the upgrade
does not break APIs and fixes a bug.
To be more specific, I would like to see Kryo upgraded from 2.21 to 3.x.
Kryo 2.x has a bug (e.g. SPARK-7708) that is blocking its usage in
production environments.
Other projects like Chill also want to upgrade Kryo to 3.x but are blocked
because Spark won't upgrade. I think the OSS community at large will
benefit if we can coordinate an upgrade to Kryo 3.x.
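
For context, this is the standard way a job opts into Kryo today (the
registered class below is just a placeholder for the kind of application
class that Kryo 2.21 fails to handle):

    import org.apache.spark.SparkConf

    object KryoConfigExample {
      // Placeholder application class; in practice this is whatever class
      // triggers the Kryo 2.x serialization bug.
      case class MyRecord(id: Long, payload: Map[String, String])

      val conf = new SparkConf()
        .setAppName("kryo-example")
        .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
        .registerKryoClasses(Array(classOf[MyRecord]))
    }

The user-facing configuration would not change with Kryo 3.x; the
incompatibility is in the serialization internals, which is why a major
release seems like the right time to make the jump.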






Re: A proposal for Spark 2.0

Posted by hitoshi <oz...@worksap.co.jp>.
It looks like Chill is willing to upgrade its Kryo to 3.x if Spark and
Hive will. As it stands, Spark, Chill, and Hive all ship a Kryo jar, but
it really can't be used because Kryo 2 can't serialize/deserialize some
classes. Since Spark 2.0 is a major release, it would be nice if we could
resolve the Kryo issue.
 
https://github.com/twitter/chill/pull/230#issuecomment-155845959



