Posted to dev@spark.apache.org by Reynold Xin <rx...@databricks.com> on 2018/04/04 23:20:01 UTC

time for Apache Spark 3.0?

There was a discussion thread on scala-contributors
<https://contributors.scala-lang.org/t/spark-as-a-scala-gateway-drug-and-the-2-12-failure/1747>
about Apache Spark not yet supporting Scala 2.12, and that got me to think
perhaps it is about time for Spark to work towards the 3.0 release. By the
time it comes out, it will be more than 2 years since Spark 2.0.

For contributors less familiar with Spark’s history, I want to give more
context on Spark releases:

1. Timeline: Spark 1.0 was released May 2014. Spark 2.0 was July 2016. If
we were to maintain the ~ 2 year cadence, it is time to work on Spark 3.0
in 2018.

2. Spark’s versioning policy promises that Spark does not break stable APIs
in feature releases (e.g. 2.1, 2.2). API breaking changes are sometimes a
necessary evil, and can be done in major releases (e.g. 1.6 to 2.0, 2.x to
3.0).

3. That said, a major version isn’t necessarily a playground for
disruptive API changes that make it painful for users to upgrade. The main
purpose of a major release is the opportunity to fix things that are broken
in the current API and remove certain deprecated APIs.

4. Spark as a project has a culture of evolving architecture and developing
major new features incrementally, so major releases are not the only time
for exciting new features. For example, the bulk of the work in the move
towards the DataFrame API was done in Spark 1.3, and Continuous Processing
was introduced in Spark 2.3. Both were feature releases rather than major
releases.


You can find more background in the thread discussing Spark 2.0:
http://apache-spark-developers-list.1001551.n3.nabble.com/A-proposal-for-Spark-2-0-td15122.html


The primary motivating factor IMO for a major version bump is to support
Scala 2.12, which requires minor API breaking changes to Spark’s APIs.
Similar to Spark 2.0, I think there are also opportunities for other
changes that we know have been biting us for a long time but can’t be
changed in feature releases (to be clear, I’m actually not sure they are
all good ideas, but I’m writing them down as candidates for consideration):

1. Support Scala 2.12.

2. Remove interfaces, configs, and modules (e.g. Bagel) deprecated in Spark
2.x.

3. Shade all dependencies.

4. Change the reserved keywords in Spark SQL to be more ANSI-SQL compliant,
to prevent users from shooting themselves in the foot, e.g. “SELECT 2
SECOND” -- is “SECOND” an interval unit or an alias? (See the sketch after
this list.) To make it less painful for users to upgrade here, I’d suggest
creating a flag for a backward compatibility mode.

5. Similar to 4, make our type coercion rule in DataFrame/SQL more standard
compliant, and have a flag for backward compatibility.

6. Miscellaneous other small changes documented in JIRA already (e.g.
“JavaPairRDD flatMapValues requires function returning Iterable, not
Iterator”, “Prevent column name duplication in temporary view”).
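
A small sketch of the ambiguity in item 4, using a local SparkSession. The
exact parse of the bare form and the resulting column names depend on the
Spark version and any dialect flags, so treat this as illustrative rather
than documented behavior:

import org.apache.spark.sql.SparkSession

object ReservedKeywordSketch extends App {
  val spark = SparkSession.builder()
    .master("local[*]")
    .appName("keyword-sketch")
    .getOrCreate()

  // Reading 1: SECOND as an alias -- a column named SECOND holding the value 2.
  spark.sql("SELECT 2 AS SECOND").show()

  // Reading 2: SECOND as an interval unit -- the literal INTERVAL 2 SECOND.
  spark.sql("SELECT INTERVAL 2 SECOND").show()

  // The bare form has to pick one of the two readings; treating SECOND as a
  // reserved word (behind a backward-compatibility flag) is what item 4 proposes.
  spark.sql("SELECT 2 SECOND").show()

  spark.stop()
}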


Now the reality of a major version bump is that the world often thinks in
terms of what exciting features are coming. I do think there are a number
of major changes happening already that can be part of the 3.0 release, if
they make it in:

1. Scala 2.12 support (listing it twice)
2. Continuous Processing non-experimental
3. Kubernetes support non-experimental
4. A more fleshed out version of data source API v2 (I don’t think it is
realistic to stabilize that in one release)
5. Hadoop 3.0 support
6. ...



Similar to the 2.0 discussion, this thread should focus on the framework
and whether it’d make sense to create Spark 3.0 as the next release, rather
than the individual feature requests. Those are important but are best done
in their own separate threads.

Re: time for Apache Spark 3.0?

Posted by Steve Loughran <st...@hortonworks.com>.

On 5 Apr 2018, at 18:04, Matei Zaharia <ma...@gmail.com> wrote:

> Java 9/10 support would be great to add as well.

Be aware that the work of moving Hadoop core to Java 9+ is still a big piece of work, being undertaken by Akira Ajisaka & colleagues at NTT:

https://issues.apache.org/jira/browse/HADOOP-11123

Big dependency updates, and handling Oracle hiding the sun.misc stuff which low-level code depends on, are the trouble spots, with a move to Log4j 2 going to be observably traumatic to all apps which require a log4j.properties file to set themselves up. As usual, any testing which can be done early will be welcomed by all; the earlier the better.

That stuff is all about getting things working: supporting the Java 9 packaging model, which is a really compelling reason to go for it.


> Regarding Scala 2.12, I thought that supporting it would become easier if we change the Spark API and ABI slightly. Basically, it is of course possible to create an alternate source tree today, but it might be possible to share the same source files if we tweak some small things in the methods that are overloaded across Scala and Java. I don’t remember the exact details, but the idea was to reduce the total maintenance work needed at the cost of requiring users to recompile their apps.

> I’m personally for moving to 3.0 because of the other things we can clean up as well, e.g. the default SQL dialect, Iterable stuff, and possibly dependency shading (a major pain point for lots of users)

Hadoop 3 does have a shaded client, though not enough for Spark; if work identifying & fixing the outstanding dependencies is started now, Hadoop 3.2 should be able to offer the set of shaded libraries needed by Spark.

There's always a price to that, which is in redistributable size and its impact on start times, duplicate classes loaded (memory, reduced chance of JIT recompilation, ...), and the whole transitive-shading problem. Java 9 should be the real target for a clean solution to all of this.
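
For anyone unfamiliar with what shading means in practice, here is a minimal
build-side sketch. It uses sbt-assembly's rename rules for a downstream
application build rather than Spark's actual Maven-based build, and the
relocated package name is purely illustrative:

// build.sbt fragment -- assumes the sbt-assembly plugin is enabled, e.g.
// addSbtPlugin("com.eed3si9n" % "sbt-assembly" % "0.14.6") in project/plugins.sbt,
// which auto-imports the assembly keys used below.

// Relocate Guava classes into a private namespace inside the assembled jar, so the
// application's Guava can no longer clash with whatever version Hadoop or another
// dependency pulls in at runtime.
assemblyShadeRules in assembly := Seq(
  ShadeRule.rename("com.google.common.**" -> "myproject.shaded.guava.@1").inAll
)

The costs noted above show up here too: every relocated package is another
copy of those classes carried around in the final jar.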

Re: time for Apache Spark 3.0?

Posted by Matei Zaharia <ma...@gmail.com>.
Java 9/10 support would be great to add as well.

Regarding Scala 2.12, I thought that supporting it would become easier if we change the Spark API and ABI slightly. Basically, it is of course possible to create an alternate source tree today, but it might be possible to share the same source files if we tweak some small things in the methods that are overloaded across Scala and Java. I don’t remember the exact details, but the idea was to reduce the total maintenance work needed at the cost of requiring users to recompile their apps.

I’m personally for moving to 3.0 because of the other things we can clean up as well, e.g. the default SQL dialect, Iterable stuff, and possibly dependency shading (a major pain point for lots of users). It’s also a chance to highlight Kubernetes, continuous processing and other features more if they become “GA”.

Matei

> On Apr 5, 2018, at 9:04 AM, Marco Gaido <ma...@gmail.com> wrote:
> 
> Hi all,
> 
> I also agree with Mark that we should add Java 9/10 support to an eventual Spark 3.0 release, because supporting Java 9 is not a trivial task since we are using some internal APIs for the memory management which changed: either we find a solution which works on both (but I am not sure it is feasible) or we have to switch between 2 implementations according to the Java version.
> So I'd rather avoid doing this in a non-major release.
> 
> Thanks,
> Marco
> 
> 
> 2018-04-05 17:35 GMT+02:00 Mark Hamstra <ma...@clearstorydata.com>:
> As with Sean, I'm not sure that this will require a new major version, but we should also be looking at Java 9 & 10 support -- particularly with regard to their better functionality in a containerized environment (memory limits from cgroups, not sysconf; support for cpusets). In that regard, we should also be looking at using the latest Scala 2.11.x maintenance release in current Spark branches.
> 
> On Thu, Apr 5, 2018 at 5:45 AM, Sean Owen <sr...@gmail.com> wrote:
> On Wed, Apr 4, 2018 at 6:20 PM Reynold Xin <rx...@databricks.com> wrote:
> The primary motivating factor IMO for a major version bump is to support Scala 2.12, which requires minor API breaking changes to Spark’s APIs. Similar to Spark 2.0, I think there are also opportunities for other changes that we know have been biting us for a long time but can’t be changed in feature releases (to be clear, I’m actually not sure they are all good ideas, but I’m writing them down as candidates for consideration):
> 
> IIRC from looking at this, it is possible to support 2.11 and 2.12 simultaneously. The cross-build already works now in 2.3.0. Barring some big change needed to get 2.12 fully working -- and that may be the case -- it nearly works that way now.
> 
> Compiling vs 2.11 and 2.12 does however result in some APIs that differ in byte code. However Scala itself isn't mutually compatible between 2.11 and 2.12 anyway; that's never been promised as compatible.
> 
> (Interesting question about what *Java* users should expect; they would see a difference in 2.11 vs 2.12 Spark APIs, but that has always been true.)
> 
> I don't disagree with shooting for Spark 3.0, just saying I don't know if 2.12 support requires moving to 3.0. But, Spark 3.0 could consider dropping 2.11 support if needed to make supporting 2.12 less painful.
> 
> 


---------------------------------------------------------------------
To unsubscribe e-mail: dev-unsubscribe@spark.apache.org


Re: time for Apache Spark 3.0?

Posted by Sean Owen <sr...@gmail.com>.
That certainly sounds beneficial, maybe to several other projects too. If
there's no downside and it takes away API issues, it seems like a win.

On Thu, Apr 19, 2018 at 5:28 AM Dean Wampler <de...@gmail.com> wrote:

> I spoke with Martin Odersky and Lightbend's Scala Team about the known API
> issue with method disambiguation. They offered to implement a small patch
> in a new release of Scala 2.12 to handle the issue without requiring a
> Spark API change. They would cut a 2.12.6 release for it. I'm told that
> Scala 2.13 should already handle the issue without modification (it's not
> yet released, to be clear). They can also offer feedback on updating the
> closure cleaner.
>
> So, this approach would support Scala 2.12 in Spark, but limited to
> 2.12.6+ and without the API change requirement, though the closure cleaner
> would still need updating. Hence, it could be done for Spark 2.x.
>
> Let me know if you want to pursue this approach.
>
> dean
>

Re: time for Apache Spark 3.0?

Posted by Dean Wampler <de...@gmail.com>.
I spoke with Martin Odersky and Lightbend's Scala Team about the known API
issue with method disambiguation. They offered to implement a small patch
in a new release of Scala 2.12 to handle the issue without requiring a
Spark API change. They would cut a 2.12.6 release for it. I'm told that
Scala 2.13 should already handle the issue without modification (it's not
yet released, to be clear). They can also offer feedback on updating the
closure cleaner.

So, this approach would support Scala 2.12 in Spark, but limited to
2.12.6+ and without the API change requirement, though the closure cleaner
would still need updating. Hence, it could be done for Spark 2.x.

Let me know if you want to pursue this approach.

dean




*Dean Wampler, Ph.D.*

*VP, Fast Data Engineering at Lightbend*
Author: Programming Scala, 2nd Edition
<http://shop.oreilly.com/product/0636920033073.do>, Fast Data Architectures
for Streaming Applications
<http://www.oreilly.com/data/free/fast-data-architectures-for-streaming-applications.csp>,
and other content from O'Reilly
@deanwampler <http://twitter.com/deanwampler>
http://polyglotprogramming.com
https://github.com/deanwampler

On Thu, Apr 5, 2018 at 8:13 PM, Marcelo Vanzin <va...@cloudera.com> wrote:

> On Thu, Apr 5, 2018 at 10:30 AM, Matei Zaharia <ma...@gmail.com>
> wrote:
> > Sorry, but just to be clear here, this is the 2.12 API issue:
> > https://issues.apache.org/jira/browse/SPARK-14643, with more details in this doc:
> > https://docs.google.com/document/d/1P_wmH3U356f079AYgSsN53HKixuNdxSEvo8nw_tgLgM/edit.
> >
> > Basically, if we are allowed to change Spark’s API a little to have only
> > one version of methods that are currently overloaded between Java and
> > Scala, we can get away with a single source tree for all Scala versions
> > and Java ABI compatibility against any type of Spark (whether using Scala
> > 2.11 or 2.12).
>
> Fair enough. To play devil's advocate, most of those methods seem to
> be marked "Experimental / Evolving", which could be used as a reason
> to change them for this purpose in a minor release.
>
> Not all of them are, though (e.g. foreach / foreachPartition are not
> experimental).
>
> --
> Marcelo
>
> ---------------------------------------------------------------------
> To unsubscribe e-mail: dev-unsubscribe@spark.apache.org
>
>

Re: time for Apache Spark 3.0?

Posted by Marcelo Vanzin <va...@cloudera.com>.
On Thu, Apr 5, 2018 at 10:30 AM, Matei Zaharia <ma...@gmail.com> wrote:
> Sorry, but just to be clear here, this is the 2.12 API issue: https://issues.apache.org/jira/browse/SPARK-14643, with more details in this doc: https://docs.google.com/document/d/1P_wmH3U356f079AYgSsN53HKixuNdxSEvo8nw_tgLgM/edit.
>
> Basically, if we are allowed to change Spark’s API a little to have only one version of methods that are currently overloaded between Java and Scala, we can get away with a single source tree for all Scala versions and Java ABI compatibility against any type of Spark (whether using Scala 2.11 or 2.12).

Fair enough. To play devil's advocate, most of those methods seem to
be marked "Experimental / Evolving", which could be used as a reason
to change them for this purpose in a minor release.

Not all of them are, though (e.g. foreach / foreachPartition are not
experimental).

-- 
Marcelo

---------------------------------------------------------------------
To unsubscribe e-mail: dev-unsubscribe@spark.apache.org


Re: time for Apache Spark 3.0?

Posted by Matei Zaharia <ma...@gmail.com>.
Oh, forgot to add, but splitting the source tree in Scala also creates a big maintenance burden for third-party libraries built on Spark. As Josh said on the JIRA:

"I think this is primarily going to be an issue for end users who want to use an existing source tree to cross-compile for Scala 2.10, 2.11, and 2.12. Thus the pain of the source incompatibility would mostly be felt by library/package maintainers but it can be worked around as long as there's at least some common subset which is source compatible across all of those versions.”

This means that all the data sources, ML algorithms, etc developed outside our source tree would have to do the same thing we do internally.

> On Apr 5, 2018, at 10:30 AM, Matei Zaharia <ma...@gmail.com> wrote:
> 
> Sorry, but just to be clear here, this is the 2.12 API issue: https://issues.apache.org/jira/browse/SPARK-14643, with more details in this doc: https://docs.google.com/document/d/1P_wmH3U356f079AYgSsN53HKixuNdxSEvo8nw_tgLgM/edit.
> 
> Basically, if we are allowed to change Spark’s API a little to have only one version of methods that are currently overloaded between Java and Scala, we can get away with a single source tree for all Scala versions and Java ABI compatibility against any type of Spark (whether using Scala 2.11 or 2.12). On the other hand, if we want to keep the API and ABI of the Spark 2.x branch, we’ll need a different source tree for Scala 2.12 with different copies of pretty large classes such as RDD, DataFrame and DStream, and Java users may have to change their code when linking against different versions of Spark.
> 
> This is of course only one of the possible ABI changes, but it is a considerable engineering effort, so we’d have to sign up for maintaining all these different source files. It seems kind of silly given that Scala 2.12 was released in 2016, so we’re doing all this work to keep ABI compatibility for Scala 2.11, which isn’t even that widely used any more for new projects. Also keep in mind that the next Spark release will probably take at least 3-4 months, so we’re talking about what people will be using in fall 2018.
> 
> Matei
> 
>> On Apr 5, 2018, at 10:13 AM, Marcelo Vanzin <va...@cloudera.com> wrote:
>> 
>> I remember seeing somewhere that Scala still has some issues with Java
>> 9/10 so that might be hard...
>> 
>> But on that topic, it might be better to shoot for Java 11
>> compatibility. 9 and 10, following the new release model, aren't
>> really meant to be long-term releases.
>> 
>> In general, agree with Sean here. Doesn't look like 2.12 support
>> requires unexpected API breakages. So unless there's a really good
>> reason to break / remove a bunch of existing APIs...
>> 
>> On Thu, Apr 5, 2018 at 9:04 AM, Marco Gaido <ma...@gmail.com> wrote:
>>> Hi all,
>>> 
>>> I also agree with Mark that we should add Java 9/10 support to an eventual
>>> Spark 3.0 release, because supporting Java 9 is not a trivial task since we
>>> are using some internal APIs for the memory management which changed: either
>>> we find a solution which works on both (but I am not sure it is feasible) or
>>> we have to switch between 2 implementations according to the Java version.
>>> So I'd rather avoid doing this in a non-major release.
>>> 
>>> Thanks,
>>> Marco
>>> 
>>> 
>>> 2018-04-05 17:35 GMT+02:00 Mark Hamstra <ma...@clearstorydata.com>:
>>>> 
>>>> As with Sean, I'm not sure that this will require a new major version, but
>>>> we should also be looking at Java 9 & 10 support -- particularly with regard
>>>> to their better functionality in a containerized environment (memory limits
>>>> from cgroups, not sysconf; support for cpusets). In that regard, we should
>>>> also be looking at using the latest Scala 2.11.x maintenance release in
>>>> current Spark branches.
>>>> 
>>>> On Thu, Apr 5, 2018 at 5:45 AM, Sean Owen <sr...@gmail.com> wrote:
>>>>> 
>>>>> On Wed, Apr 4, 2018 at 6:20 PM Reynold Xin <rx...@databricks.com> wrote:
>>>>>> 
>>>>>> The primary motivating factor IMO for a major version bump is to support
>>>>>> Scala 2.12, which requires minor API breaking changes to Spark’s APIs.
>>>>>> Similar to Spark 2.0, I think there are also opportunities for other changes
>>>>>> that we know have been biting us for a long time but can’t be changed in
>>>>>> feature releases (to be clear, I’m actually not sure they are all good
>>>>>> ideas, but I’m writing them down as candidates for consideration):
>>>>> 
>>>>> 
>>>>> IIRC from looking at this, it is possible to support 2.11 and 2.12
>>>>> simultaneously. The cross-build already works now in 2.3.0. Barring some big
>>>>> change needed to get 2.12 fully working -- and that may be the case -- it
>>>>> nearly works that way now.
>>>>> 
>>>>> Compiling vs 2.11 and 2.12 does however result in some APIs that differ
>>>>> in byte code. However Scala itself isn't mutually compatible between 2.11
>>>>> and 2.12 anyway; that's never been promised as compatible.
>>>>> 
>>>>> (Interesting question about what *Java* users should expect; they would
>>>>> see a difference in 2.11 vs 2.12 Spark APIs, but that has always been true.)
>>>>> 
>>>>> I don't disagree with shooting for Spark 3.0, just saying I don't know if
>>>>> 2.12 support requires moving to 3.0. But, Spark 3.0 could consider dropping
>>>>> 2.11 support if needed to make supporting 2.12 less painful.
>>>> 
>>>> 
>>> 
>> 
>> 
>> 
>> -- 
>> Marcelo
>> 
>> ---------------------------------------------------------------------
>> To unsubscribe e-mail: dev-unsubscribe@spark.apache.org
>> 
> 


---------------------------------------------------------------------
To unsubscribe e-mail: dev-unsubscribe@spark.apache.org


Re: time for Apache Spark 3.0?

Posted by Matei Zaharia <ma...@gmail.com>.
Sorry, but just to be clear here, this is the 2.12 API issue: https://issues.apache.org/jira/browse/SPARK-14643, with more details in this doc: https://docs.google.com/document/d/1P_wmH3U356f079AYgSsN53HKixuNdxSEvo8nw_tgLgM/edit.

Basically, if we are allowed to change Spark’s API a little to have only one version of methods that are currently overloaded between Java and Scala, we can get away with a single source tree for all Scala versions and Java ABI compatibility against any type of Spark (whether using Scala 2.11 or 2.12). On the other hand, if we want to keep the API and ABI of the Spark 2.x branch, we’ll need a different source tree for Scala 2.12 with different copies of pretty large classes such as RDD, DataFrame and DStream, and Java users may have to change their code when linking against different versions of Spark.
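
For readers who have not looked at SPARK-14643, here is a toy sketch of the
overload problem. FakeRDD and VoidFunction are stand-ins invented for the
example, not Spark classes, and the 2.12 behavior described in the comments
is the state of the Scala releases current at the time of this thread:

// A stand-in for a Java-facing functional interface (not Spark's actual class).
trait VoidFunction[T] { def call(t: T): Unit }

// A stand-in for an RDD-like class carrying the Scala/Java overload pair.
class FakeRDD[T](data: Seq[T]) {
  def foreach(f: T => Unit): Unit = data.foreach(f)              // Scala-facing overload
  def foreach(f: VoidFunction[T]): Unit = data.foreach(f.call)   // Java-facing overload
}

object OverloadSketch extends App {
  val rdd = new FakeRDD(Seq(1, 2, 3))
  // On Scala 2.11 only the T => Unit overload applies, so this compiles fine.
  // On Scala 2.12 the lambda can also satisfy the SAM type VoidFunction[T],
  // so the call can be reported as ambiguous -- which is why collapsing each
  // such pair to one method (or the compiler-side fix discussed elsewhere in
  // this thread) is attractive.
  rdd.foreach((x: Int) => println(x))
}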

This is of course only one of the possible ABI changes, but it is a considerable engineering effort, so we’d have to sign up for maintaining all these different source files. It seems kind of silly given that Scala 2.12 was released in 2016, so we’re doing all this work to keep ABI compatibility for Scala 2.11, which isn’t even that widely used any more for new projects. Also keep in mind that the next Spark release will probably take at least 3-4 months, so we’re talking about what people will be using in fall 2018.

Matei

> On Apr 5, 2018, at 10:13 AM, Marcelo Vanzin <va...@cloudera.com> wrote:
> 
> I remember seeing somewhere that Scala still has some issues with Java
> 9/10 so that might be hard...
> 
> But on that topic, it might be better to shoot for Java 11
> compatibility. 9 and 10, following the new release model, aren't
> really meant to be long-term releases.
> 
> In general, agree with Sean here. Doesn't look like 2.12 support
> requires unexpected API breakages. So unless there's a really good
> reason to break / remove a bunch of existing APIs...
> 
> On Thu, Apr 5, 2018 at 9:04 AM, Marco Gaido <ma...@gmail.com> wrote:
>> Hi all,
>> 
>> I also agree with Mark that we should add Java 9/10 support to an eventual
>> Spark 3.0 release, because supporting Java 9 is not a trivial task since we
>> are using some internal APIs for the memory management which changed: either
>> we find a solution which works on both (but I am not sure it is feasible) or
>> we have to switch between 2 implementations according to the Java version.
>> So I'd rather avoid doing this in a non-major release.
>> 
>> Thanks,
>> Marco
>> 
>> 
>> 2018-04-05 17:35 GMT+02:00 Mark Hamstra <ma...@clearstorydata.com>:
>>> 
>>> As with Sean, I'm not sure that this will require a new major version, but
>>> we should also be looking at Java 9 & 10 support -- particularly with regard
>>> to their better functionality in a containerized environment (memory limits
>>> from cgroups, not sysconf; support for cpusets). In that regard, we should
>>> also be looking at using the latest Scala 2.11.x maintenance release in
>>> current Spark branches.
>>> 
>>> On Thu, Apr 5, 2018 at 5:45 AM, Sean Owen <sr...@gmail.com> wrote:
>>>> 
>>>> On Wed, Apr 4, 2018 at 6:20 PM Reynold Xin <rx...@databricks.com> wrote:
>>>>> 
>>>>> The primary motivating factor IMO for a major version bump is to support
>>>>> Scala 2.12, which requires minor API breaking changes to Spark’s APIs.
>>>>> Similar to Spark 2.0, I think there are also opportunities for other changes
>>>>> that we know have been biting us for a long time but can’t be changed in
>>>>> feature releases (to be clear, I’m actually not sure they are all good
>>>>> ideas, but I’m writing them down as candidates for consideration):
>>>> 
>>>> 
>>>> IIRC from looking at this, it is possible to support 2.11 and 2.12
>>>> simultaneously. The cross-build already works now in 2.3.0. Barring some big
>>>> change needed to get 2.12 fully working -- and that may be the case -- it
>>>> nearly works that way now.
>>>> 
>>>> Compiling vs 2.11 and 2.12 does however result in some APIs that differ
>>>> in byte code. However Scala itself isn't mutually compatible between 2.11
>>>> and 2.12 anyway; that's never been promised as compatible.
>>>> 
>>>> (Interesting question about what *Java* users should expect; they would
>>>> see a difference in 2.11 vs 2.12 Spark APIs, but that has always been true.)
>>>> 
>>>> I don't disagree with shooting for Spark 3.0, just saying I don't know if
>>>> 2.12 support requires moving to 3.0. But, Spark 3.0 could consider dropping
>>>> 2.11 support if needed to make supporting 2.12 less painful.
>>> 
>>> 
>> 
> 
> 
> 
> -- 
> Marcelo
> 
> ---------------------------------------------------------------------
> To unsubscribe e-mail: dev-unsubscribe@spark.apache.org
> 


---------------------------------------------------------------------
To unsubscribe e-mail: dev-unsubscribe@spark.apache.org


Re: time for Apache Spark 3.0?

Posted by Marcelo Vanzin <va...@cloudera.com>.
I remember seeing somewhere that Scala still has some issues with Java
9/10 so that might be hard...

But on that topic, it might be better to shoot for Java 11
compatibility. 9 and 10, following the new release model, aren't
really meant to be long-term releases.

In general, agree with Sean here. Doesn't look like 2.12 support
requires unexpected API breakages. So unless there's a really good
reason to break / remove a bunch of existing APIs...

On Thu, Apr 5, 2018 at 9:04 AM, Marco Gaido <ma...@gmail.com> wrote:
> Hi all,
>
> I also agree with Mark that we should add Java 9/10 support to an eventual
> Spark 3.0 release, because supporting Java 9 is not a trivial task since we
> are using some internal APIs for the memory management which changed: either
> we find a solution which works on both (but I am not sure it is feasible) or
> we have to switch between 2 implementations according to the Java version.
> So I'd rather avoid doing this in a non-major release.
>
> Thanks,
> Marco
>
>
> 2018-04-05 17:35 GMT+02:00 Mark Hamstra <ma...@clearstorydata.com>:
>>
>> As with Sean, I'm not sure that this will require a new major version, but
>> we should also be looking at Java 9 & 10 support -- particularly with regard
>> to their better functionality in a containerized environment (memory limits
>> from cgroups, not sysconf; support for cpusets). In that regard, we should
>> also be looking at using the latest Scala 2.11.x maintenance release in
>> current Spark branches.
>>
>> On Thu, Apr 5, 2018 at 5:45 AM, Sean Owen <sr...@gmail.com> wrote:
>>>
>>> On Wed, Apr 4, 2018 at 6:20 PM Reynold Xin <rx...@databricks.com> wrote:
>>>>
>>>> The primary motivating factor IMO for a major version bump is to support
>>>> Scala 2.12, which requires minor API breaking changes to Spark’s APIs.
>>>> Similar to Spark 2.0, I think there are also opportunities for other changes
>>>> that we know have been biting us for a long time but can’t be changed in
>>>> feature releases (to be clear, I’m actually not sure they are all good
>>>> ideas, but I’m writing them down as candidates for consideration):
>>>
>>>
>>> IIRC from looking at this, it is possible to support 2.11 and 2.12
>>> simultaneously. The cross-build already works now in 2.3.0. Barring some big
>>> change needed to get 2.12 fully working -- and that may be the case -- it
>>> nearly works that way now.
>>>
>>> Compiling vs 2.11 and 2.12 does however result in some APIs that differ
>>> in byte code. However Scala itself isn't mutually compatible between 2.11
>>> and 2.12 anyway; that's never been promised as compatible.
>>>
>>> (Interesting question about what *Java* users should expect; they would
>>> see a difference in 2.11 vs 2.12 Spark APIs, but that has always been true.)
>>>
>>> I don't disagree with shooting for Spark 3.0, just saying I don't know if
>>> 2.12 support requires moving to 3.0. But, Spark 3.0 could consider dropping
>>> 2.11 support if needed to make supporting 2.12 less painful.
>>
>>
>



-- 
Marcelo

---------------------------------------------------------------------
To unsubscribe e-mail: dev-unsubscribe@spark.apache.org


Re: time for Apache Spark 3.0?

Posted by Marco Gaido <ma...@gmail.com>.
Hi all,

I also agree with Mark that we should add Java 9/10 support to an eventual
Spark 3.0 release. Supporting Java 9 is not a trivial task, since we are
using some internal APIs for memory management which changed in Java 9:
either we find a solution which works on both Java versions (but I am not
sure that is feasible) or we have to switch between two implementations
according to the Java version.
So I'd rather avoid doing this in a non-major release.

Thanks,
Marco
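
To make the internal-API point concrete, here is a minimal, purely
illustrative sketch of the kind of JDK-internal access in question --
reflectively grabbing sun.misc.Unsafe for off-heap memory. It mirrors the
general pattern only, not Spark's actual memory-management code:

import java.lang.reflect.Field

object UnsafeSketch {
  // Obtain the Unsafe singleton reflectively; on Java 9+ this is exactly the
  // kind of access that triggers illegal-access warnings or fails outright.
  private val unsafe: sun.misc.Unsafe = {
    val f: Field = classOf[sun.misc.Unsafe].getDeclaredField("theUnsafe")
    f.setAccessible(true)
    f.get(null).asInstanceOf[sun.misc.Unsafe]
  }

  def main(args: Array[String]): Unit = {
    val addr = unsafe.allocateMemory(8L) // raw off-heap allocation, outside the GC heap
    unsafe.putLong(addr, 42L)
    println(unsafe.getLong(addr))        // prints 42
    unsafe.freeMemory(addr)
  }
}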


2018-04-05 17:35 GMT+02:00 Mark Hamstra <ma...@clearstorydata.com>:

> As with Sean, I'm not sure that this will require a new major version, but
> we should also be looking at Java 9 & 10 support -- particularly with
> regard to their better functionality in a containerized environment (memory
> limits from cgroups, not sysconf; support for cpusets). In that regard, we
> should also be looking at using the latest Scala 2.11.x maintenance release
> in current Spark branches.
>
> On Thu, Apr 5, 2018 at 5:45 AM, Sean Owen <sr...@gmail.com> wrote:
>
>> On Wed, Apr 4, 2018 at 6:20 PM Reynold Xin <rx...@databricks.com> wrote:
>>
>>> The primary motivating factor IMO for a major version bump is to support
>>> Scala 2.12, which requires minor API breaking changes to Spark’s APIs.
>>> Similar to Spark 2.0, I think there are also opportunities for other
>>> changes that we know have been biting us for a long time but can’t be
>>> changed in feature releases (to be clear, I’m actually not sure they are
>>> all good ideas, but I’m writing them down as candidates for consideration):
>>>
>>
>> IIRC from looking at this, it is possible to support 2.11 and 2.12
>> simultaneously. The cross-build already works now in 2.3.0. Barring some
>> big change needed to get 2.12 fully working -- and that may be the case --
>> it nearly works that way now.
>>
>> Compiling vs 2.11 and 2.12 does however result in some APIs that differ
>> in byte code. However Scala itself isn't mutually compatible between 2.11
>> and 2.12 anyway; that's never been promised as compatible.
>>
>> (Interesting question about what *Java* users should expect; they would
>> see a difference in 2.11 vs 2.12 Spark APIs, but that has always been true.)
>>
>> I don't disagree with shooting for Spark 3.0, just saying I don't know if
>> 2.12 support requires moving to 3.0. But, Spark 3.0 could consider dropping
>> 2.11 support if needed to make supporting 2.12 less painful.
>>
>
>

Re: time for Apache Spark 3.0?

Posted by Mark Hamstra <ma...@clearstorydata.com>.
As with Sean, I'm not sure that this will require a new major version, but
we should also be looking at Java 9 & 10 support -- particularly with
regard to their better functionality in a containerized environment (memory
limits from cgroups, not sysconf; support for cpusets). In that regard, we
should also be looking at using the latest Scala 2.11.x maintenance release
in current Spark branches.

On Thu, Apr 5, 2018 at 5:45 AM, Sean Owen <sr...@gmail.com> wrote:

> On Wed, Apr 4, 2018 at 6:20 PM Reynold Xin <rx...@databricks.com> wrote:
>
>> The primary motivating factor IMO for a major version bump is to support
>> Scala 2.12, which requires minor API breaking changes to Spark’s APIs.
>> Similar to Spark 2.0, I think there are also opportunities for other
>> changes that we know have been biting us for a long time but can’t be
>> changed in feature releases (to be clear, I’m actually not sure they are
>> all good ideas, but I’m writing them down as candidates for consideration):
>>
>
> IIRC from looking at this, it is possible to support 2.11 and 2.12
> simultaneously. The cross-build already works now in 2.3.0. Barring some
> big change needed to get 2.12 fully working -- and that may be the case --
> it nearly works that way now.
>
> Compiling vs 2.11 and 2.12 does however result in some APIs that differ in
> byte code. However Scala itself isn't mutually compatible between 2.11 and
> 2.12 anyway; that's never been promised as compatible.
>
> (Interesting question about what *Java* users should expect; they would
> see a difference in 2.11 vs 2.12 Spark APIs, but that has always been true.)
>
> I don't disagree with shooting for Spark 3.0, just saying I don't know if
> 2.12 support requires moving to 3.0. But, Spark 3.0 could consider dropping
> 2.11 support if needed to make supporting 2.12 less painful.
>

Re: time for Apache Spark 3.0?

Posted by Sean Owen <sr...@gmail.com>.
On Wed, Apr 4, 2018 at 6:20 PM Reynold Xin <rx...@databricks.com> wrote:

> The primary motivating factor IMO for a major version bump is to support
> Scala 2.12, which requires minor API breaking changes to Spark’s APIs.
> Similar to Spark 2.0, I think there are also opportunities for other
> changes that we know have been biting us for a long time but can’t be
> changed in feature releases (to be clear, I’m actually not sure they are
> all good ideas, but I’m writing them down as candidates for consideration):
>

IIRC from looking at this, it is possible to support 2.11 and 2.12
simultaneously. The cross-build already works now in 2.3.0. Barring some
big change needed to get 2.12 fully working -- and that may be the case --
it nearly works that way now.

Compiling vs 2.11 and 2.12 does however result in some APIs that differ in
byte code. However Scala itself isn't mutually compatible between 2.11 and
2.12 anyway; that's never been promised as compatible.

(Interesting question about what *Java* users should expect; they would see
a difference in 2.11 vs 2.12 Spark APIs, but that has always been true.)

I don't disagree with shooting for Spark 3.0, just saying I don't know if
2.12 support requires moving to 3.0. But, Spark 3.0 could consider dropping
2.11 support if needed to make supporting 2.12 less painful.

Re: time for Apache Spark 3.0?

Posted by Reynold Xin <rx...@databricks.com>.
I definitely agree we shouldn't make dsv2 stable in the next release.

On Thu, Sep 6, 2018 at 9:48 AM Ryan Blue <rb...@netflix.com> wrote:

> I definitely support moving to 3.0 to remove deprecations and update
> dependencies.
>
> For the v2 work, we know that there will be major API changes and
> standardization of behavior from the new logical plans going into the next
> release. I think it is a safe bet that this isn’t going to be completely
> done for the next release, so it will still be experimental or unstable for
> 3.0. I also expect that there will be some things that we want to
> deprecate. Ideally, that deprecation could happen before a major release so
> we can remove it.
>
> I don’t have a problem releasing 3.0 with an unstable v2 API or targeting
> 4.0 to remove behavior and APIs replaced by v2. But, I want to make sure we
> consider it when deciding what the next release should be.
>
> It is probably better to release 3.0 now because it isn’t clear when the
> v2 API will become stable. And if we choose to release 3.0 next, we should
> *not* aim to stabilize v2 for that release. Not that we shouldn’t try to
> make it stable as soon as possible, I just think that it is unlikely to
> happen in time and we should not rush to claim it is stable.
>
> rb
>
> On Thu, Sep 6, 2018 at 9:31 AM Sean Owen <sr...@gmail.com> wrote:
>
>> I think this doesn't necessarily mean 3.0 is coming soon (thoughts on
>> timing? 6 months?) but simply next. Do you mean you'd prefer that change to
>> happen before 3.x? if it's a significant change, seems reasonable for a
>> major version bump rather than minor. Is the concern that tying it to 3.0
>> means you have to take a major version update to get it?
>>
>> I generally support moving on to 3.x so we can also jettison a lot of
>> older dependencies, code, fix some long standing issues, etc.
>>
>> (BTW Scala 2.12 support, mentioned in the OP, will go in for 2.4)
>>
>> On Thu, Sep 6, 2018 at 9:10 AM Ryan Blue <rb...@netflix.com.invalid>
>> wrote:
>>
>>> My concern is that the v2 data source API is still evolving and not very
>>> close to stable. I had hoped to have stabilized the API and behaviors for a
>>> 3.0 release. But we could also wait on that for a 4.0 release, depending on
>>> when we think that will be.
>>>
>>> Unless there is a pressing need to move to 3.0 for some other area, I
>>> think it would be better for the v2 sources to have a 2.5 release.
>>>
>>> On Thu, Sep 6, 2018 at 8:59 AM Xiao Li <ga...@gmail.com> wrote:
>>>
>>>> Yesterday, the 2.4 branch was created. Based on the above discussion, I
>>>> think we can bump the master branch to 3.0.0-SNAPSHOT. Any concern?
>>>>
>>>>
>
> --
> Ryan Blue
> Software Engineer
> Netflix
>

Re: time for Apache Spark 3.0?

Posted by Ryan Blue <rb...@netflix.com.INVALID>.
I definitely support moving to 3.0 to remove deprecations and update
dependencies.

For the v2 work, we know that there will be major API changes and
standardization of behavior from the new logical plans going into the next
release. I think it is a safe bet that this isn’t going to be completely
done for the next release, so it will still be experimental or unstable for
3.0. I also expect that there will be some things that we want to
deprecate. Ideally, that deprecation could happen before a major release so
we can remove it.

I don’t have a problem releasing 3.0 with an unstable v2 API or targeting
4.0 to remove behavior and APIs replaced by v2. But, I want to make sure we
consider it when deciding what the next release should be.

It is probably better to release 3.0 now because it isn’t clear when the v2
API will become stable. And if we choose to release 3.0 next, we should
*not* aim to stabilize v2 for that release. Not that we shouldn’t try to
make it stable as soon as possible, I just think that it is unlikely to
happen in time and we should not rush to claim it is stable.

rb

On Thu, Sep 6, 2018 at 9:31 AM Sean Owen <sr...@gmail.com> wrote:

> I think this doesn't necessarily mean 3.0 is coming soon (thoughts on
> timing? 6 months?) but simply next. Do you mean you'd prefer that change to
> happen before 3.x? if it's a significant change, seems reasonable for a
> major version bump rather than minor. Is the concern that tying it to 3.0
> means you have to take a major version update to get it?
>
> I generally support moving on to 3.x so we can also jettison a lot of
> older dependencies, code, fix some long standing issues, etc.
>
> (BTW Scala 2.12 support, mentioned in the OP, will go in for 2.4)
>
> On Thu, Sep 6, 2018 at 9:10 AM Ryan Blue <rb...@netflix.com.invalid>
> wrote:
>
>> My concern is that the v2 data source API is still evolving and not very
>> close to stable. I had hoped to have stabilized the API and behaviors for a
>> 3.0 release. But we could also wait on that for a 4.0 release, depending on
>> when we think that will be.
>>
>> Unless there is a pressing need to move to 3.0 for some other area, I
>> think it would be better for the v2 sources to have a 2.5 release.
>>
>> On Thu, Sep 6, 2018 at 8:59 AM Xiao Li <ga...@gmail.com> wrote:
>>
>>> Yesterday, the 2.4 branch was created. Based on the above discussion, I
>>> think we can bump the master branch to 3.0.0-SNAPSHOT. Any concern?
>>>
>>>

-- 
Ryan Blue
Software Engineer
Netflix

Re: time for Apache Spark 3.0?

Posted by Matt Cheah <mc...@palantir.com>.
I just added the label to https://issues.apache.org/jira/browse/SPARK-25908. Unsure if there are any others. I’ll look through the tickets and see if there are any that are missing the label.

 

-Matt Cheah

 

From: Sean Owen <sr...@apache.org>
Date: Tuesday, November 13, 2018 at 12:09 PM
To: Matt Cheah <mc...@palantir.com>
Cc: Sean Owen <sr...@apache.org>, Vinoo Ganesh <vg...@palantir.com>, dev <de...@spark.apache.org>
Subject: Re: time for Apache Spark 3.0?

 

As far as I know any JIRA that has implications for users is tagged this way but I haven't examined all of them. All that are going in for 3.0 should have it as Fix Version . Most changes won't have a user visible impact. Do you see any that seem to need the tag? Call em out or even fix them by adding the tag and proposed release notes. 

 

On Tue, Nov 13, 2018, 11:49 AM Matt Cheah <mcheah@palantir.com> wrote:

The release-notes label on JIRA sounds good. Can we make it a point to have that done retroactively now, and then moving forward?

On 11/12/18, 4:01 PM, "Sean Owen" <sr...@apache.org> wrote:

    My non-definitive takes --

    I would personally like to remove all deprecated methods for Spark 3.
    I started by removing 'old' deprecated methods in that commit. Things
    deprecated in 2.4 are maybe less clear, whether they should be removed

    Everything's fair game for removal or change in a major release. So
    far some items in discussion seem to be Scala 2.11 support, Python 2
    support, R support before 3.4. I don't know about other APIs.

    Generally, take a look at JIRA for items targeted at version 3.0. Not
    everything targeted for 3.0 is going in, but ones from committers are
    more likely than others. Breaking changes ought to be tagged
    'release-notes' with a description of the change. The release itself
    has a migration guide that's being updated as we go.


    On Mon, Nov 12, 2018 at 5:49 PM Matt Cheah <mc...@palantir.com> wrote:
    >
    > I wanted to clarify what categories of APIs are eligible to be broken in Spark 3.0. Specifically:
    >
    >
    >
    > Are we removing all deprecated methods? If we’re only removing some subset of deprecated methods, what is that subset? I see a bunch were removed in https://github.com/apache/spark/pull/22921 for example. Are we only committed to removing methods that were deprecated in some Spark version and earlier?
    > Aside from removing support for Scala 2.11, what other kinds of (non-experimental and non-evolving) APIs are eligible to be broken?
    > Is there going to be a way to track the current list of all proposed breaking changes / JIRA tickets? Perhaps we can include it in the JIRA ticket that can be filtered down to somehow?
    >

    ---------------------------------------------------------------------
    To unsubscribe e-mail: dev-unsubscribe@spark.apache.org



Re: time for Apache Spark 3.0?

Posted by Sean Owen <sr...@apache.org>.
As far as I know any JIRA that has implications for users is tagged this
way but I haven't examined all of them. All that are going in for 3.0
should have it as Fix Version . Most changes won't have a user visible
impact. Do you see any that seem to need the tag? Call em out or even fix
them by adding the tag and proposed release notes.

On Tue, Nov 13, 2018, 11:49 AM Matt Cheah <mcheah@palantir.com> wrote:

> The release-notes label on JIRA sounds good. Can we make it a point to
> have that done retroactively now, and then moving forward?
>
> On 11/12/18, 4:01 PM, "Sean Owen" <sr...@apache.org> wrote:
>
>     My non-definitive takes --
>
>     I would personally like to remove all deprecated methods for Spark 3.
>     I started by removing 'old' deprecated methods in that commit. Things
>     deprecated in 2.4 are maybe less clear, whether they should be removed
>
>     Everything's fair game for removal or change in a major release. So
>     far some items in discussion seem to be Scala 2.11 support, Python 2
>     support, R support before 3.4. I don't know about other APIs.
>
>     Generally, take a look at JIRA for items targeted at version 3.0. Not
>     everything targeted for 3.0 is going in, but ones from committers are
>     more likely than others. Breaking changes ought to be tagged
>     'release-notes' with a description of the change. The release itself
>     has a migration guide that's being updated as we go.
>
>
>     On Mon, Nov 12, 2018 at 5:49 PM Matt Cheah <mc...@palantir.com>
> wrote:
>     >
>     > I wanted to clarify what categories of APIs are eligible to be
> broken in Spark 3.0. Specifically:
>     >
>     >
>     >
>     > Are we removing all deprecated methods? If we’re only removing some
> subset of deprecated methods, what is that subset? I see a bunch were
> removed in
> https://github.com/apache/spark/pull/22921
> for example. Are we only committed to removing methods that were deprecated
> in some Spark version and earlier?
>     > Aside from removing support for Scala 2.11, what other kinds of
> (non-experimental and non-evolving) APIs are eligible to be broken?
>     > Is there going to be a way to track the current list of all proposed
> breaking changes / JIRA tickets? Perhaps we can include it in the JIRA
> ticket that can be filtered down to somehow?
>     >
>
>     ---------------------------------------------------------------------
>     To unsubscribe e-mail: dev-unsubscribe@spark.apache.org
>
>
>

Re: time for Apache Spark 3.0?

Posted by Matt Cheah <mc...@palantir.com>.
The release-notes label on JIRA sounds good. Can we make it a point to have that done retroactively now, and then moving forward?

On 11/12/18, 4:01 PM, "Sean Owen" <sr...@apache.org> wrote:

    My non-definitive takes --
    
    I would personally like to remove all deprecated methods for Spark 3.
    I started by removing 'old' deprecated methods in that commit. Things
    deprecated in 2.4 are maybe less clear, whether they should be removed
    
    Everything's fair game for removal or change in a major release. So
    far some items in discussion seem to be Scala 2.11 support, Python 2
    support, R support before 3.4. I don't know about other APIs.
    
    Generally, take a look at JIRA for items targeted at version 3.0. Not
    everything targeted for 3.0 is going in, but ones from committers are
    more likely than others. Breaking changes ought to be tagged
    'release-notes' with a description of the change. The release itself
    has a migration guide that's being updated as we go.
    
    
    On Mon, Nov 12, 2018 at 5:49 PM Matt Cheah <mc...@palantir.com> wrote:
    >
    > I wanted to clarify what categories of APIs are eligible to be broken in Spark 3.0. Specifically:
    >
    >
    >
    > Are we removing all deprecated methods? If we’re only removing some subset of deprecated methods, what is that subset? I see a bunch were removed in https://github.com/apache/spark/pull/22921 for example. Are we only committed to removing methods that were deprecated in some Spark version and earlier?
    > Aside from removing support for Scala 2.11, what other kinds of (non-experimental and non-evolving) APIs are eligible to be broken?
    > Is there going to be a way to track the current list of all proposed breaking changes / JIRA tickets? Perhaps we can include it in the JIRA ticket that can be filtered down to somehow?
    >
    
    ---------------------------------------------------------------------
    To unsubscribe e-mail: dev-unsubscribe@spark.apache.org
    
    

Re: time for Apache Spark 3.0?

Posted by Sean Owen <sr...@apache.org>.
My non-definitive takes --

I would personally like to remove all deprecated methods for Spark 3.
I started by removing 'old' deprecated methods in that commit. Things
deprecated in 2.4 are maybe less clear, whether they should be removed

Everything's fair game for removal or change in a major release. So
far some items in discussion seem to be Scala 2.11 support, Python 2
support, R support before 3.4. I don't know about other APIs.

Generally, take a look at JIRA for items targeted at version 3.0. Not
everything targeted for 3.0 is going in, but ones from committers are
more likely than others. Breaking changes ought to be tagged
'release-notes' with a description of the change. The release itself
has a migration guide that's being updated as we go.


On Mon, Nov 12, 2018 at 5:49 PM Matt Cheah <mc...@palantir.com> wrote:
>
> I wanted to clarify what categories of APIs are eligible to be broken in Spark 3.0. Specifically:
>
>
>
> Are we removing all deprecated methods? If we’re only removing some subset of deprecated methods, what is that subset? I see a bunch were removed in https://github.com/apache/spark/pull/22921 for example. Are we only committed to removing methods that were deprecated in some Spark version and earlier?
> Aside from removing support for Scala 2.11, what other kinds of (non-experimental and non-evolving) APIs are eligible to be broken?
> Is there going to be a way to track the current list of all proposed breaking changes / JIRA tickets? Perhaps we can include it in the JIRA ticket that can be filtered down to somehow?
>

---------------------------------------------------------------------
To unsubscribe e-mail: dev-unsubscribe@spark.apache.org


Re: time for Apache Spark 3.0?

Posted by Reynold Xin <rx...@databricks.com>.
All API removal and deprecation JIRAs should be tagged "releasenotes", so
we can reference them when we build release notes. I don't know if
everybody is still following that practice, but it'd be great to do that.
Since we don't have that many PRs, we should still be able to retroactively
tag.

We can also add a new tag for API changes, but I feel at this stage it
might be easier to just use "releasenotes".


On Mon, Nov 12, 2018 at 3:49 PM Matt Cheah <mc...@palantir.com> wrote:

> I wanted to clarify what categories of APIs are eligible to be broken in
> Spark 3.0. Specifically:
>
>
>
>    - Are we removing all deprecated methods? If we’re only removing some
>    subset of deprecated methods, what is that subset? I see a bunch were
>    removed in https://github.com/apache/spark/pull/22921 for example. Are
>    we only committed to removing methods that were deprecated in some Spark
>    version and earlier?
>    - Aside from removing support for Scala 2.11, what other kinds of
>    (non-experimental and non-evolving) APIs are eligible to be broken?
>    - Is there going to be a way to track the current list of all proposed
>    breaking changes / JIRA tickets? Perhaps we can include it in the JIRA
>    ticket that can be filtered down to somehow?
>
>
>
> Thanks,
>
>
>
> -Matt Cheah
>
> *From: *Vinoo Ganesh <vg...@palantir.com>
> *Date: *Monday, November 12, 2018 at 2:48 PM
> *To: *Reynold Xin <rx...@databricks.com>
> *Cc: *Xiao Li <ga...@gmail.com>, Matei Zaharia <
> matei.zaharia@gmail.com>, Ryan Blue <rb...@netflix.com>, Mark Hamstra <
> mark@clearstorydata.com>, dev <de...@spark.apache.org>
> *Subject: *Re: time for Apache Spark 3.0?
>
>
>
> Makes sense, thanks Reynold.
>
>
>
> *From: *Reynold Xin <rx...@databricks.com>
> *Date: *Monday, November 12, 2018 at 16:57
> *To: *Vinoo Ganesh <vg...@palantir.com>
> *Cc: *Xiao Li <ga...@gmail.com>, Matei Zaharia <
> matei.zaharia@gmail.com>, Ryan Blue <rb...@netflix.com>, Mark Hamstra <
> mark@clearstorydata.com>, dev <de...@spark.apache.org>
> *Subject: *Re: time for Apache Spark 3.0?
>
>
>
> Master branch now tracks the 3.0.0-SNAPSHOT version, so the next one will be
> 3.0. In terms of timeline, unless we change anything specifically, Spark
> feature releases are on a 6-mo cadence. Spark 2.4 was just released last
> week, so 3.0 will be roughly 6 months from now.
>
>
>
> On Mon, Nov 12, 2018 at 1:54 PM Vinoo Ganesh <vg...@palantir.com> wrote:
>
> Quickly following up on this – is there a target date for when Spark 3.0
> may be released and/or a list of the likely api breaks that are
> anticipated?
>
>
>
> *From: *Xiao Li <ga...@gmail.com>
> *Date: *Saturday, September 29, 2018 at 02:09
> *To: *Reynold Xin <rx...@databricks.com>
> *Cc: *Matei Zaharia <ma...@gmail.com>, Ryan Blue <
> rblue@netflix.com>, Mark Hamstra <ma...@clearstorydata.com>, "
> user@spark.apache.org" <de...@spark.apache.org>
> *Subject: *Re: time for Apache Spark 3.0?
>
>
>
> Yes. We should create a SPIP for each major breaking change.
>
>
>
> Reynold Xin <rx...@databricks.com> wrote on Fri, Sep 28, 2018 at 11:05 PM:
>
> i think we should create spips for some of them, since they are pretty
> large ... i can create some tickets to start with
>
>
> --
>
> excuse the brevity and lower case due to wrist injury
>
>
>
>
>
> On Fri, Sep 28, 2018 at 11:01 PM Xiao Li <ga...@gmail.com> wrote:
>
> Based on the above discussions, we have a "rough consensus" that the next
> release will be 3.0. Now, we can start working on the API breaking changes
> (e.g., the ones mentioned in the original email from Reynold).
>
>
>
> Cheers,
>
>
>
> Xiao
>
>
>
> Matei Zaharia <ma...@gmail.com> wrote on Thu, Sep 6, 2018 at 2:21 PM:
>
> Yes, you can start with Unstable and move to Evolving and Stable when
> needed. We’ve definitely had experimental features that changed across
> maintenance releases when they were well-isolated. If your change risks
> breaking stuff in stable components of Spark though, then it probably won’t
> be suitable for that.
>
> > On Sep 6, 2018, at 1:49 PM, Ryan Blue <rb...@netflix.com.INVALID> wrote:
> >
> > I meant flexibility beyond the point releases. I think what Reynold was
> suggesting was getting v2 code out more often than the point releases every
> 6 months. An Evolving API can change in point releases, but maybe we should
> move v2 to Unstable so it can change more often? I don't really see another
> way to get changes out more often.
> >
> > On Thu, Sep 6, 2018 at 11:07 AM Mark Hamstra <ma...@clearstorydata.com>
> wrote:
> > Yes, that is why we have these annotations in the code and the
> corresponding labels appearing in the API documentation: https://github.com/apache/spark/blob/master/common/tags/src/main/java/org/apache/spark/annotation/InterfaceStability.java
> >
> > As long as it is properly annotated, we can change or even eliminate an
> API method before the next major release. And frankly, we shouldn't be
> contemplating bringing in the DS v2 API (and, I'd argue, any new API)
> without such an annotation. There is just too much risk of not getting
> everything right before we see the results of the new API being more widely
> used, and too much cost in maintaining until the next major release
> something that we come to regret for us to create new API in a fully frozen
> state.
> >
> >
> > On Thu, Sep 6, 2018 at 9:49 AM Ryan Blue <rb...@netflix.com.invalid>
> wrote:
> > It would be great to get more features out incrementally. For
> experimental features, do we have more relaxed constraints?
> >
> > On Thu, Sep 6, 2018 at 9:47 AM Reynold Xin <rx...@databricks.com> wrote:
> > +1 on 3.0
> >
> > Dsv2 stable can still evolve across major releases. DataFrame,
> Dataset, dsv1 and a lot of other major features all were developed
> throughout the 1.x and 2.x lines.
> >
> > I do want to explore ways for us to get dsv2 incremental changes out
> there more frequently, to get feedback. Maybe that means we apply additive
> changes to 2.4.x; maybe that means making another 2.5 release sooner. I
> will start a separate thread about it.
> >
> >
> >
> > On Thu, Sep 6, 2018 at 9:31 AM Sean Owen <sr...@gmail.com> wrote:
> > I think this doesn't necessarily mean 3.0 is coming soon (thoughts on
> timing? 6 months?) but simply next. Do you mean you'd prefer that change to
> happen before 3.x? if it's a significant change, seems reasonable for a
> major version bump rather than minor. Is the concern that tying it to 3.0
> means you have to take a major version update to get it?
> >
> > I generally support moving on to 3.x so we can also jettison a lot of
> older dependencies, code, fix some long standing issues, etc.
> >
> > (BTW Scala 2.12 support, mentioned in the OP, will go in for 2.4)
> >
> > On Thu, Sep 6, 2018 at 9:10 AM Ryan Blue <rb...@netflix.com.invalid>
> wrote:
> > My concern is that the v2 data source API is still evolving and not very
> close to stable. I had hoped to have stabilized the API and behaviors for a
> 3.0 release. But we could also wait on that for a 4.0 release, depending on
> when we think that will be.
> >
> > Unless there is a pressing need to move to 3.0 for some other area, I
> think it would be better for the v2 sources to have a 2.5 release.
> >
> > On Thu, Sep 6, 2018 at 8:59 AM Xiao Li <ga...@gmail.com> wrote:
> > Yesterday, the 2.4 branch was created. Based on the above discussion, I
> think we can bump the master branch to 3.0.0-SNAPSHOT. Any concern?
> >
> >
> >
> > --
> > Ryan Blue
> > Software Engineer
> > Netflix
> >
> >
> > --
> > Ryan Blue
> > Software Engineer
> > Netflix
>
>
> ---------------------------------------------------------------------
> To unsubscribe e-mail: dev-unsubscribe@spark.apache.org
>
>

Re: time for Apache Spark 3.0?

Posted by Matt Cheah <mc...@palantir.com>.
I wanted to clarify what categories of APIs are eligible to be broken in Spark 3.0. Specifically:

1. Are we removing all deprecated methods? If we’re only removing some subset of deprecated methods, what is that subset? I see a bunch were removed in https://github.com/apache/spark/pull/22921 for example. Are we only committed to removing methods that were deprecated in some Spark version and earlier?
2. Aside from removing support for Scala 2.11, what other kinds of (non-experimental and non-evolving) APIs are eligible to be broken?
3. Is there going to be a way to track the current list of all proposed breaking changes / JIRA tickets? Perhaps we can include it in a JIRA ticket that can be filtered down somehow?

Thanks,

 

-Matt Cheah
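
For what it's worth on the first question above: the removal candidates are the methods that already carry a deprecation annotation in the codebase. A minimal sketch of what that looks like, using a hypothetical class and method rather than an actual Spark API:

// Illustrative only: the class and method are made up, but the mechanism is the
// standard Scala @deprecated annotation that marks a method as a removal candidate.
class LineCounter {
  @deprecated("Use countNonEmpty instead", "2.0.0")
  def lineCount(lines: Seq[String]): Int = countNonEmpty(lines)

  // Replacement that callers are expected to migrate to before the deprecated
  // method is removed in a major release.
  def countNonEmpty(lines: Seq[String]): Int = lines.count(_.nonEmpty)
}

Callers of lineCount get a compile-time deprecation warning today; removing the method in 3.0 would turn that warning into a compile error.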



Re: time for Apache Spark 3.0?

Posted by Vinoo Ganesh <vg...@palantir.com>.
Makes sense, thanks Reynold.


Re: time for Apache Spark 3.0?

Posted by Reynold Xin <rx...@databricks.com>.
Master branch now tracks the 3.0.0-SNAPSHOT version, so the next one will be
3.0. In terms of timing, unless we change anything specifically, Spark
feature releases are on a 6-month cadence. Spark 2.4 was just released last
week, so 3.0 will be roughly 6 months from now.
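
For anyone who wants to double-check which line a given build tracks, the version string is visible at runtime. A minimal sketch, assuming a Spark build (for example a master snapshot) is on the classpath:

// Prints the version of whatever Spark build is on the classpath, e.g.
// "3.0.0-SNAPSHOT" for a current master build or "2.4.0" for branch-2.4.
import org.apache.spark.sql.SparkSession

object PrintSparkVersion {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("print-spark-version")
      .getOrCreate()
    println(s"Spark version: ${spark.version}")
    spark.stop()
  }
}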


Re: time for Apache Spark 3.0?

Posted by Vinoo Ganesh <vg...@palantir.com>.
Quickly following up on this – is there a target date for when Spark 3.0 may be released and/or a list of the likely api breaks that are anticipated?


Re: time for Apache Spark 3.0?

Posted by Xiao Li <ga...@gmail.com>.
Yes. We should create a SPIP for each major breaking change.


Re: time for Apache Spark 3.0?

Posted by Reynold Xin <rx...@databricks.com>.
i think we should create spips for some of them, since they are pretty
large ... i can create some tickets to start with

--
excuse the brevity and lower case due to wrist injury



Re: time for Apache Spark 3.0?

Posted by Xiao Li <ga...@gmail.com>.
Based on the above discussions, we have a "rough consensus" that the next
release will be 3.0. Now, we can start working on the API breaking changes
(e.g., the ones mentioned in the original email from Reynold).

Cheers,

Xiao


Re: time for Apache Spark 3.0?

Posted by Matei Zaharia <ma...@gmail.com>.
Yes, you can start with Unstable and move to Evolving and Stable when needed. We’ve definitely had experimental features that changed across maintenance releases when they were well-isolated. If your change risks breaking stuff in stable components of Spark though, then it probably won’t be suitable for that.
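
A minimal sketch of what that progression looks like in code, using the annotations from the InterfaceStability class linked elsewhere in this thread; the trait itself is hypothetical, not an actual DSv2 interface:

// Hypothetical trait used only to illustrate the lifecycle described above; the
// annotations come from org.apache.spark.annotation.InterfaceStability in the
// common/tags module.
import org.apache.spark.annotation.InterfaceStability

@InterfaceStability.Unstable // no stability guarantee yet; free to change while feedback comes in
trait ExampleReadSupport {
  def readerName(): String
}

// As the design settles, the same trait would be re-annotated
// @InterfaceStability.Evolving (may still change between feature releases) and
// eventually @InterfaceStability.Stable (compatible until the next major release).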



Re: time for Apache Spark 3.0?

Posted by Ryan Blue <rb...@netflix.com.INVALID>.
I meant flexibility beyond the point releases. I think what Reynold was
suggesting was getting v2 code out more often than the point releases every
6 months. An Evolving API can change in point releases, but maybe we should
move v2 to Unstable so it can change more often? I don't really see another
way to get changes out more often.


-- 
Ryan Blue
Software Engineer
Netflix

Re: time for Apache Spark 3.0?

Posted by Mark Hamstra <ma...@clearstorydata.com>.
Yes, that is why we have these annotations in the code and the
corresponding labels appearing in the API documentation:
https://github.com/apache/spark/blob/master/common/tags/src/main/java/org/apache/spark/annotation/InterfaceStability.java

As long as it is properly annotated, we can change or even eliminate an API
method before the next major release. And frankly, we shouldn't be
contemplating bringing in the DS v2 API (and, I'd argue, *any* new API)
without such an annotation. There is just too much risk of not getting
everything right before we see the results of the new API being more widely
used, and too much cost in maintaining something we come to regret until the
next major release, for us to create a new API in a fully frozen state.



Re: time for Apache Spark 3.0?

Posted by Ryan Blue <rb...@netflix.com.INVALID>.
It would be great to get more features out incrementally. For experimental
features, do we have more relaxed constraints?


-- 
Ryan Blue
Software Engineer
Netflix

Re: time for Apache Spark 3.0?

Posted by Reynold Xin <rx...@databricks.com>.
+1 on 3.0

Dsv2 stable can still evolve across major releases. DataFrame, Dataset,
dsv1 and a lot of other major features all were developed throughout the
1.x and 2.x lines.

I do want to explore ways for us to get dsv2 incremental changes out there
more frequently, to get feedback. Maybe that means we apply additive
changes to 2.4.x; maybe that means making another 2.5 release sooner. I
will start a separate thread about it.




Re: time for Apache Spark 3.0?

Posted by Sean Owen <sr...@gmail.com>.
I think this doesn't necessarily mean 3.0 is coming soon (thoughts on
timing? 6 months?) but simply next. Do you mean you'd prefer that change to
happen before 3.x? If it's a significant change, it seems reasonable for a
major version bump rather than minor. Is the concern that tying it to 3.0
means you have to take a major version update to get it?

I generally support moving on to 3.x so we can also jettison a lot of older
dependencies, code, fix some long standing issues, etc.

(BTW Scala 2.12 support, mentioned in the OP, will go in for 2.4)


Re: time for Apache Spark 3.0?

Posted by Ryan Blue <rb...@netflix.com.INVALID>.
My concern is that the v2 data source API is still evolving and not very
close to stable. I had hoped to have stabilized the API and behaviors for a
3.0 release. But we could also wait on that for a 4.0 release, depending on
when we think that will be.

Unless there is a pressing need to move to 3.0 for some other area, I think
it would be better for the v2 sources to have a 2.5 release.

On Thu, Sep 6, 2018 at 8:59 AM Xiao Li <ga...@gmail.com> wrote:

> Yesterday, the 2.4 branch was created. Based on the above discussion, I
> think we can bump the master branch to 3.0.0-SNAPSHOT. Any concern?
>
> Thanks,
>
> Xiao
>
> vaquar khan <va...@gmail.com> wrote on Saturday, June 16, 2018 at 10:21 AM:
>
>> +1  for 2.4 next, followed by 3.0.
>>
>> Where can we get the Apache Spark road map for 2.4 and 2.5 .... 3.0?
>> Is it possible to share the proposed specification for future releases, the
>> same way as for past releases (
>> https://spark.apache.org/releases/spark-release-2-3-0.html)?
>> Regards,
>> Viquar khan
>>
>> On Sat, Jun 16, 2018 at 12:02 PM, vaquar khan <va...@gmail.com>
>> wrote:
>>
>>> Please ignore the YouTube link in my last email; not sure how it got added.
>>> Apologies, I am not sure how to delete it.
>>>
>>>
>>> On Sat, Jun 16, 2018 at 11:58 AM, vaquar khan <va...@gmail.com>
>>> wrote:
>>>
>>>> +1
>>>>
>>>> https://www.youtube.com/watch?v=-ik7aJ5U6kg
>>>>
>>>> Regards,
>>>> Vaquar khan
>>>>
>>>> On Fri, Jun 15, 2018 at 4:55 PM, Reynold Xin <rx...@databricks.com>
>>>> wrote:
>>>>
>>>>> Yes. At this rate I think it's better to do 2.4 next, followed by 3.0.
>>>>>
>>>>>
>>>>> On Fri, Jun 15, 2018 at 10:52 AM Mridul Muralidharan <mr...@gmail.com>
>>>>> wrote:
>>>>>
>>>>>> I agree, I don't see a pressing need for a major version bump either.
>>>>>>
>>>>>>
>>>>>> Regards,
>>>>>> Mridul
>>>>>> On Fri, Jun 15, 2018 at 10:25 AM Mark Hamstra <
>>>>>> mark@clearstorydata.com> wrote:
>>>>>> >
>>>>>> > Changing major version numbers is not about new features or a vague
>>>>>> notion that it is time to do something that will be seen to be a
>>>>>> significant release. It is about breaking stable public APIs.
>>>>>> >
>>>>>> > I still remain unconvinced that the next version can't be 2.4.0.
>>>>>> >
>>>>>> > On Fri, Jun 15, 2018 at 1:34 AM Andy <an...@gmail.com> wrote:
>>>>>> >>
>>>>>> >> Dear all:
>>>>>> >>
>>>>>> >> It have been 2 months since this topic being proposed. Any
>>>>>> progress now? 2018 has been passed about 1/2.
>>>>>> >>
>>>>>> >> I agree with that the new version should be some exciting new
>>>>>> feature. How about this one:
>>>>>> >>
>>>>>> >> 6. ML/DL framework to be integrated as core component and feature.
>>>>>> (Such as Angel / BigDL / ……)
>>>>>> >>
>>>>>> >> 3.0 is a very important version for an good open source project.
>>>>>> It should be better to drift away the historical burden and focus in new
>>>>>> area. Spark has been widely used all over the world as a successful big
>>>>>> data framework. And it can be better than that.
>>>>>> >>
>>>>>> >> Andy
>>>>>> >>
>>>>>> >>
>>>>>> >> On Thu, Apr 5, 2018 at 7:20 AM Reynold Xin <rx...@databricks.com>
>>>>>> wrote:
>>>>>> >>>
>>>>>> >>> There was a discussion thread on scala-contributors about Apache
>>>>>> Spark not yet supporting Scala 2.12, and that got me to think perhaps it is
>>>>>> about time for Spark to work towards the 3.0 release. By the time it comes
>>>>>> out, it will be more than 2 years since Spark 2.0.
>>>>>> >>>
>>>>>> >>> For contributors less familiar with Spark’s history, I want to
>>>>>> give more context on Spark releases:
>>>>>> >>>
>>>>>> >>> 1. Timeline: Spark 1.0 was released May 2014. Spark 2.0 was July
>>>>>> 2016. If we were to maintain the ~ 2 year cadence, it is time to work on
>>>>>> Spark 3.0 in 2018.
>>>>>> >>>
>>>>>> >>> 2. Spark’s versioning policy promises that Spark does not break
>>>>>> stable APIs in feature releases (e.g. 2.1, 2.2). API breaking changes are
>>>>>> sometimes a necessary evil, and can be done in major releases (e.g. 1.6 to
>>>>>> 2.0, 2.x to 3.0).
>>>>>> >>>
>>>>>> >>> 3. That said, a major version isn’t necessarily the playground
>>>>>> for disruptive API changes to make it painful for users to update. The main
>>>>>> purpose of a major release is an opportunity to fix things that are broken
>>>>>> in the current API and remove certain deprecated APIs.
>>>>>> >>>
>>>>>> >>> 4. Spark as a project has a culture of evolving architecture and
>>>>>> developing major new features incrementally, so major releases are not the
>>>>>> only time for exciting new features. For example, the bulk of the work in
>>>>>> the move towards the DataFrame API was done in Spark 1.3, and Continuous
>>>>>> Processing was introduced in Spark 2.3. Both were feature releases rather
>>>>>> than major releases.
>>>>>> >>>
>>>>>> >>>
>>>>>> >>> You can find more background in the thread discussing Spark 2.0:
>>>>>> http://apache-spark-developers-list.1001551.n3.nabble.com/A-proposal-for-Spark-2-0-td15122.html
>>>>>> >>>
>>>>>> >>>
>>>>>> >>> The primary motivating factor IMO for a major version bump is to
>>>>>> support Scala 2.12, which requires minor API breaking changes to Spark’s
>>>>>> APIs. Similar to Spark 2.0, I think there are also opportunities for other
>>>>>> changes that we know have been biting us for a long time but can’t be
>>>>>> changed in feature releases (to be clear, I’m actually not sure they are
>>>>>> all good ideas, but I’m writing them down as candidates for consideration):
>>>>>> >>>
>>>>>> >>> 1. Support Scala 2.12.
>>>>>> >>>
>>>>>> >>> 2. Remove interfaces, configs, and modules (e.g. Bagel)
>>>>>> deprecated in Spark 2.x.
>>>>>> >>>
>>>>>> >>> 3. Shade all dependencies.
>>>>>> >>>
>>>>>> >>> 4. Change the reserved keywords in Spark SQL to be more ANSI-SQL
>>>>>> compliant, to prevent users from shooting themselves in the foot, e.g.
>>>>>> “SELECT 2 SECOND” -- is “SECOND” an interval unit or an alias? To make it
>>>>>> less painful for users to upgrade here, I’d suggest creating a flag for
>>>>>> backward compatibility mode.
>>>>>> >>>
>>>>>> >>> 5. Similar to 4, make our type coercion rule in DataFrame/SQL
>>>>>> more standard compliant, and have a flag for backward compatibility.
>>>>>> >>>
>>>>>> >>> 6. Miscellaneous other small changes documented in JIRA already
>>>>>> (e.g. “JavaPairRDD flatMapValues requires function returning Iterable, not
>>>>>> Iterator”, “Prevent column name duplication in temporary view”).
>>>>>> >>>
>>>>>> >>>
>>>>>> >>> Now the reality of a major version bump is that the world often
>>>>>> thinks in terms of what exciting features are coming. I do think there are
>>>>>> a number of major changes happening already that can be part of the 3.0
>>>>>> release, if they make it in:
>>>>>> >>>
>>>>>> >>> 1. Scala 2.12 support (listing it twice)
>>>>>> >>> 2. Continuous Processing non-experimental
>>>>>> >>> 3. Kubernetes support non-experimental
>>>>>> >>> 4. A more flushed out version of data source API v2 (I don’t
>>>>>> think it is realistic to stabilize that in one release)
>>>>>> >>> 5. Hadoop 3.0 support
>>>>>> >>> 6. ...
>>>>>> >>>
>>>>>> >>>
>>>>>> >>>
>>>>>> >>> Similar to the 2.0 discussion, this thread should focus on the
>>>>>> framework and whether it’d make sense to create Spark 3.0 as the next
>>>>>> release, rather than the individual feature requests. Those are important
>>>>>> but are best done in their own separate threads.
>>>>>> >>>
>>>>>> >>>
>>>>>> >>>
>>>>>> >>>
>>>>>>
>>>>>
>>>>
>>>>
>>>> --
>>>> Regards,
>>>> Vaquar Khan
>>>> +1 -224-436-0783
>>>> Greater Chicago
>>>>
>>>
>>>
>>>
>>> --
>>> Regards,
>>> Vaquar Khan
>>> +1 -224-436-0783
>>> Greater Chicago
>>>
>>
>>
>>
>> --
>> Regards,
>> Vaquar Khan
>> +1 -224-436-0783
>> Greater Chicago
>>
>

-- 
Ryan Blue
Software Engineer
Netflix

Re: time for Apache Spark 3.0?

Posted by Xiao Li <ga...@gmail.com>.
Yesterday, the 2.4 branch was created. Based on the above discussion, I
think we can bump the master branch to 3.0.0-SNAPSHOT. Any concerns?

Thanks,

Xiao
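
For anyone who wants to confirm which version string a build of master
reports after such a bump, here is a minimal sketch (assuming only a local
Spark build on the classpath; the object and app names below are
illustrative, not part of Spark):

    import org.apache.spark.sql.SparkSession

    object CheckSnapshotVersion {
      def main(args: Array[String]): Unit = {
        // Throwaway local session, used only to read the compiled-in version string.
        val spark = SparkSession.builder()
          .master("local[1]")
          .appName("check-snapshot-version")
          .getOrCreate()

        // After the proposed bump, a build of master would report "3.0.0-SNAPSHOT",
        // while branch-2.4 builds would report a 2.4.x version.
        println(s"Spark version: ${spark.version}")

        spark.stop()
      }
    }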

On Sat, Jun 16, 2018 at 10:21 AM, vaquar khan <va...@gmail.com> wrote:

> +1  for 2.4 next, followed by 3.0.
>
> Where we can get Apache Spark road map for 2.4 and 2.5 .... 3.0 ?
> is it possible we can share future release proposed specification same
> like  releases (https://spark.apache.org/releases/spark-release-2-3-0.html
> )
> Regards,
> Viquar khan
>
> On Sat, Jun 16, 2018 at 12:02 PM, vaquar khan <va...@gmail.com>
> wrote:
>
>> Plz ignore last email link (you tube )not sure how it added .
>> Apologies not sure how to delete it.
>>
>>
>> On Sat, Jun 16, 2018 at 11:58 AM, vaquar khan <va...@gmail.com>
>> wrote:
>>
>>> +1
>>>
>>> https://www.youtube.com/watch?v=-ik7aJ5U6kg
>>>
>>> Regards,
>>> Vaquar khan
>>>
>>> On Fri, Jun 15, 2018 at 4:55 PM, Reynold Xin <rx...@databricks.com>
>>> wrote:
>>>
>>>> Yes. At this rate I think it's better to do 2.4 next, followed by 3.0.
>>>>
>>>>
>>>> On Fri, Jun 15, 2018 at 10:52 AM Mridul Muralidharan <mr...@gmail.com>
>>>> wrote:
>>>>
>>>>> I agree, I dont see pressing need for major version bump as well.
>>>>>
>>>>>
>>>>> Regards,
>>>>> Mridul
>>>>> On Fri, Jun 15, 2018 at 10:25 AM Mark Hamstra <ma...@clearstorydata.com>
>>>>> wrote:
>>>>> >
>>>>> > Changing major version numbers is not about new features or a vague
>>>>> notion that it is time to do something that will be seen to be a
>>>>> significant release. It is about breaking stable public APIs.
>>>>> >
>>>>> > I still remain unconvinced that the next version can't be 2.4.0.
>>>>> >
>>>>> > On Fri, Jun 15, 2018 at 1:34 AM Andy <an...@gmail.com> wrote:
>>>>> >>
>>>>> >> Dear all:
>>>>> >>
>>>>> >> It have been 2 months since this topic being proposed. Any progress
>>>>> now? 2018 has been passed about 1/2.
>>>>> >>
>>>>> >> I agree with that the new version should be some exciting new
>>>>> feature. How about this one:
>>>>> >>
>>>>> >> 6. ML/DL framework to be integrated as core component and feature.
>>>>> (Such as Angel / BigDL / ……)
>>>>> >>
>>>>> >> 3.0 is a very important version for an good open source project. It
>>>>> should be better to drift away the historical burden and focus in new area.
>>>>> Spark has been widely used all over the world as a successful big data
>>>>> framework. And it can be better than that.
>>>>> >>
>>>>> >> Andy
>>>>> >>
>>>>> >>
>>>>> >> On Thu, Apr 5, 2018 at 7:20 AM Reynold Xin <rx...@databricks.com>
>>>>> wrote:
>>>>> >>>
>>>>> >>> There was a discussion thread on scala-contributors about Apache
>>>>> Spark not yet supporting Scala 2.12, and that got me to think perhaps it is
>>>>> about time for Spark to work towards the 3.0 release. By the time it comes
>>>>> out, it will be more than 2 years since Spark 2.0.
>>>>> >>>
>>>>> >>> For contributors less familiar with Spark’s history, I want to
>>>>> give more context on Spark releases:
>>>>> >>>
>>>>> >>> 1. Timeline: Spark 1.0 was released May 2014. Spark 2.0 was July
>>>>> 2016. If we were to maintain the ~ 2 year cadence, it is time to work on
>>>>> Spark 3.0 in 2018.
>>>>> >>>
>>>>> >>> 2. Spark’s versioning policy promises that Spark does not break
>>>>> stable APIs in feature releases (e.g. 2.1, 2.2). API breaking changes are
>>>>> sometimes a necessary evil, and can be done in major releases (e.g. 1.6 to
>>>>> 2.0, 2.x to 3.0).
>>>>> >>>
>>>>> >>> 3. That said, a major version isn’t necessarily the playground for
>>>>> disruptive API changes to make it painful for users to update. The main
>>>>> purpose of a major release is an opportunity to fix things that are broken
>>>>> in the current API and remove certain deprecated APIs.
>>>>> >>>
>>>>> >>> 4. Spark as a project has a culture of evolving architecture and
>>>>> developing major new features incrementally, so major releases are not the
>>>>> only time for exciting new features. For example, the bulk of the work in
>>>>> the move towards the DataFrame API was done in Spark 1.3, and Continuous
>>>>> Processing was introduced in Spark 2.3. Both were feature releases rather
>>>>> than major releases.
>>>>> >>>
>>>>> >>>
>>>>> >>> You can find more background in the thread discussing Spark 2.0:
>>>>> http://apache-spark-developers-list.1001551.n3.nabble.com/A-proposal-for-Spark-2-0-td15122.html
>>>>> >>>
>>>>> >>>
>>>>> >>> The primary motivating factor IMO for a major version bump is to
>>>>> support Scala 2.12, which requires minor API breaking changes to Spark’s
>>>>> APIs. Similar to Spark 2.0, I think there are also opportunities for other
>>>>> changes that we know have been biting us for a long time but can’t be
>>>>> changed in feature releases (to be clear, I’m actually not sure they are
>>>>> all good ideas, but I’m writing them down as candidates for consideration):
>>>>> >>>
>>>>> >>> 1. Support Scala 2.12.
>>>>> >>>
>>>>> >>> 2. Remove interfaces, configs, and modules (e.g. Bagel) deprecated
>>>>> in Spark 2.x.
>>>>> >>>
>>>>> >>> 3. Shade all dependencies.
>>>>> >>>
>>>>> >>> 4. Change the reserved keywords in Spark SQL to be more ANSI-SQL
>>>>> compliant, to prevent users from shooting themselves in the foot, e.g.
>>>>> “SELECT 2 SECOND” -- is “SECOND” an interval unit or an alias? To make it
>>>>> less painful for users to upgrade here, I’d suggest creating a flag for
>>>>> backward compatibility mode.
>>>>> >>>
>>>>> >>> 5. Similar to 4, make our type coercion rule in DataFrame/SQL more
>>>>> standard compliant, and have a flag for backward compatibility.
>>>>> >>>
>>>>> >>> 6. Miscellaneous other small changes documented in JIRA already
>>>>> (e.g. “JavaPairRDD flatMapValues requires function returning Iterable, not
>>>>> Iterator”, “Prevent column name duplication in temporary view”).
>>>>> >>>
>>>>> >>>
>>>>> >>> Now the reality of a major version bump is that the world often
>>>>> thinks in terms of what exciting features are coming. I do think there are
>>>>> a number of major changes happening already that can be part of the 3.0
>>>>> release, if they make it in:
>>>>> >>>
>>>>> >>> 1. Scala 2.12 support (listing it twice)
>>>>> >>> 2. Continuous Processing non-experimental
>>>>> >>> 3. Kubernetes support non-experimental
>>>>> >>> 4. A more flushed out version of data source API v2 (I don’t think
>>>>> it is realistic to stabilize that in one release)
>>>>> >>> 5. Hadoop 3.0 support
>>>>> >>> 6. ...
>>>>> >>>
>>>>> >>>
>>>>> >>>
>>>>> >>> Similar to the 2.0 discussion, this thread should focus on the
>>>>> framework and whether it’d make sense to create Spark 3.0 as the next
>>>>> release, rather than the individual feature requests. Those are important
>>>>> but are best done in their own separate threads.
>>>>> >>>
>>>>> >>>
>>>>> >>>
>>>>> >>>
>>>>>
>>>>
>>>
>>>
>>> --
>>> Regards,
>>> Vaquar Khan
>>> +1 -224-436-0783
>>> Greater Chicago
>>>
>>
>>
>>
>> --
>> Regards,
>> Vaquar Khan
>> +1 -224-436-0783
>> Greater Chicago
>>
>
>
>
> --
> Regards,
> Vaquar Khan
> +1 -224-436-0783
> Greater Chicago
>

Re: time for Apache Spark 3.0?

Posted by vaquar khan <va...@gmail.com>.
+1 for 2.4 next, followed by 3.0.

Where can we find the Apache Spark roadmap for 2.4 and 2.5 .... 3.0?
Is it possible to share a proposed specification for future releases,
similar to the existing release notes
(https://spark.apache.org/releases/spark-release-2-3-0.html)?
Regards,
Vaquar khan

On Sat, Jun 16, 2018 at 12:02 PM, vaquar khan <va...@gmail.com> wrote:

> Plz ignore last email link (you tube )not sure how it added .
> Apologies not sure how to delete it.
>
>
> On Sat, Jun 16, 2018 at 11:58 AM, vaquar khan <va...@gmail.com>
> wrote:
>
>> +1
>>
>> https://www.youtube.com/watch?v=-ik7aJ5U6kg
>>
>> Regards,
>> Vaquar khan
>>
>> On Fri, Jun 15, 2018 at 4:55 PM, Reynold Xin <rx...@databricks.com> wrote:
>>
>>> Yes. At this rate I think it's better to do 2.4 next, followed by 3.0.
>>>
>>>
>>> On Fri, Jun 15, 2018 at 10:52 AM Mridul Muralidharan <mr...@gmail.com>
>>> wrote:
>>>
>>>> I agree, I dont see pressing need for major version bump as well.
>>>>
>>>>
>>>> Regards,
>>>> Mridul
>>>> On Fri, Jun 15, 2018 at 10:25 AM Mark Hamstra <ma...@clearstorydata.com>
>>>> wrote:
>>>> >
>>>> > Changing major version numbers is not about new features or a vague
>>>> notion that it is time to do something that will be seen to be a
>>>> significant release. It is about breaking stable public APIs.
>>>> >
>>>> > I still remain unconvinced that the next version can't be 2.4.0.
>>>> >
>>>> > On Fri, Jun 15, 2018 at 1:34 AM Andy <an...@gmail.com> wrote:
>>>> >>
>>>> >> Dear all:
>>>> >>
>>>> >> It have been 2 months since this topic being proposed. Any progress
>>>> now? 2018 has been passed about 1/2.
>>>> >>
>>>> >> I agree with that the new version should be some exciting new
>>>> feature. How about this one:
>>>> >>
>>>> >> 6. ML/DL framework to be integrated as core component and feature.
>>>> (Such as Angel / BigDL / ……)
>>>> >>
>>>> >> 3.0 is a very important version for an good open source project. It
>>>> should be better to drift away the historical burden and focus in new area.
>>>> Spark has been widely used all over the world as a successful big data
>>>> framework. And it can be better than that.
>>>> >>
>>>> >> Andy
>>>> >>
>>>> >>
>>>> >> On Thu, Apr 5, 2018 at 7:20 AM Reynold Xin <rx...@databricks.com>
>>>> wrote:
>>>> >>>
>>>> >>> There was a discussion thread on scala-contributors about Apache
>>>> Spark not yet supporting Scala 2.12, and that got me to think perhaps it is
>>>> about time for Spark to work towards the 3.0 release. By the time it comes
>>>> out, it will be more than 2 years since Spark 2.0.
>>>> >>>
>>>> >>> For contributors less familiar with Spark’s history, I want to give
>>>> more context on Spark releases:
>>>> >>>
>>>> >>> 1. Timeline: Spark 1.0 was released May 2014. Spark 2.0 was July
>>>> 2016. If we were to maintain the ~ 2 year cadence, it is time to work on
>>>> Spark 3.0 in 2018.
>>>> >>>
>>>> >>> 2. Spark’s versioning policy promises that Spark does not break
>>>> stable APIs in feature releases (e.g. 2.1, 2.2). API breaking changes are
>>>> sometimes a necessary evil, and can be done in major releases (e.g. 1.6 to
>>>> 2.0, 2.x to 3.0).
>>>> >>>
>>>> >>> 3. That said, a major version isn’t necessarily the playground for
>>>> disruptive API changes to make it painful for users to update. The main
>>>> purpose of a major release is an opportunity to fix things that are broken
>>>> in the current API and remove certain deprecated APIs.
>>>> >>>
>>>> >>> 4. Spark as a project has a culture of evolving architecture and
>>>> developing major new features incrementally, so major releases are not the
>>>> only time for exciting new features. For example, the bulk of the work in
>>>> the move towards the DataFrame API was done in Spark 1.3, and Continuous
>>>> Processing was introduced in Spark 2.3. Both were feature releases rather
>>>> than major releases.
>>>> >>>
>>>> >>>
>>>> >>> You can find more background in the thread discussing Spark 2.0:
>>>> http://apache-spark-developers-list.1001551.n3.nabble.com/A-
>>>> proposal-for-Spark-2-0-td15122.html
>>>> >>>
>>>> >>>
>>>> >>> The primary motivating factor IMO for a major version bump is to
>>>> support Scala 2.12, which requires minor API breaking changes to Spark’s
>>>> APIs. Similar to Spark 2.0, I think there are also opportunities for other
>>>> changes that we know have been biting us for a long time but can’t be
>>>> changed in feature releases (to be clear, I’m actually not sure they are
>>>> all good ideas, but I’m writing them down as candidates for consideration):
>>>> >>>
>>>> >>> 1. Support Scala 2.12.
>>>> >>>
>>>> >>> 2. Remove interfaces, configs, and modules (e.g. Bagel) deprecated
>>>> in Spark 2.x.
>>>> >>>
>>>> >>> 3. Shade all dependencies.
>>>> >>>
>>>> >>> 4. Change the reserved keywords in Spark SQL to be more ANSI-SQL
>>>> compliant, to prevent users from shooting themselves in the foot, e.g.
>>>> “SELECT 2 SECOND” -- is “SECOND” an interval unit or an alias? To make it
>>>> less painful for users to upgrade here, I’d suggest creating a flag for
>>>> backward compatibility mode.
>>>> >>>
>>>> >>> 5. Similar to 4, make our type coercion rule in DataFrame/SQL more
>>>> standard compliant, and have a flag for backward compatibility.
>>>> >>>
>>>> >>> 6. Miscellaneous other small changes documented in JIRA already
>>>> (e.g. “JavaPairRDD flatMapValues requires function returning Iterable, not
>>>> Iterator”, “Prevent column name duplication in temporary view”).
>>>> >>>
>>>> >>>
>>>> >>> Now the reality of a major version bump is that the world often
>>>> thinks in terms of what exciting features are coming. I do think there are
>>>> a number of major changes happening already that can be part of the 3.0
>>>> release, if they make it in:
>>>> >>>
>>>> >>> 1. Scala 2.12 support (listing it twice)
>>>> >>> 2. Continuous Processing non-experimental
>>>> >>> 3. Kubernetes support non-experimental
>>>> >>> 4. A more flushed out version of data source API v2 (I don’t think
>>>> it is realistic to stabilize that in one release)
>>>> >>> 5. Hadoop 3.0 support
>>>> >>> 6. ...
>>>> >>>
>>>> >>>
>>>> >>>
>>>> >>> Similar to the 2.0 discussion, this thread should focus on the
>>>> framework and whether it’d make sense to create Spark 3.0 as the next
>>>> release, rather than the individual feature requests. Those are important
>>>> but are best done in their own separate threads.
>>>> >>>
>>>> >>>
>>>> >>>
>>>> >>>
>>>>
>>>
>>
>>
>> --
>> Regards,
>> Vaquar Khan
>> +1 -224-436-0783
>> Greater Chicago
>>
>
>
>
> --
> Regards,
> Vaquar Khan
> +1 -224-436-0783
> Greater Chicago
>



-- 
Regards,
Vaquar Khan
+1 -224-436-0783
Greater Chicago

Re: time for Apache Spark 3.0?

Posted by vaquar khan <va...@gmail.com>.
Please ignore the YouTube link in my last email; I'm not sure how it got
added. Apologies, I'm not sure how to delete it.


On Sat, Jun 16, 2018 at 11:58 AM, vaquar khan <va...@gmail.com> wrote:

> +1
>
> https://www.youtube.com/watch?v=-ik7aJ5U6kg
>
> Regards,
> Vaquar khan
>
> On Fri, Jun 15, 2018 at 4:55 PM, Reynold Xin <rx...@databricks.com> wrote:
>
>> Yes. At this rate I think it's better to do 2.4 next, followed by 3.0.
>>
>>
>> On Fri, Jun 15, 2018 at 10:52 AM Mridul Muralidharan <mr...@gmail.com>
>> wrote:
>>
>>> I agree, I dont see pressing need for major version bump as well.
>>>
>>>
>>> Regards,
>>> Mridul
>>> On Fri, Jun 15, 2018 at 10:25 AM Mark Hamstra <ma...@clearstorydata.com>
>>> wrote:
>>> >
>>> > Changing major version numbers is not about new features or a vague
>>> notion that it is time to do something that will be seen to be a
>>> significant release. It is about breaking stable public APIs.
>>> >
>>> > I still remain unconvinced that the next version can't be 2.4.0.
>>> >
>>> > On Fri, Jun 15, 2018 at 1:34 AM Andy <an...@gmail.com> wrote:
>>> >>
>>> >> Dear all:
>>> >>
>>> >> It have been 2 months since this topic being proposed. Any progress
>>> now? 2018 has been passed about 1/2.
>>> >>
>>> >> I agree with that the new version should be some exciting new
>>> feature. How about this one:
>>> >>
>>> >> 6. ML/DL framework to be integrated as core component and feature.
>>> (Such as Angel / BigDL / ……)
>>> >>
>>> >> 3.0 is a very important version for an good open source project. It
>>> should be better to drift away the historical burden and focus in new area.
>>> Spark has been widely used all over the world as a successful big data
>>> framework. And it can be better than that.
>>> >>
>>> >> Andy
>>> >>
>>> >>
>>> >> On Thu, Apr 5, 2018 at 7:20 AM Reynold Xin <rx...@databricks.com>
>>> wrote:
>>> >>>
>>> >>> There was a discussion thread on scala-contributors about Apache
>>> Spark not yet supporting Scala 2.12, and that got me to think perhaps it is
>>> about time for Spark to work towards the 3.0 release. By the time it comes
>>> out, it will be more than 2 years since Spark 2.0.
>>> >>>
>>> >>> For contributors less familiar with Spark’s history, I want to give
>>> more context on Spark releases:
>>> >>>
>>> >>> 1. Timeline: Spark 1.0 was released May 2014. Spark 2.0 was July
>>> 2016. If we were to maintain the ~ 2 year cadence, it is time to work on
>>> Spark 3.0 in 2018.
>>> >>>
>>> >>> 2. Spark’s versioning policy promises that Spark does not break
>>> stable APIs in feature releases (e.g. 2.1, 2.2). API breaking changes are
>>> sometimes a necessary evil, and can be done in major releases (e.g. 1.6 to
>>> 2.0, 2.x to 3.0).
>>> >>>
>>> >>> 3. That said, a major version isn’t necessarily the playground for
>>> disruptive API changes to make it painful for users to update. The main
>>> purpose of a major release is an opportunity to fix things that are broken
>>> in the current API and remove certain deprecated APIs.
>>> >>>
>>> >>> 4. Spark as a project has a culture of evolving architecture and
>>> developing major new features incrementally, so major releases are not the
>>> only time for exciting new features. For example, the bulk of the work in
>>> the move towards the DataFrame API was done in Spark 1.3, and Continuous
>>> Processing was introduced in Spark 2.3. Both were feature releases rather
>>> than major releases.
>>> >>>
>>> >>>
>>> >>> You can find more background in the thread discussing Spark 2.0:
>>> http://apache-spark-developers-list.1001551.n3.nabble.com/A-
>>> proposal-for-Spark-2-0-td15122.html
>>> >>>
>>> >>>
>>> >>> The primary motivating factor IMO for a major version bump is to
>>> support Scala 2.12, which requires minor API breaking changes to Spark’s
>>> APIs. Similar to Spark 2.0, I think there are also opportunities for other
>>> changes that we know have been biting us for a long time but can’t be
>>> changed in feature releases (to be clear, I’m actually not sure they are
>>> all good ideas, but I’m writing them down as candidates for consideration):
>>> >>>
>>> >>> 1. Support Scala 2.12.
>>> >>>
>>> >>> 2. Remove interfaces, configs, and modules (e.g. Bagel) deprecated
>>> in Spark 2.x.
>>> >>>
>>> >>> 3. Shade all dependencies.
>>> >>>
>>> >>> 4. Change the reserved keywords in Spark SQL to be more ANSI-SQL
>>> compliant, to prevent users from shooting themselves in the foot, e.g.
>>> “SELECT 2 SECOND” -- is “SECOND” an interval unit or an alias? To make it
>>> less painful for users to upgrade here, I’d suggest creating a flag for
>>> backward compatibility mode.
>>> >>>
>>> >>> 5. Similar to 4, make our type coercion rule in DataFrame/SQL more
>>> standard compliant, and have a flag for backward compatibility.
>>> >>>
>>> >>> 6. Miscellaneous other small changes documented in JIRA already
>>> (e.g. “JavaPairRDD flatMapValues requires function returning Iterable, not
>>> Iterator”, “Prevent column name duplication in temporary view”).
>>> >>>
>>> >>>
>>> >>> Now the reality of a major version bump is that the world often
>>> thinks in terms of what exciting features are coming. I do think there are
>>> a number of major changes happening already that can be part of the 3.0
>>> release, if they make it in:
>>> >>>
>>> >>> 1. Scala 2.12 support (listing it twice)
>>> >>> 2. Continuous Processing non-experimental
>>> >>> 3. Kubernetes support non-experimental
>>> >>> 4. A more flushed out version of data source API v2 (I don’t think
>>> it is realistic to stabilize that in one release)
>>> >>> 5. Hadoop 3.0 support
>>> >>> 6. ...
>>> >>>
>>> >>>
>>> >>>
>>> >>> Similar to the 2.0 discussion, this thread should focus on the
>>> framework and whether it’d make sense to create Spark 3.0 as the next
>>> release, rather than the individual feature requests. Those are important
>>> but are best done in their own separate threads.
>>> >>>
>>> >>>
>>> >>>
>>> >>>
>>>
>>
>
>
> --
> Regards,
> Vaquar Khan
> +1 -224-436-0783
> Greater Chicago
>



-- 
Regards,
Vaquar Khan
+1 -224-436-0783
Greater Chicago

Re: time for Apache Spark 3.0?

Posted by vaquar khan <va...@gmail.com>.
+1

https://www.youtube.com/watch?v=-ik7aJ5U6kg

Regards,
Vaquar khan

On Fri, Jun 15, 2018 at 4:55 PM, Reynold Xin <rx...@databricks.com> wrote:

> Yes. At this rate I think it's better to do 2.4 next, followed by 3.0.
>
>
> On Fri, Jun 15, 2018 at 10:52 AM Mridul Muralidharan <mr...@gmail.com>
> wrote:
>
>> I agree, I dont see pressing need for major version bump as well.
>>
>>
>> Regards,
>> Mridul
>> On Fri, Jun 15, 2018 at 10:25 AM Mark Hamstra <ma...@clearstorydata.com>
>> wrote:
>> >
>> > Changing major version numbers is not about new features or a vague
>> notion that it is time to do something that will be seen to be a
>> significant release. It is about breaking stable public APIs.
>> >
>> > I still remain unconvinced that the next version can't be 2.4.0.
>> >
>> > On Fri, Jun 15, 2018 at 1:34 AM Andy <an...@gmail.com> wrote:
>> >>
>> >> Dear all:
>> >>
>> >> It have been 2 months since this topic being proposed. Any progress
>> now? 2018 has been passed about 1/2.
>> >>
>> >> I agree with that the new version should be some exciting new feature.
>> How about this one:
>> >>
>> >> 6. ML/DL framework to be integrated as core component and feature.
>> (Such as Angel / BigDL / ……)
>> >>
>> >> 3.0 is a very important version for an good open source project. It
>> should be better to drift away the historical burden and focus in new area.
>> Spark has been widely used all over the world as a successful big data
>> framework. And it can be better than that.
>> >>
>> >> Andy
>> >>
>> >>
>> >> On Thu, Apr 5, 2018 at 7:20 AM Reynold Xin <rx...@databricks.com>
>> wrote:
>> >>>
>> >>> There was a discussion thread on scala-contributors about Apache
>> Spark not yet supporting Scala 2.12, and that got me to think perhaps it is
>> about time for Spark to work towards the 3.0 release. By the time it comes
>> out, it will be more than 2 years since Spark 2.0.
>> >>>
>> >>> For contributors less familiar with Spark’s history, I want to give
>> more context on Spark releases:
>> >>>
>> >>> 1. Timeline: Spark 1.0 was released May 2014. Spark 2.0 was July
>> 2016. If we were to maintain the ~ 2 year cadence, it is time to work on
>> Spark 3.0 in 2018.
>> >>>
>> >>> 2. Spark’s versioning policy promises that Spark does not break
>> stable APIs in feature releases (e.g. 2.1, 2.2). API breaking changes are
>> sometimes a necessary evil, and can be done in major releases (e.g. 1.6 to
>> 2.0, 2.x to 3.0).
>> >>>
>> >>> 3. That said, a major version isn’t necessarily the playground for
>> disruptive API changes to make it painful for users to update. The main
>> purpose of a major release is an opportunity to fix things that are broken
>> in the current API and remove certain deprecated APIs.
>> >>>
>> >>> 4. Spark as a project has a culture of evolving architecture and
>> developing major new features incrementally, so major releases are not the
>> only time for exciting new features. For example, the bulk of the work in
>> the move towards the DataFrame API was done in Spark 1.3, and Continuous
>> Processing was introduced in Spark 2.3. Both were feature releases rather
>> than major releases.
>> >>>
>> >>>
>> >>> You can find more background in the thread discussing Spark 2.0:
>> http://apache-spark-developers-list.1001551.n3.nabble.com/A-proposal-for-
>> Spark-2-0-td15122.html
>> >>>
>> >>>
>> >>> The primary motivating factor IMO for a major version bump is to
>> support Scala 2.12, which requires minor API breaking changes to Spark’s
>> APIs. Similar to Spark 2.0, I think there are also opportunities for other
>> changes that we know have been biting us for a long time but can’t be
>> changed in feature releases (to be clear, I’m actually not sure they are
>> all good ideas, but I’m writing them down as candidates for consideration):
>> >>>
>> >>> 1. Support Scala 2.12.
>> >>>
>> >>> 2. Remove interfaces, configs, and modules (e.g. Bagel) deprecated in
>> Spark 2.x.
>> >>>
>> >>> 3. Shade all dependencies.
>> >>>
>> >>> 4. Change the reserved keywords in Spark SQL to be more ANSI-SQL
>> compliant, to prevent users from shooting themselves in the foot, e.g.
>> “SELECT 2 SECOND” -- is “SECOND” an interval unit or an alias? To make it
>> less painful for users to upgrade here, I’d suggest creating a flag for
>> backward compatibility mode.
>> >>>
>> >>> 5. Similar to 4, make our type coercion rule in DataFrame/SQL more
>> standard compliant, and have a flag for backward compatibility.
>> >>>
>> >>> 6. Miscellaneous other small changes documented in JIRA already (e.g.
>> “JavaPairRDD flatMapValues requires function returning Iterable, not
>> Iterator”, “Prevent column name duplication in temporary view”).
>> >>>
>> >>>
>> >>> Now the reality of a major version bump is that the world often
>> thinks in terms of what exciting features are coming. I do think there are
>> a number of major changes happening already that can be part of the 3.0
>> release, if they make it in:
>> >>>
>> >>> 1. Scala 2.12 support (listing it twice)
>> >>> 2. Continuous Processing non-experimental
>> >>> 3. Kubernetes support non-experimental
>> >>> 4. A more flushed out version of data source API v2 (I don’t think it
>> is realistic to stabilize that in one release)
>> >>> 5. Hadoop 3.0 support
>> >>> 6. ...
>> >>>
>> >>>
>> >>>
>> >>> Similar to the 2.0 discussion, this thread should focus on the
>> framework and whether it’d make sense to create Spark 3.0 as the next
>> release, rather than the individual feature requests. Those are important
>> but are best done in their own separate threads.
>> >>>
>> >>>
>> >>>
>> >>>
>>
>


-- 
Regards,
Vaquar Khan
+1 -224-436-0783
Greater Chicago

Re: time for Apache Spark 3.0?

Posted by Xiao Li <ga...@gmail.com>.
+1

2018-06-15 14:55 GMT-07:00 Reynold Xin <rx...@databricks.com>:

> Yes. At this rate I think it's better to do 2.4 next, followed by 3.0.
>
>
> On Fri, Jun 15, 2018 at 10:52 AM Mridul Muralidharan <mr...@gmail.com>
> wrote:
>
>> I agree, I dont see pressing need for major version bump as well.
>>
>>
>> Regards,
>> Mridul
>> On Fri, Jun 15, 2018 at 10:25 AM Mark Hamstra <ma...@clearstorydata.com>
>> wrote:
>> >
>> > Changing major version numbers is not about new features or a vague
>> notion that it is time to do something that will be seen to be a
>> significant release. It is about breaking stable public APIs.
>> >
>> > I still remain unconvinced that the next version can't be 2.4.0.
>> >
>> > On Fri, Jun 15, 2018 at 1:34 AM Andy <an...@gmail.com> wrote:
>> >>
>> >> Dear all:
>> >>
>> >> It have been 2 months since this topic being proposed. Any progress
>> now? 2018 has been passed about 1/2.
>> >>
>> >> I agree with that the new version should be some exciting new feature.
>> How about this one:
>> >>
>> >> 6. ML/DL framework to be integrated as core component and feature.
>> (Such as Angel / BigDL / ……)
>> >>
>> >> 3.0 is a very important version for an good open source project. It
>> should be better to drift away the historical burden and focus in new area.
>> Spark has been widely used all over the world as a successful big data
>> framework. And it can be better than that.
>> >>
>> >> Andy
>> >>
>> >>
>> >> On Thu, Apr 5, 2018 at 7:20 AM Reynold Xin <rx...@databricks.com>
>> wrote:
>> >>>
>> >>> There was a discussion thread on scala-contributors about Apache
>> Spark not yet supporting Scala 2.12, and that got me to think perhaps it is
>> about time for Spark to work towards the 3.0 release. By the time it comes
>> out, it will be more than 2 years since Spark 2.0.
>> >>>
>> >>> For contributors less familiar with Spark’s history, I want to give
>> more context on Spark releases:
>> >>>
>> >>> 1. Timeline: Spark 1.0 was released May 2014. Spark 2.0 was July
>> 2016. If we were to maintain the ~ 2 year cadence, it is time to work on
>> Spark 3.0 in 2018.
>> >>>
>> >>> 2. Spark’s versioning policy promises that Spark does not break
>> stable APIs in feature releases (e.g. 2.1, 2.2). API breaking changes are
>> sometimes a necessary evil, and can be done in major releases (e.g. 1.6 to
>> 2.0, 2.x to 3.0).
>> >>>
>> >>> 3. That said, a major version isn’t necessarily the playground for
>> disruptive API changes to make it painful for users to update. The main
>> purpose of a major release is an opportunity to fix things that are broken
>> in the current API and remove certain deprecated APIs.
>> >>>
>> >>> 4. Spark as a project has a culture of evolving architecture and
>> developing major new features incrementally, so major releases are not the
>> only time for exciting new features. For example, the bulk of the work in
>> the move towards the DataFrame API was done in Spark 1.3, and Continuous
>> Processing was introduced in Spark 2.3. Both were feature releases rather
>> than major releases.
>> >>>
>> >>>
>> >>> You can find more background in the thread discussing Spark 2.0:
>> http://apache-spark-developers-list.1001551.n3.nabble.com/A-proposal-for-
>> Spark-2-0-td15122.html
>> >>>
>> >>>
>> >>> The primary motivating factor IMO for a major version bump is to
>> support Scala 2.12, which requires minor API breaking changes to Spark’s
>> APIs. Similar to Spark 2.0, I think there are also opportunities for other
>> changes that we know have been biting us for a long time but can’t be
>> changed in feature releases (to be clear, I’m actually not sure they are
>> all good ideas, but I’m writing them down as candidates for consideration):
>> >>>
>> >>> 1. Support Scala 2.12.
>> >>>
>> >>> 2. Remove interfaces, configs, and modules (e.g. Bagel) deprecated in
>> Spark 2.x.
>> >>>
>> >>> 3. Shade all dependencies.
>> >>>
>> >>> 4. Change the reserved keywords in Spark SQL to be more ANSI-SQL
>> compliant, to prevent users from shooting themselves in the foot, e.g.
>> “SELECT 2 SECOND” -- is “SECOND” an interval unit or an alias? To make it
>> less painful for users to upgrade here, I’d suggest creating a flag for
>> backward compatibility mode.
>> >>>
>> >>> 5. Similar to 4, make our type coercion rule in DataFrame/SQL more
>> standard compliant, and have a flag for backward compatibility.
>> >>>
>> >>> 6. Miscellaneous other small changes documented in JIRA already (e.g.
>> “JavaPairRDD flatMapValues requires function returning Iterable, not
>> Iterator”, “Prevent column name duplication in temporary view”).
>> >>>
>> >>>
>> >>> Now the reality of a major version bump is that the world often
>> thinks in terms of what exciting features are coming. I do think there are
>> a number of major changes happening already that can be part of the 3.0
>> release, if they make it in:
>> >>>
>> >>> 1. Scala 2.12 support (listing it twice)
>> >>> 2. Continuous Processing non-experimental
>> >>> 3. Kubernetes support non-experimental
>> >>> 4. A more flushed out version of data source API v2 (I don’t think it
>> is realistic to stabilize that in one release)
>> >>> 5. Hadoop 3.0 support
>> >>> 6. ...
>> >>>
>> >>>
>> >>>
>> >>> Similar to the 2.0 discussion, this thread should focus on the
>> framework and whether it’d make sense to create Spark 3.0 as the next
>> release, rather than the individual feature requests. Those are important
>> but are best done in their own separate threads.
>> >>>
>> >>>
>> >>>
>> >>>
>>
>

Re: time for Apache Spark 3.0?

Posted by Reynold Xin <rx...@databricks.com>.
Yes. At this rate I think it's better to do 2.4 next, followed by 3.0.


On Fri, Jun 15, 2018 at 10:52 AM Mridul Muralidharan <mr...@gmail.com>
wrote:

> I agree, I dont see pressing need for major version bump as well.
>
>
> Regards,
> Mridul
> On Fri, Jun 15, 2018 at 10:25 AM Mark Hamstra <ma...@clearstorydata.com>
> wrote:
> >
> > Changing major version numbers is not about new features or a vague
> notion that it is time to do something that will be seen to be a
> significant release. It is about breaking stable public APIs.
> >
> > I still remain unconvinced that the next version can't be 2.4.0.
> >
> > On Fri, Jun 15, 2018 at 1:34 AM Andy <an...@gmail.com> wrote:
> >>
> >> Dear all:
> >>
> >> It have been 2 months since this topic being proposed. Any progress
> now? 2018 has been passed about 1/2.
> >>
> >> I agree with that the new version should be some exciting new feature.
> How about this one:
> >>
> >> 6. ML/DL framework to be integrated as core component and feature.
> (Such as Angel / BigDL / ……)
> >>
> >> 3.0 is a very important version for an good open source project. It
> should be better to drift away the historical burden and focus in new area.
> Spark has been widely used all over the world as a successful big data
> framework. And it can be better than that.
> >>
> >> Andy
> >>
> >>
> >> On Thu, Apr 5, 2018 at 7:20 AM Reynold Xin <rx...@databricks.com> wrote:
> >>>
> >>> There was a discussion thread on scala-contributors about Apache Spark
> not yet supporting Scala 2.12, and that got me to think perhaps it is about
> time for Spark to work towards the 3.0 release. By the time it comes out,
> it will be more than 2 years since Spark 2.0.
> >>>
> >>> For contributors less familiar with Spark’s history, I want to give
> more context on Spark releases:
> >>>
> >>> 1. Timeline: Spark 1.0 was released May 2014. Spark 2.0 was July 2016.
> If we were to maintain the ~ 2 year cadence, it is time to work on Spark
> 3.0 in 2018.
> >>>
> >>> 2. Spark’s versioning policy promises that Spark does not break stable
> APIs in feature releases (e.g. 2.1, 2.2). API breaking changes are
> sometimes a necessary evil, and can be done in major releases (e.g. 1.6 to
> 2.0, 2.x to 3.0).
> >>>
> >>> 3. That said, a major version isn’t necessarily the playground for
> disruptive API changes to make it painful for users to update. The main
> purpose of a major release is an opportunity to fix things that are broken
> in the current API and remove certain deprecated APIs.
> >>>
> >>> 4. Spark as a project has a culture of evolving architecture and
> developing major new features incrementally, so major releases are not the
> only time for exciting new features. For example, the bulk of the work in
> the move towards the DataFrame API was done in Spark 1.3, and Continuous
> Processing was introduced in Spark 2.3. Both were feature releases rather
> than major releases.
> >>>
> >>>
> >>> You can find more background in the thread discussing Spark 2.0:
> http://apache-spark-developers-list.1001551.n3.nabble.com/A-proposal-for-Spark-2-0-td15122.html
> >>>
> >>>
> >>> The primary motivating factor IMO for a major version bump is to
> support Scala 2.12, which requires minor API breaking changes to Spark’s
> APIs. Similar to Spark 2.0, I think there are also opportunities for other
> changes that we know have been biting us for a long time but can’t be
> changed in feature releases (to be clear, I’m actually not sure they are
> all good ideas, but I’m writing them down as candidates for consideration):
> >>>
> >>> 1. Support Scala 2.12.
> >>>
> >>> 2. Remove interfaces, configs, and modules (e.g. Bagel) deprecated in
> Spark 2.x.
> >>>
> >>> 3. Shade all dependencies.
> >>>
> >>> 4. Change the reserved keywords in Spark SQL to be more ANSI-SQL
> compliant, to prevent users from shooting themselves in the foot, e.g.
> “SELECT 2 SECOND” -- is “SECOND” an interval unit or an alias? To make it
> less painful for users to upgrade here, I’d suggest creating a flag for
> backward compatibility mode.
> >>>
> >>> 5. Similar to 4, make our type coercion rule in DataFrame/SQL more
> standard compliant, and have a flag for backward compatibility.
> >>>
> >>> 6. Miscellaneous other small changes documented in JIRA already (e.g.
> “JavaPairRDD flatMapValues requires function returning Iterable, not
> Iterator”, “Prevent column name duplication in temporary view”).
> >>>
> >>>
> >>> Now the reality of a major version bump is that the world often thinks
> in terms of what exciting features are coming. I do think there are a
> number of major changes happening already that can be part of the 3.0
> release, if they make it in:
> >>>
> >>> 1. Scala 2.12 support (listing it twice)
> >>> 2. Continuous Processing non-experimental
> >>> 3. Kubernetes support non-experimental
> >>> 4. A more flushed out version of data source API v2 (I don’t think it
> is realistic to stabilize that in one release)
> >>> 5. Hadoop 3.0 support
> >>> 6. ...
> >>>
> >>>
> >>>
> >>> Similar to the 2.0 discussion, this thread should focus on the
> framework and whether it’d make sense to create Spark 3.0 as the next
> release, rather than the individual feature requests. Those are important
> but are best done in their own separate threads.
> >>>
> >>>
> >>>
> >>>
>

Re: time for Apache Spark 3.0?

Posted by Mridul Muralidharan <mr...@gmail.com>.
I agree; I don't see a pressing need for a major version bump either.


Regards,
Mridul
On Fri, Jun 15, 2018 at 10:25 AM Mark Hamstra <ma...@clearstorydata.com> wrote:
>
> Changing major version numbers is not about new features or a vague notion that it is time to do something that will be seen to be a significant release. It is about breaking stable public APIs.
>
> I still remain unconvinced that the next version can't be 2.4.0.
>
> On Fri, Jun 15, 2018 at 1:34 AM Andy <an...@gmail.com> wrote:
>>
>> Dear all:
>>
>> It have been 2 months since this topic being proposed. Any progress now? 2018 has been passed about 1/2.
>>
>> I agree with that the new version should be some exciting new feature. How about this one:
>>
>> 6. ML/DL framework to be integrated as core component and feature. (Such as Angel / BigDL / ……)
>>
>> 3.0 is a very important version for an good open source project. It should be better to drift away the historical burden and focus in new area. Spark has been widely used all over the world as a successful big data framework. And it can be better than that.
>>
>> Andy
>>
>>
>> On Thu, Apr 5, 2018 at 7:20 AM Reynold Xin <rx...@databricks.com> wrote:
>>>
>>> There was a discussion thread on scala-contributors about Apache Spark not yet supporting Scala 2.12, and that got me to think perhaps it is about time for Spark to work towards the 3.0 release. By the time it comes out, it will be more than 2 years since Spark 2.0.
>>>
>>> For contributors less familiar with Spark’s history, I want to give more context on Spark releases:
>>>
>>> 1. Timeline: Spark 1.0 was released May 2014. Spark 2.0 was July 2016. If we were to maintain the ~ 2 year cadence, it is time to work on Spark 3.0 in 2018.
>>>
>>> 2. Spark’s versioning policy promises that Spark does not break stable APIs in feature releases (e.g. 2.1, 2.2). API breaking changes are sometimes a necessary evil, and can be done in major releases (e.g. 1.6 to 2.0, 2.x to 3.0).
>>>
>>> 3. That said, a major version isn’t necessarily the playground for disruptive API changes to make it painful for users to update. The main purpose of a major release is an opportunity to fix things that are broken in the current API and remove certain deprecated APIs.
>>>
>>> 4. Spark as a project has a culture of evolving architecture and developing major new features incrementally, so major releases are not the only time for exciting new features. For example, the bulk of the work in the move towards the DataFrame API was done in Spark 1.3, and Continuous Processing was introduced in Spark 2.3. Both were feature releases rather than major releases.
>>>
>>>
>>> You can find more background in the thread discussing Spark 2.0: http://apache-spark-developers-list.1001551.n3.nabble.com/A-proposal-for-Spark-2-0-td15122.html
>>>
>>>
>>> The primary motivating factor IMO for a major version bump is to support Scala 2.12, which requires minor API breaking changes to Spark’s APIs. Similar to Spark 2.0, I think there are also opportunities for other changes that we know have been biting us for a long time but can’t be changed in feature releases (to be clear, I’m actually not sure they are all good ideas, but I’m writing them down as candidates for consideration):
>>>
>>> 1. Support Scala 2.12.
>>>
>>> 2. Remove interfaces, configs, and modules (e.g. Bagel) deprecated in Spark 2.x.
>>>
>>> 3. Shade all dependencies.
>>>
>>> 4. Change the reserved keywords in Spark SQL to be more ANSI-SQL compliant, to prevent users from shooting themselves in the foot, e.g. “SELECT 2 SECOND” -- is “SECOND” an interval unit or an alias? To make it less painful for users to upgrade here, I’d suggest creating a flag for backward compatibility mode.
>>>
>>> 5. Similar to 4, make our type coercion rule in DataFrame/SQL more standard compliant, and have a flag for backward compatibility.
>>>
>>> 6. Miscellaneous other small changes documented in JIRA already (e.g. “JavaPairRDD flatMapValues requires function returning Iterable, not Iterator”, “Prevent column name duplication in temporary view”).
>>>
>>>
>>> Now the reality of a major version bump is that the world often thinks in terms of what exciting features are coming. I do think there are a number of major changes happening already that can be part of the 3.0 release, if they make it in:
>>>
>>> 1. Scala 2.12 support (listing it twice)
>>> 2. Continuous Processing non-experimental
>>> 3. Kubernetes support non-experimental
>>> 4. A more flushed out version of data source API v2 (I don’t think it is realistic to stabilize that in one release)
>>> 5. Hadoop 3.0 support
>>> 6. ...
>>>
>>>
>>>
>>> Similar to the 2.0 discussion, this thread should focus on the framework and whether it’d make sense to create Spark 3.0 as the next release, rather than the individual feature requests. Those are important but are best done in their own separate threads.
>>>
>>>
>>>
>>>



Re: time for Apache Spark 3.0?

Posted by Mark Hamstra <ma...@clearstorydata.com>.
Changing major version numbers is not about new features or a vague notion
that it is time to do something that will be seen as a significant release.
It is about breaking stable public APIs.

I remain unconvinced that the next version can't be 2.4.0.
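
To make the point about stable public APIs concrete, here is a minimal
sketch in Scala; the Sink traits below are hypothetical, not actual Spark
interfaces, and only illustrate which kinds of change semantic versioning
allows in a 2.x feature release versus a 3.0 release:

    // Hypothetical stable API shipped in some 2.x release.
    trait Sink {
      def write(batch: Seq[String]): Unit
    }

    // Feature-release-friendly evolution: add capability in a new type and
    // leave the existing signature untouched, so current callers and
    // implementations keep compiling and linking.
    trait RetryingSink extends Sink {
      def writeWithRetries(batch: Seq[String], maxRetries: Int): Unit = {
        var remaining = maxRetries
        var done = false
        while (!done) {
          try { write(batch); done = true }
          catch { case _: Exception if remaining > 0 => remaining -= 1 }
        }
      }
    }

    // Major-release-only change: altering the existing method's signature
    // breaks every implementation and caller of the old API.
    trait BreakingSink {
      def write(batch: Iterator[String]): Unit // was Seq[String]
    }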

On Fri, Jun 15, 2018 at 1:34 AM Andy <an...@gmail.com> wrote:

> *Dear all:*
>
> It have been 2 months since this topic being proposed. Any progress now?
> 2018 has been passed about 1/2.
>
> I agree with that the new version should be some exciting new feature. How
> about this one:
>
> *6. ML/DL framework to be integrated as core component and feature. (Such
> as Angel / BigDL / ……)*
>
> 3.0 is a very important version for an good open source project. It should
> be better to drift away the historical burden and *focus in new area*.
> Spark has been widely used all over the world as a successful big data
> framework. And it can be better than that.
>
>
> *Andy*
>
>
> On Thu, Apr 5, 2018 at 7:20 AM Reynold Xin <rx...@databricks.com> wrote:
>
>> There was a discussion thread on scala-contributors
>> <https://contributors.scala-lang.org/t/spark-as-a-scala-gateway-drug-and-the-2-12-failure/1747>
>> about Apache Spark not yet supporting Scala 2.12, and that got me to think
>> perhaps it is about time for Spark to work towards the 3.0 release. By the
>> time it comes out, it will be more than 2 years since Spark 2.0.
>>
>> For contributors less familiar with Spark’s history, I want to give more
>> context on Spark releases:
>>
>> 1. Timeline: Spark 1.0 was released May 2014. Spark 2.0 was July 2016. If
>> we were to maintain the ~ 2 year cadence, it is time to work on Spark 3.0
>> in 2018.
>>
>> 2. Spark’s versioning policy promises that Spark does not break stable
>> APIs in feature releases (e.g. 2.1, 2.2). API breaking changes are
>> sometimes a necessary evil, and can be done in major releases (e.g. 1.6 to
>> 2.0, 2.x to 3.0).
>>
>> 3. That said, a major version isn’t necessarily the playground for
>> disruptive API changes to make it painful for users to update. The main
>> purpose of a major release is an opportunity to fix things that are broken
>> in the current API and remove certain deprecated APIs.
>>
>> 4. Spark as a project has a culture of evolving architecture and
>> developing major new features incrementally, so major releases are not the
>> only time for exciting new features. For example, the bulk of the work in
>> the move towards the DataFrame API was done in Spark 1.3, and Continuous
>> Processing was introduced in Spark 2.3. Both were feature releases rather
>> than major releases.
>>
>>
>> You can find more background in the thread discussing Spark 2.0:
>> http://apache-spark-developers-list.1001551.n3.nabble.com/A-proposal-for-Spark-2-0-td15122.html
>>
>>
>> The primary motivating factor IMO for a major version bump is to support
>> Scala 2.12, which requires minor API breaking changes to Spark’s APIs.
>> Similar to Spark 2.0, I think there are also opportunities for other
>> changes that we know have been biting us for a long time but can’t be
>> changed in feature releases (to be clear, I’m actually not sure they are
>> all good ideas, but I’m writing them down as candidates for consideration):
>>
>> 1. Support Scala 2.12.
>>
>> 2. Remove interfaces, configs, and modules (e.g. Bagel) deprecated in
>> Spark 2.x.
>>
>> 3. Shade all dependencies.
>>
>> 4. Change the reserved keywords in Spark SQL to be more ANSI-SQL
>> compliant, to prevent users from shooting themselves in the foot, e.g.
>> “SELECT 2 SECOND” -- is “SECOND” an interval unit or an alias? To make it
>> less painful for users to upgrade here, I’d suggest creating a flag for
>> backward compatibility mode.
>>
>> 5. Similar to 4, make our type coercion rule in DataFrame/SQL more
>> standard compliant, and have a flag for backward compatibility.
>>
>> 6. Miscellaneous other small changes documented in JIRA already (e.g.
>> “JavaPairRDD flatMapValues requires function returning Iterable, not
>> Iterator”, “Prevent column name duplication in temporary view”).
>>
>>
>> Now the reality of a major version bump is that the world often thinks in
>> terms of what exciting features are coming. I do think there are a number
>> of major changes happening already that can be part of the 3.0 release, if
>> they make it in:
>>
>> 1. Scala 2.12 support (listing it twice)
>> 2. Continuous Processing non-experimental
>> 3. Kubernetes support non-experimental
>> 4. A more flushed out version of data source API v2 (I don’t think it is
>> realistic to stabilize that in one release)
>> 5. Hadoop 3.0 support
>> 6. ...
>>
>>
>>
>> Similar to the 2.0 discussion, this thread should focus on the framework
>> and whether it’d make sense to create Spark 3.0 as the next release, rather
>> than the individual feature requests. Those are important but are best done
>> in their own separate threads.
>>
>>
>>
>>
>>

Re: time for Apache Spark 3.0?

Posted by Andy <an...@gmail.com>.
*Dear all:*

It has been 2 months since this topic was proposed. Any progress so far?
2018 is already about half over.

I agree that the new version should bring some exciting new features. How
about this one:

*6. ML/DL frameworks to be integrated as core components and features.
(Such as Angel / BigDL / ……)*

3.0 is a very important version for a good open source project. It would
be better to shed the historical burden and *focus on new areas*. Spark
has been widely used all over the world as a successful big data
framework, and it can be better than that.


*Andy*
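
As a rough illustration of the glue code such an integration could replace,
here is a minimal sketch of how an externally trained model is typically
applied on Spark today; SimpleModel is a generic stand-in, not an actual
Angel or BigDL API:

    import org.apache.spark.sql.SparkSession

    // Generic stand-in for an externally trained model.
    case class SimpleModel(weights: Array[Double]) {
      def score(features: Array[Double]): Double =
        weights.zip(features).map { case (w, x) => w * x }.sum
    }

    object ExternalModelScoring {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder()
          .master("local[*]")
          .appName("external-model-scoring")
          .getOrCreate()
        val sc = spark.sparkContext

        // Ship the model to executors once and score records in parallel.
        val model = sc.broadcast(SimpleModel(Array(0.5, -0.25, 1.0)))
        val features = sc.parallelize(Seq(Array(1.0, 2.0, 3.0), Array(0.0, 1.0, 0.5)))
        features.map(f => model.value.score(f)).collect().foreach(println)

        spark.stop()
      }
    }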


On Thu, Apr 5, 2018 at 7:20 AM Reynold Xin <rx...@databricks.com> wrote:

> There was a discussion thread on scala-contributors
> <https://contributors.scala-lang.org/t/spark-as-a-scala-gateway-drug-and-the-2-12-failure/1747>
> about Apache Spark not yet supporting Scala 2.12, and that got me to think
> perhaps it is about time for Spark to work towards the 3.0 release. By the
> time it comes out, it will be more than 2 years since Spark 2.0.
>
> For contributors less familiar with Spark’s history, I want to give more
> context on Spark releases:
>
> 1. Timeline: Spark 1.0 was released May 2014. Spark 2.0 was July 2016. If
> we were to maintain the ~ 2 year cadence, it is time to work on Spark 3.0
> in 2018.
>
> 2. Spark’s versioning policy promises that Spark does not break stable
> APIs in feature releases (e.g. 2.1, 2.2). API breaking changes are
> sometimes a necessary evil, and can be done in major releases (e.g. 1.6 to
> 2.0, 2.x to 3.0).
>
> 3. That said, a major version isn’t necessarily the playground for
> disruptive API changes to make it painful for users to update. The main
> purpose of a major release is an opportunity to fix things that are broken
> in the current API and remove certain deprecated APIs.
>
> 4. Spark as a project has a culture of evolving architecture and
> developing major new features incrementally, so major releases are not the
> only time for exciting new features. For example, the bulk of the work in
> the move towards the DataFrame API was done in Spark 1.3, and Continuous
> Processing was introduced in Spark 2.3. Both were feature releases rather
> than major releases.
>
>
> You can find more background in the thread discussing Spark 2.0:
> http://apache-spark-developers-list.1001551.n3.nabble.com/A-proposal-for-Spark-2-0-td15122.html
>
>
> The primary motivating factor IMO for a major version bump is to support
> Scala 2.12, which requires minor API breaking changes to Spark’s APIs.
> Similar to Spark 2.0, I think there are also opportunities for other
> changes that we know have been biting us for a long time but can’t be
> changed in feature releases (to be clear, I’m actually not sure they are
> all good ideas, but I’m writing them down as candidates for consideration):
>
> 1. Support Scala 2.12.
>
> 2. Remove interfaces, configs, and modules (e.g. Bagel) deprecated in
> Spark 2.x.
>
> 3. Shade all dependencies.
>
> 4. Change the reserved keywords in Spark SQL to be more ANSI-SQL
> compliant, to prevent users from shooting themselves in the foot, e.g.
> “SELECT 2 SECOND” -- is “SECOND” an interval unit or an alias? To make it
> less painful for users to upgrade here, I’d suggest creating a flag for
> backward compatibility mode.
>
> 5. Similar to 4, make our type coercion rule in DataFrame/SQL more
> standard compliant, and have a flag for backward compatibility.
>
> 6. Miscellaneous other small changes documented in JIRA already (e.g.
> “JavaPairRDD flatMapValues requires function returning Iterable, not
> Iterator”, “Prevent column name duplication in temporary view”).
>
>
> Now the reality of a major version bump is that the world often thinks in
> terms of what exciting features are coming. I do think there are a number
> of major changes happening already that can be part of the 3.0 release, if
> they make it in:
>
> 1. Scala 2.12 support (listing it twice)
> 2. Continuous Processing non-experimental
> 3. Kubernetes support non-experimental
> 4. A more flushed out version of data source API v2 (I don’t think it is
> realistic to stabilize that in one release)
> 5. Hadoop 3.0 support
> 6. ...
>
>
>
> Similar to the 2.0 discussion, this thread should focus on the framework
> and whether it’d make sense to create Spark 3.0 as the next release, rather
> than the individual feature requests. Those are important but are best done
> in their own separate threads.
>
>
>
>
>