Posted to dev@spark.apache.org by Ryan Blue <rb...@netflix.com.INVALID> on 2019/09/20 17:47:39 UTC

[DISCUSS] Spark 2.5 release

Hi everyone,

In the DSv2 sync this week, we talked about a possible Spark 2.5 release
based on the latest Spark 2.4, but with DSv2 and Java 11 support added.

A Spark 2.5 release with these two additions will help people migrate to
Spark 3.0 when it is released because they will be able to use a single
implementation for DSv2 sources that works in both 2.5 and 3.0. Similarly,
upgrading to 3.0 won't require also updating to Java 11 because users
could update to Java 11 with the 2.5 release and have fewer major changes.

Another reason to consider a 2.5 release is that many people are interested
in a release with the latest DSv2 API and support for DSv2 SQL. I'm already
going to be backporting DSv2 support to the Spark 2.4 line, so it makes
sense to share this work with the community.

This release line would just consist of backports like DSv2 and Java 11
that assist compatibility, to keep the scope of the release small. The
purpose is to assist people moving to 3.0 and not distract from the 3.0
release.

Would a Spark 2.5 release help anyone else? Are there any concerns about
this plan?


rb


-- 
Ryan Blue
Software Engineer
Netflix

Re: [DISCUSS] Spark 2.5 release

Posted by Ryan Blue <rb...@netflix.com.INVALID>.
I didn't realize that Java 11 would require breaking changes. What breaking
changes are required?

On Fri, Sep 20, 2019 at 11:18 AM Sean Owen <sr...@gmail.com> wrote:

> Narrowly on Java 11: the problem is that it'll take some breaking
> changes, more than would usually be appropriate in a minor release, I
> think. I'm still not convinced there is a burning need to use Java 11
> while staying on 2.4 once 3.0 is out, and at least the wheels are in
> motion there. Java 8 is still free and being updated.


-- 
Ryan Blue
Software Engineer
Netflix

Re: [DISCUSS] Spark 2.5 release

Posted by Sean Owen <sr...@gmail.com>.
Narrowly on Java 11: the problem is that it'll take some breaking
changes, more than would usually be appropriate in a minor release, I
think. I'm still not convinced there is a burning need to use Java 11
while staying on 2.4 once 3.0 is out, and at least the wheels are in
motion there. Java 8 is still free and being updated.



Re: [DISCUSS] Spark 2.5 release

Posted by Sean Owen <sr...@gmail.com>.
I don't know enough about DSv2 to comment on this part, but any
theoretical 2.5 is still a ways off. Does waiting for 3.0 to 'stabilize' it
as much as possible help?

I say that because, re: Java 11, the main breaking changes are probably the
Hive 2 / Hadoop 3 dependency, JPMML (minor), the general classloader
changes, and the handling of off-heap memory. These aren't big breaks, but
they are probably going to break some things. I think we'd want to see a
'proof of concept' branch to evaluate just how much has to change to get it
working, and that is why I think a 2.5 release would still need more
investigation.
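
(For context on the off-heap point: Spark historically reaches into JDK
internals to free direct buffers, and those internals moved in Java 9+.
Below is a hypothetical sketch of that kind of version-dependent coupling,
not Spark's actual code; the JDK class names are real, but
freeDirectBuffer is made up.)

import java.nio.ByteBuffer

// Hypothetical sketch: the cleaner class moved from sun.misc (Java 8) to
// jdk.internal.ref (Java 9+), so reflective code like this must handle
// both, and on Java 11 it also needs --add-opens/--add-exports flags.
def freeDirectBuffer(buffer: ByteBuffer): Unit = {
  val cleanerClass =
    try Class.forName("sun.misc.Cleaner") // Java 8
    catch {
      case _: ClassNotFoundException =>
        Class.forName("jdk.internal.ref.Cleaner") // Java 9+
    }
  // DirectByteBuffer is package-private, so go through reflection.
  val cleanerMethod = buffer.getClass.getMethod("cleaner")
  cleanerMethod.setAccessible(true)
  val cleaner = cleanerMethod.invoke(buffer)
  if (cleaner != null) cleanerClass.getMethod("clean").invoke(cleaner)
}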

On Fri, Sep 20, 2019 at 1:19 PM Ryan Blue <rb...@netflix.com.invalid> wrote:

> > DSv2 is far from stable right?
>
> No, I think it is reasonably stable and very close to being ready for a
> release.
>
> > All the actual data types are unstable and you guys have completely
> ignored that.
>
> I think what you're referring to is the use of `InternalRow`. That's a
> stable API and there has been no work to avoid using it. In any case, I
> don't think that anyone is suggesting that we delay 3.0 until a replacement
> for `InternalRow` is added, right?
>
> While I understand the motivation for a better solution here, I think the
> pragmatic solution is to continue using `InternalRow`.
>
> > If the goal is to make DSv2 work across 3.x and 2.x, that seems too
> invasive of a change to backport once you consider the parts needed to make
> dsv2 stable.
>
> I believe that those of us working on DSv2 are confident about the current
> stability. We set goals for what to get into the 3.0 release months ago and
> have very nearly reached the point where we are ready for that release.
>
> I don't think instability would be a problem in maintaining compatibility
> between the 2.5 version and the 3.0 version. If we find that we need to
> make API changes (other than additions) then we can make those in the 3.1
> release. The goals we set for the 3.0 release have been reached with the
> current API, so if we are ready to release 3.0, we can release a 2.5 with
> the same API.
>
> On Fri, Sep 20, 2019 at 11:05 AM Reynold Xin <rx...@databricks.com> wrote:
>
>> DSv2 is far from stable right? All the actual data types are unstable and
>> you guys have completely ignored that. We'd need to work on that and that
>> will be a breaking change. If the goal is to make DSv2 work across 3.x and
>> 2.x, that seems too invasive of a change to backport once you consider the
>> parts needed to make dsv2 stable.
>>
> --
> Ryan Blue
> Software Engineer
> Netflix
>

Re: [DISCUSS] Spark 2.5 release

Posted by Xiao Li <li...@databricks.com>.
+1 on Jungtaek's point. Can we revisit this when we release Spark 3.1?
After the release of 3.0, I believe we will get more feedback about DSv2
from the community. The current design was made by just a small group of
contributors. DSv2 + catalog APIs are still evolving. It is very likely we
will make more changes after the 3.0 release.

On Fri, Sep 20, 2019 at 9:27 PM Jungtaek Lim <ka...@gmail.com> wrote:

> small correction: confusion -> conflict, so I had to go through and
> understand parts of the changes
>
> On Sat, Sep 21, 2019 at 1:25 PM Jungtaek Lim <ka...@gmail.com> wrote:
>
>> Just 2 cents: I haven't tracked the changes to DSv2 (though I needed to
>> deal with them, as the changes caused conflicts on my PRs...), but my bet
>> is that DSv2 has already changed in an incompatible way, at least for
>> those who work on custom DataSources. Making downstream projects diverge
>> their implementations heavily between minor versions (say, 2.4 vs 2.5)
>> wouldn't be a good experience - especially since we have not completely
>> closed off the chance to further modify DSv2, and the changes could be
>> backward incompatible.
>>
>> If we really want to bring the DSv2 changes to the 2.x version line to
>> let end users enjoy the new DSv2 without being forced to upgrade to Spark
>> 3.x, I'd rather say preparation of Spark 2.5 should start after Spark 3.0
>> is officially released - honestly, even later than that, say, after
>> getting some reports about DSv2 from Spark 3.0 so that we feel DSv2 is
>> OK. I hope we don't make Spark 2.5 a kind of "tech-preview" that Spark
>> 2.4 users are frustrated to upgrade to as the next minor version.
>>
>> Btw, do we have any specific target users for this? Personally, I think
>> the DSv2 change would be the major backward incompatibility making Spark
>> 2.x users hesitate to upgrade, so if they are prepared to migrate to the
>> new DSv2, they might already be prepared to migrate to Spark 3.0.
>>
>> On Sat, Sep 21, 2019 at 12:46 PM Dongjoon Hyun <do...@gmail.com>
>> wrote:
>>
>>> Do you mean you want to have a breaking API change between 3.0 and 3.1?
>>> I believe we follow Semantic Versioning (
>>> https://spark.apache.org/versioning-policy.html ).
>>>
>>> > We just won’t add any breaking changes before 3.1.
>>>
>>> Bests,
>>> Dongjoon.
>>>
>>>
>>> On Fri, Sep 20, 2019 at 11:48 AM Ryan Blue <rb...@netflix.com.invalid>
>>> wrote:
>>>
>>>> I don’t think we need to gate a 3.0 release on making a more stable
>>>> version of InternalRow
>>>>
>>>> Sounds like we agree, then. We will use it for 3.0, but there are known
>>>> problems with it.
>>>>
>>>> Thinking we’d have dsv2 working in both 3.x (which will change and
>>>> progress towards more stable, but will have to break certain APIs) and 2.x
>>>> seems like a false premise.
>>>>
>>>> Why do you think we will need to break certain APIs before 3.0?
>>>>
>>>> I’m only suggesting that we release the same support in a 2.5 release
>>>> that we do in 3.0. Since we are nearly finished with the 3.0 goals, it
>>>> seems like we can certainly do that. We just won’t add any breaking changes
>>>> before 3.1.
>>>>
>>>> On Fri, Sep 20, 2019 at 11:39 AM Reynold Xin <rx...@databricks.com>
>>>> wrote:
>>>>
>>>>> I don't think we need to gate a 3.0 release on making a more stable
>>>>> version of InternalRow, but thinking we'd have dsv2 working in both 3.x
>>>>> (which will change and progress towards more stable, but will have to break
>>>>> certain APIs) and 2.x seems like a false premise.
>>>>>
>>>>> To point out some problems with InternalRow that you think are already
>>>>> pragmatic and stable:
>>>>>
>>>>> The class is in catalyst, which states:
>>>>> https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/package.scala
>>>>>
>>>>> /**
>>>>> * Catalyst is a library for manipulating relational query plans.  All
>>>>> classes in catalyst are
>>>>> * considered an internal API to Spark SQL and are subject to change
>>>>> between minor releases.
>>>>> */
>>>>>
>>>>> There is not even an annotation on the interface.
>>>>>
>>>>> The entire dependency chain was created to be private and tightly
>>>>> coupled with internal implementations. For example,
>>>>>
>>>>>
>>>>> https://github.com/apache/spark/blob/master/common/unsafe/src/main/java/org/apache/spark/unsafe/types/UTF8String.java
>>>>>
>>>>> /**
>>>>> * A UTF-8 String for internal Spark use.
>>>>> * <p>
>>>>> * A String encoded in UTF-8 as an Array[Byte], which can be used for
>>>>> comparison,
>>>>> * search, see http://en.wikipedia.org/wiki/UTF-8 for details.
>>>>> * <p>
>>>>> * Note: This is not designed for general use cases, should not be used
>>>>> outside SQL.
>>>>> */
>>>>>
>>>>>
>>>>>
>>>>> https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/ArrayData.scala
>>>>>
>>>>> (which again is in catalyst package)
>>>>>
>>>>>
>>>>> If you want to argue this way, you might as well argue we should make
>>>>> the entire catalyst package public to be pragmatic and not allow any
>>>>> changes.
>>>>>
>>>>>
>>>>>
>>>>>
>>>>> On Fri, Sep 20, 2019 at 11:32 AM, Ryan Blue <rb...@netflix.com> wrote:
>>>>>
>>>>>> When you created the PR to make InternalRow public
>>>>>>
>>>>>> This isn’t quite accurate. The change I made was to use InternalRow
>>>>>> instead of UnsafeRow, which is a specific implementation of
>>>>>> InternalRow. Exposing this API has always been a part of DSv2 and
>>>>>> while both you and I did some work to avoid this, we are still in the phase
>>>>>> of starting with that API.
>>>>>>
>>>>>> Note that any change to InternalRow would be very costly to
>>>>>> implement because this interface is widely used. That is why I think we can
>>>>>> certainly consider it stable enough to use here, and that’s probably why
>>>>>> UnsafeRow was part of the original proposal.
>>>>>>
>>>>>> In any case, the goal for 3.0 was not to replace the use of
>>>>>> InternalRow, it was to get the majority of SQL working on top of the
>>>>>> interface added after 2.4. That’s done and stable, so I think a 2.5 release
>>>>>> with it is also reasonable.
>>>>>>
>>>>>> On Fri, Sep 20, 2019 at 11:23 AM Reynold Xin <rx...@databricks.com>
>>>>>> wrote:
>>>>>>
>>>>>>> To push back, while I agree we should not drastically change
>>>>>>> "InternalRow", there are a lot of changes that need to happen to make it
>>>>>>> stable. For example, none of the publicly exposed interfaces should be in
>>>>>>> the Catalyst package or the unsafe package. External implementations should
>>>>>>> be decoupled from the internal implementations, with cheap ways to convert
>>>>>>> back and forth.
>>>>>>>
>>>>>>> When you created the PR to make InternalRow public, the
>>>>>>> understanding was to work towards making it stable in the future, assuming
>>>>>>> we would start with an unstable API temporarily. You can't just make a
>>>>>>> bunch of internal APIs tightly coupled with other internal pieces public
>>>>>>> and stable and call it a day, just because it happens to satisfy some use
>>>>>>> cases temporarily, assuming the rest of Spark doesn't change.
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>
>>>>>> --
>>>>>> Ryan Blue
>>>>>> Software Engineer
>>>>>> Netflix
>>>>>>
>>>>>
>>>>>
>>>>
>>>> --
>>>> Ryan Blue
>>>> Software Engineer
>>>> Netflix
>>>>
>>>
>>
>> --
>> Name : Jungtaek Lim
>> Blog : http://medium.com/@heartsavior
>> Twitter : http://twitter.com/heartsavior
>> LinkedIn : http://www.linkedin.com/in/heartsavior
>>
>
>
> --
> Name : Jungtaek Lim
> Blog : http://medium.com/@heartsavior
> Twitter : http://twitter.com/heartsavior
> LinkedIn : http://www.linkedin.com/in/heartsavior
>



Re: [DISCUSS] Spark 2.5 release

Posted by Jungtaek Lim <ka...@gmail.com>.
small correction: confusion -> conflict, so I had to go through and
understand parts of the changes



-- 
Name : Jungtaek Lim
Blog : http://medium.com/@heartsavior
Twitter : http://twitter.com/heartsavior
LinkedIn : http://www.linkedin.com/in/heartsavior

RE: [DISCUSS] Spark 2.5 release

Posted by JOAQUIN GUANTER GONZALBEZ <jo...@telefonica.com>.
I’ll chime in as an actual implementor of a custom DataSource who is keeping an eye on the 3.0 DSv2 changes.

We started implementing DSv2 in the 2.4 branch, but quickly discovered that the DSv2 in 3.0 was a complete breaking change (to the point where it could have been named DSv3 and it wouldn't have come as a surprise). Since the DSv2 in 3.0 has a compatibility layer for DSv1 datasources, we decided to fall back to DSv1 in order to ease the future transition to Spark 3.

From my point of view, a Spark 2.5 release with a backport of DSv2 _which does not remove the old 2.4 DSv2 classes_  would be ideal, since it would work as a stepping stone for both the current users of DSv1 and the 2.4 DSv2 classes.

I agree with Xiao that it is likely that the 3.0 DSv2 classes will need to incorporate feedback from the community once people start using them. I hope we aren't planning on marking them as Stable as soon as Spark 3.0 is released! They don't seem to have any InterfaceStability marker at the moment in master.
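
(For reference, the 2.x-era marker Ximo mentions looks roughly like this -
a minimal sketch where MyConnectorApi is a made-up trait, while the
annotation itself comes from org.apache.spark.annotation:)

import org.apache.spark.annotation.InterfaceStability

// Hypothetical trait; the annotation is the point. Evolving signals that
// an API is semi-stable and may still change between minor releases.
@InterfaceStability.Evolving
trait MyConnectorApi {
  def name: String
}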

Cheers,
Ximo

From: Ryan Blue <rb...@netflix.com.INVALID>
Sent: Wednesday, September 25, 2019 0:54
To: Jungtaek Lim <ka...@gmail.com>
CC: Dongjoon Hyun <do...@gmail.com>; Holden Karau <ho...@pigscanfly.ca>; Hyukjin Kwon <gu...@gmail.com>; Marco Gaido <ma...@gmail.com>; Matei Zaharia <ma...@gmail.com>; Reynold Xin <rx...@databricks.com>; Spark Dev List <de...@spark.apache.org>
Subject: Re: [DISCUSS] Spark 2.5 release

> That's not a new requirement, that's an "implicit" requirement via semantic versioning.

The expectation is that the DSv2 API will change in minor versions in the 2.x line. The API is marked with the Experimental API annotation to signal that it can change, and it has been changing.

A requirement to not change this API for a 2.5 release is a new requirement. I'm fine with that if that's what everyone wants. Like I said, if we want to add a requirement to not change this API then we shouldn't release the 2.5 that I'm proposing.

On Tue, Sep 24, 2019 at 2:51 PM Jungtaek Lim <ka...@gmail.com> wrote:
>> Apache Spark 2.4.x and 2.5.x DSv2 should be compatible.

> This has not been a requirement for DSv2 development so far. If this is a new requirement, then we should not do a 2.5 release.

My 2 cents: the target version of the new DSv2 has only been 3.0, so we never had a chance to think about such a requirement - that's why there's no restriction on breaking compatibility in the codebase. That's not a new requirement, that's an "implicit" requirement via semantic versioning. I agree that some of the APIs have changed between Spark 2.x versions, but I guess the changes in the "new" DSv2 would be bigger than the sum of the changes to the "old" DSv2 introduced across multiple minor versions.

Suppose we're developers in the Spark ecosystem maintaining a custom data source (forget about developing Spark itself): I would get an official announcement of the next minor version, and I'd want to try it out quickly to see whether my stuff still supports the new version. When I change the dependency version, everything will break. My hopeful expectation would be no issues while upgrading, but it turns out that's not the case, and it even requires new learning (not only fixing compilation failures). That would just make me give up supporting Spark 2.5, or at least I wouldn't follow up on such a change quickly. IMHO a 3.0 tech-preview has the advantage here (assuming we provide maven artifacts as well as an official announcement), as it sets the expectation that there are a bunch of changes, given it's a new major version. It also provides a bunch of time to try adopting it before the version is officially released.


On Wed, Sep 25, 2019 at 4:56 AM Ryan Blue <rb...@netflix.com> wrote:
From those questions, I can see that there is significant confusion about what I'm proposing, so let me try to clear it up.

> 1. Is DSv2 stable in `master`?

DSv2 has reached a stable API that is capable of supporting all of the features we intend to deliver for Spark 3.0. The proposal is to backport the same API and features for Spark 2.5.

I am not saying that this API won't change after 3.0. Notably, Reynold wants to change the use of InternalRow. But, these changes are after 3.0 and don't affect the compatibility I'm proposing, between the 2.5 and 3.0 releases. I also doubt that breaking changes would happen by 3.1.

> 2. If so, what subset of DSv2 patches is Ryan suggesting backporting?

I am proposing backporting what we intend to deliver for 3.0: the API currently in master, SQL support, and multi-catalog support.
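
(To make "SQL support, and multi-catalog support" concrete, a rough
spark-shell style sketch - the catalog name testcat and the plugin class
com.example.TestCatalog are placeholders, and exact behavior should be
checked against master:)

// Register a custom catalog plugin under the name "testcat", then address
// its tables with catalog.namespace.table identifiers in plain SQL.
spark.conf.set("spark.sql.catalog.testcat", "com.example.TestCatalog")

spark.sql("CREATE TABLE testcat.ns.events (id BIGINT, data STRING) USING demo")
spark.sql("INSERT INTO testcat.ns.events VALUES (1, 'a')")
spark.sql("SELECT * FROM testcat.ns.events").show()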

> 3. How different would those backported DSv2 patches look in `branch-2.4`?

DSv2 is mostly an addition located in the `connector` package. It also changes some parts of the SQL parser and adds parsed plans, as well as new rules to convert from parsed plans. This is not an invasive change because we kept most of DSv2 separate. DSv2 should be nearly identical between the two branches.
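
(To show the shape of that addition, here is a minimal sketch of the
reader-side surface in the connector package - DemoReader and its
hard-coded rows are made up, and exact signatures should be checked
against master:)

import org.apache.spark.sql.catalyst.InternalRow
import org.apache.spark.sql.connector.read.PartitionReader
import org.apache.spark.unsafe.types.UTF8String

// Hypothetical reader: rows cross the API boundary as InternalRow (the
// interface debated earlier in this thread), produced one at a time.
class DemoReader extends PartitionReader[InternalRow] {
  private val rows = Iterator(
    InternalRow(1L, UTF8String.fromString("a")),
    InternalRow(2L, UTF8String.fromString("b")))
  private var current: InternalRow = _

  override def next(): Boolean =
    if (rows.hasNext) { current = rows.next(); true } else false

  override def get(): InternalRow = current

  override def close(): Unit = ()
}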

> 4. What does he mean by `without breaking changes`? Is it technically feasible?

DSv2 is marked unstable in the 2.x line and changes between releases. The API changed between 2.3 and 2.4, so this would be no different. But, we would keep the API the same between 2.5 and 3.0 to assist migration.

This is technically feasible because what we are planning to deliver for 3.0 is nearly ready, and the API has not needed to change recently.

> Apache Spark 2.4.x and 2.5.x DSv2 should be compatible.

This has not been a requirement for DSv2 development so far. If this is a new requirement, then we should not do a 2.5 release.

> 5. How long does it take? Is it possible before 3.0.0-preview? Who will work on that backport?

As I said, I'm already going to do this work, so I'm offering to release it to the community. I don't know how long it will take, but this work and 3.0-preview are not mutually exclusive.

> 6. Is this meaningful if 2.5 and 3.1 become different again too soon (in 2020 Summer)?

It is useful to me, so I assume it is useful to others.

I also think it is unlikely that 3.1 will need to make API changes to DSv2. There may be some bugs found, but I don't think we will break API compatibility so quickly. Most of the changes to the API will require only additions.

> If you have a working branch, please share it with us.

I don't have a branch to share.


On Mon, Sep 23, 2019 at 6:47 PM Dongjoon Hyun <do...@gmail.com> wrote:
Hi, Ryan.

This thread has many replies, as you see. That is evidence that the community is very interested in your suggestion.

> I'm offering to help build a stable release without breaking changes. But if there is no community interest in it, I'm happy to drop this.

In this thread, the root cause of the disagreement is the lack of supporting evidence for your claims.

1. Is DSv2 stable in `master`?
2. If so, what subset of DSv2 patches is Ryan suggesting backporting?
3. How different would those backported DSv2 patches look in `branch-2.4`?
4. What does he mean by `without breaking changes`? Is it technically feasible?
    Apache Spark 2.4.x and 2.5.x DSv2 should be compatible. (Not between 2.5.x DSv2 and 3.0.0 DSv2)
5. How long does it take? Is it possible before 3.0.0-preview? Who will work on that backport?
6. Is this meaningful if 2.5 and 3.1 become different again too soon (in 2020 Summer)?

We are SW engineers.
If you have a working branch, please share it with us.
It will help us understand your suggestion and this discussion.
We can help you verify that branch achieves your goal.
The branch is tested already, isn't it?

Bests,
Dongjoon.




On Mon, Sep 23, 2019 at 10:44 AM Holden Karau <ho...@pigscanfly.ca> wrote:
I would personally love to see us provide a gentle migration path to Spark 3, especially if much of the work is already going to happen anyway.

Maybe giving it a different name (e.g., something like Spark-2-to-3-transitional) would make its intended purpose clearer and encourage folks to move to 3 when they can?

On Mon, Sep 23, 2019 at 9:17 AM Ryan Blue <rb...@netflix.com.invalid> wrote:
My understanding is that 3.0-preview is not going to be a production-ready release. For those of us that have been using backports of DSv2 in production, that doesn't help.

It also doesn't help as a stepping stone because users would need to handle all of the incompatible changes in 3.0. Using 3.0-preview would be an unstable release with breaking changes instead of a stable release without the breaking changes.

I'm offering to help build a stable release without breaking changes. But if there is no community interest in it, I'm happy to drop this.

On Sun, Sep 22, 2019 at 6:39 PM Hyukjin Kwon <gu...@gmail.com> wrote:
+1 for Matei's as well.
On Sun, 22 Sep 2019, 14:59 Marco Gaido <ma...@gmail.com> wrote:
I agree with Matei too.

Thanks,
Marco

On Sun, Sep 22, 2019 at 03:44 Dongjoon Hyun <do...@gmail.com> wrote:
+1 for Matei's suggestion!

Bests,
Dongjoon.

On Sat, Sep 21, 2019 at 5:44 PM Matei Zaharia <ma...@gmail.com> wrote:
If the goal is to get people to try the DSv2 API and build DSv2 data sources, can we recommend the 3.0-preview release for this? That would get people shifting to 3.0 faster, which is probably better overall compared to maintaining two major versions. There’s not that much else changing in 3.0 if you already want to update your Java version.


On Sep 21, 2019, at 2:45 PM, Ryan Blue <rb...@netflix.com.INVALID> wrote:

> If you insist we shouldn't change the unstable temporary API in 3.x . . .

Not what I'm saying at all. I said we should carefully consider whether a breaking change is the right decision in the 3.x line.

All I'm suggesting is that we can make a 2.5 release with the feature and an API that is the same as the one in 3.0.

> I also don't get this backporting of a giant feature to the 2.x line

I am planning to do this so we can use DSv2 before 3.0 is released. Then we can have a source implementation that works in both 2.x and 3.0 to make the transition easier. Since I'm already doing the work, I'm offering to share it with the community.


On Sat, Sep 21, 2019 at 2:36 PM Reynold Xin <rx...@databricks.com> wrote:

Because, for example, we'd need to move the location of InternalRow, breaking the package name. If you insist we shouldn't change the unstable temporary API in 3.x to maintain compatibility with 3.0, which is totally different from my understanding of the situation when you exposed it, then I'd say we should gate 3.0 on having a stable row interface.

I also don't get this backporting of a giant feature to the 2.x line ... as suggested by others in the thread, DSv2 would be one of the main reasons people upgrade to 3.0. What's so special about DSv2 that we are doing this? Why not abandon 3.0 entirely and backport all the features to 2.x?
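
(The package coupling being pointed at here is visible in what a source
implementor has to import today; both types below live in internal
modules. A two-line illustration, runnable in spark-shell:)

// Row data handed across the DSv2 boundary currently comes from internal
// packages: InternalRow from sql/catalyst, UTF8String from common/unsafe.
import org.apache.spark.sql.catalyst.InternalRow
import org.apache.spark.unsafe.types.UTF8String

val row: InternalRow = InternalRow(42L, UTF8String.fromString("spark"))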



On Sat, Sep 21, 2019 at 2:31 PM, Ryan Blue <rb...@netflix.com> wrote:
Why would that require an incompatible change?

We *could* make an incompatible change and remove support for InternalRow, but I think we would want to carefully consider whether that is the right decision. And in any case, we would be able to keep 2.5 and 3.0 compatible, which is the main goal.

On Sat, Sep 21, 2019 at 2:28 PM Reynold Xin <rx...@databricks.com> wrote:
How would you not make incompatible changes in 3.x? As discussed, the InternalRow API is not stable and needs to change.

On Sat, Sep 21, 2019 at 2:27 PM Ryan Blue <rb...@netflix.com> wrote:
> Making downstream projects diverge their implementations heavily between minor versions (say, 2.4 vs 2.5) wouldn't be a good experience

You're right that the API has been evolving in the 2.x line. But, it is now reasonably stable with respect to the current feature set and we should not need to break compatibility in the 3.x line. Because we have reached our goals for the 3.0 release, we can backport at least those features to 2.x and confidently have an API that works in both a 2.x release and is compatible with 3.0, if not 3.1 and later releases as well.

> I'd rather say preparation of Spark 2.5 should start after Spark 3.0 is officially released

The reason I'm suggesting this is that I'm already going to do the work to backport the 3.0 release features to 2.4. I've been asked by several people when DSv2 will be released, so I know there is a lot of interest in making this available sooner than 3.0. If I'm already doing the work, then I'd be happy to share that with the community.

I don't see why 2.5 and 3.0 are mutually exclusive. We can work on 2.5 while preparing the 3.0 preview and fixing bugs. For DSv2, the work is about complete so we can easily release the same set of features and API in 2.5 and 3.0.

If we decide for some reason to wait until after 3.0 is released, I don't know that there is much value in a 2.5. The purpose is to be a step toward 3.0, and releasing that step after 3.0 doesn't seem helpful to me. It also wouldn't get these features out any sooner than 3.0, as a 2.5 release probably would, given the work needed to validate the incompatible changes in 3.0.

> the DSv2 change would be the major backward incompatibility making Spark 2.x users hesitate to upgrade

As I pointed out, DSv2 has been changing in the 2.x line, so this is expected. I don't think it will need incompatible changes in the 3.x line.


Re: [DISCUSS] Spark 2.5 release

Posted by Ryan Blue <rb...@netflix.com.INVALID>.
> That's not a new requirement, that's an "implicit" requirement via
semantic versioning.

The expectation is that the DSv2 API will change in minor versions in the
2.x line. The API is marked with the Experimental API annotation to signal
that it can change, and it has been changing.

A requirement to not change this API for a 2.5 release is a new
requirement. I'm fine with that if that's what everyone wants. Like I said,
if we want to add a requirement to not change this API then we shouldn't
release the 2.5 that I'm proposing.

On Tue, Sep 24, 2019 at 2:51 PM Jungtaek Lim <ka...@gmail.com> wrote:

> >> Apache Spark 2.4.x and 2.5.x DSv2 should be compatible.
>
> > This has not been a requirement for DSv2 development so far. If this is
> a new requirement, then we should not do a 2.5 release.
>
> My 2 cents: the target version of the new DSv2 has been only 3.0, so we never
> had a chance to think about such a requirement - that's why there's no
> restriction on breaking compatibility in the codebase. That's not a new
> requirement; it's an "implicit" requirement via semantic versioning. I
> agree that some of the APIs have changed between Spark 2.x versions, but I
> guess the changes in the "new" DSv2 would be bigger than the sum of the
> changes to the "old" DSv2 introduced across multiple minor versions.
>
> Suppose we're developers in the Spark ecosystem maintaining a custom data
> source (forget about developing Spark itself): I would see an official
> announcement of the next minor version, and I would want to try it out
> quickly to check that my code still supports the new version. When I change
> the dependency version, everything will break. My hopeful expectation would
> be a painless upgrade, but it turns out not to be, and it even requires new
> learning (not only fixing compilation failures). That would just make me
> give up supporting Spark 2.5, or at least not follow up on such a change
> quickly. IMHO a 3.0 tech preview has an advantage here (assuming we provide
> Maven artifacts as well as an official announcement), as it sets the
> expectation that there are a bunch of changes, given it's a new major
> version. It also provides plenty of time to try adopting it before the
> version is officially released.
>
>
> On Wed, Sep 25, 2019 at 4:56 AM Ryan Blue <rb...@netflix.com> wrote:
>
>> From those questions, I can see that there is significant confusion about
>> what I'm proposing, so let me try to clear it up.
>>
>> > 1. Is DSv2 stable in `master`?
>>
>> DSv2 has reached a stable API that is capable of supporting all of the
>> features we intend to deliver for Spark 3.0. The proposal is to backport
>> the same API and features for Spark 2.5.
>>
>> I am not saying that this API won't change after 3.0. Notably, Reynold
>> wants to change the use of InternalRow. But, these changes are after 3.0
>> and don't affect the compatibility I'm proposing, between the 2.5 and 3.0
>> releases. I also doubt that breaking changes would happen by 3.1.
>>
>> > 2. If so, which subset of DSv2 patches is Ryan suggesting
>> backporting?
>>
>> I am proposing backporting what we intend to deliver for 3.0: the API
>> currently in master, SQL support, and multi-catalog support.
>>
>> > 3. How different do those backported DSv2 patches look in
>> `branch-2.4`?
>>
>> DSv2 is mostly an addition located in the `connector` package. It also
>> changes some parts of the SQL parser and adds parsed plans, as well as new
>> rules to convert from parsed plans. This is not an invasive change because
>> we kept most of DSv2 separate. DSv2 should be nearly identical between the
>> two branches.
>>
>> > 4. What does he mean by `without breaking changes`? Is it technically
>> feasible?
>>
>> DSv2 is marked unstable in the 2.x line and changes between releases. The
>> API changed between 2.3 and 2.4, so this would be no different. But, we
>> would keep the API the same between 2.5 and 3.0 to assist migration.
>>
>> This is technically feasible because what we are planning to deliver for
>> 3.0 is nearly ready, and the API has not needed to change recently.
>>
>> > Apache Spark 2.4.x and 2.5.x DSv2 should be compatible.
>>
>> This has not been a requirement for DSv2 development so far. If this is a
>> new requirement, then we should not do a 2.5 release.
>>
>> > 5. How long does it take? Is it possible before 3.0.0-preview? Who will
>> work on that backporting?
>>
>> As I said, I'm already going to do this work, so I'm offering to release
>> it to the community. I don't know how long it will take, but this work and
>> 3.0-preview are not mutually exclusive.
>>
>> > 6. Is this meaningful if 2.5 and 3.1 become different again too soon
>> (in 2020 Summer)?
>>
>> It is useful to me, so I assume it is useful to others.
>>
>> I also think it is unlikely that 3.1 will need to make API changes to
>> DSv2. There may be some bugs found, but I don't think we will break API
>> compatibility so quickly. Most of the changes to the API will require only
>> additions.
>>
>> > If you have a working branch, please share with us.
>>
>> I don't have a branch to share.
>>
>>
>> On Mon, Sep 23, 2019 at 6:47 PM Dongjoon Hyun <do...@gmail.com>
>> wrote:
>>
>>> Hi, Ryan.
>>>
>>> This thread has had many replies, as you can see. That is evidence that the
>>> community is quite interested in your suggestion.
>>>
>>> > I'm offering to help build a stable release without breaking changes.
>>> But if there is no community interest in it, I'm happy to drop this.
>>>
>>> In this thread, the root cause of the disagreement is the lack of
>>> supporting evidence for your claims.
>>>
>>> 1. Is DSv2 stable in `master`?
>>> 2. If so, which subset of DSv2 patches is Ryan suggesting
>>> backporting?
>>> 3. How different do those backported DSv2 patches look in
>>> `branch-2.4`?
>>> 4. What does he mean by `without breaking changes`? Is it technically
>>> feasible?
>>>     Apache Spark 2.4.x and 2.5.x DSv2 should be compatible. (Not between
>>> 2.5.x DSv2 and 3.0.0 DSv2)
>>> 5. How long does it take? Is it possible before 3.0.0-preview? Who will
>>> work on that backporting?
>>> 6. Is this meaningful if 2.5 and 3.1 become different again too soon (in
>>> 2020 Summer)?
>>>
>>> We are SW engineers.
>>> If you have a working branch, please share with us.
>>> It will help us understand your suggestion and this discussion.
>>> We can help you verify that branch achieves your goal.
>>> The branch is tested already, isn't it?
>>>
>>> Bests,
>>> Dongjoon.
>>>
>>>
>>>
>>>
>>> On Mon, Sep 23, 2019 at 10:44 AM Holden Karau <ho...@pigscanfly.ca>
>>> wrote:
>>>
>>>> I would personally love to see us provide a gentle migration path to
>>>> Spark 3, especially if much of the work is already going to happen anyway.
>>>>
>>>> Maybe giving it a different name (eg something like
>>>> Spark-2-to-3-transitional) would make its intended purpose clearer and
>>>> encourage folks to move to 3 when they can?
>>>>
>>>> On Mon, Sep 23, 2019 at 9:17 AM Ryan Blue <rb...@netflix.com.invalid>
>>>> wrote:
>>>>
>>>>> My understanding is that 3.0-preview is not going to be a
>>>>> production-ready release. For those of us that have been using backports of
>>>>> DSv2 in production, that doesn't help.
>>>>>
>>>>> It also doesn't help as a stepping stone because users would need to
>>>>> handle all of the incompatible changes in 3.0. Using 3.0-preview would be
>>>>> an unstable release with breaking changes instead of a stable release
>>>>> without the breaking changes.
>>>>>
>>>>> I'm offering to help build a stable release without breaking changes.
>>>>> But if there is no community interest in it, I'm happy to drop this.
>>>>>
>>>>> On Sun, Sep 22, 2019 at 6:39 PM Hyukjin Kwon <gu...@gmail.com>
>>>>> wrote:
>>>>>
>>>>>> +1 for Matei's as well.
>>>>>>
>>>>>> On Sun, 22 Sep 2019, 14:59 Marco Gaido, <ma...@gmail.com>
>>>>>> wrote:
>>>>>>
>>>>>>> I agree with Matei too.
>>>>>>>
>>>>>>> Thanks,
>>>>>>> Marco
>>>>>>>
>>>>>>> Il giorno dom 22 set 2019 alle ore 03:44 Dongjoon Hyun <
>>>>>>> dongjoon.hyun@gmail.com> ha scritto:
>>>>>>>
>>>>>>>> +1 for Matei's suggestion!
>>>>>>>>
>>>>>>>> Bests,
>>>>>>>> Dongjoon.
>>>>>>>>
>>>>>>>> On Sat, Sep 21, 2019 at 5:44 PM Matei Zaharia <
>>>>>>>> matei.zaharia@gmail.com> wrote:
>>>>>>>>
>>>>>>>>> If the goal is to get people to try the DSv2 API and build DSv2
>>>>>>>>> data sources, can we recommend the 3.0-preview release for this? That would
>>>>>>>>> get people shifting to 3.0 faster, which is probably better overall
>>>>>>>>> compared to maintaining two major versions. There’s not that much else
>>>>>>>>> changing in 3.0 if you already want to update your Java version.
>>>>>>>>>
>>>>>>>>> On Sep 21, 2019, at 2:45 PM, Ryan Blue <rb...@netflix.com.INVALID>
>>>>>>>>> wrote:
>>>>>>>>>
>>>>>>>>> > If you insist we shouldn't change the unstable temporary API in
>>>>>>>>> 3.x . . .
>>>>>>>>>
>>>>>>>>> Not what I'm saying at all. I said we should carefully
>>>>>>>>> consider whether a breaking change is the right decision in the 3.x line.
>>>>>>>>>
>>>>>>>>> All I'm suggesting is that we can make a 2.5 release with the
>>>>>>>>> feature and an API that is the same as the one in 3.0.
>>>>>>>>>
>>>>>>>>> > I also don't get this backporting a giant feature to 2.x line
>>>>>>>>>
>>>>>>>>> I am planning to do this so we can use DSv2 before 3.0 is
>>>>>>>>> released. Then we can have a source implementation that works in both 2.x
>>>>>>>>> and 3.0 to make the transition easier. Since I'm already doing the work,
>>>>>>>>> I'm offering to share it with the community.
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On Sat, Sep 21, 2019 at 2:36 PM Reynold Xin <rx...@databricks.com>
>>>>>>>>> wrote:
>>>>>>>>>
>>>>>>>>>> Because for example we'd need to move the location of
>>>>>>>>>> InternalRow, breaking the package name. If you insist we shouldn't change
>>>>>>>>>> the unstable temporary API in 3.x to maintain compatibility with 3.0, which
>>>>>>>>>> is totally different from my understanding of the situation when you
>>>>>>>>>> exposed it, then I'd say we should gate 3.0 on having a stable row
>>>>>>>>>> interface.
>>>>>>>>>>
>>>>>>>>>> I also don't get this backporting a giant feature to 2.x line ...
>>>>>>>>>> as suggested by others in the thread, DSv2 would be one of the main reasons
>>>>>>>>>> people upgrade to 3.0. What's so special about DSv2 that we are doing this?
>>>>>>>>>> Why not abandon 3.0 entirely and backport all the features to 2.x?
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> On Sat, Sep 21, 2019 at 2:31 PM, Ryan Blue <rb...@netflix.com>
>>>>>>>>>> wrote:
>>>>>>>>>>
>>>>>>>>>>> Why would that require an incompatible change?
>>>>>>>>>>>
>>>>>>>>>>> We *could* make an incompatible change and remove support for
>>>>>>>>>>> InternalRow, but I think we would want to carefully consider whether that
>>>>>>>>>>> is the right decision. And in any case, we would be able to keep 2.5 and
>>>>>>>>>>> 3.0 compatible, which is the main goal.
>>>>>>>>>>>
>>>>>>>>>>> On Sat, Sep 21, 2019 at 2:28 PM Reynold Xin <rx...@databricks.com>
>>>>>>>>>>> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> How would you not make incompatible changes in 3.x? As
>>>>>>>>>>>> discussed the InternalRow API is not stable and needs to change.
>>>>>>>>>>>>
>>>>>>>>>>>> On Sat, Sep 21, 2019 at 2:27 PM Ryan Blue <rb...@netflix.com>
>>>>>>>>>>>> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> > Making downstream projects diverge their implementations heavily
>>>>>>>>>>>>> between minor versions (say, 2.4 vs 2.5) wouldn't be a good experience
>>>>>>>>>>>>>
>>>>>>>>>>>>> You're right that the API has been evolving in the 2.x
>>>>>>>>>>>>> line. But, it is now reasonably stable with respect to the current feature
>>>>>>>>>>>>> set and we should not need to break compatibility in the 3.x line. Because
>>>>>>>>>>>>> we have reached our goals for the 3.0 release, we can backport at least
>>>>>>>>>>>>> those features to 2.x and confidently have an API that works in both a 2.x
>>>>>>>>>>>>> release and is compatible with 3.0, if not 3.1 and later releases as well.
>>>>>>>>>>>>>
>>>>>>>>>>>>> > I'd rather say preparation of Spark 2.5 should be started
>>>>>>>>>>>>> after Spark 3.0 is officially released
>>>>>>>>>>>>>
>>>>>>>>>>>>> The reason I'm suggesting this is that I'm already going to do
>>>>>>>>>>>>> the work to backport the 3.0 release features to 2.4. I've been asked by
>>>>>>>>>>>>> several people when DSv2 will be released, so I know there is a lot of
>>>>>>>>>>>>> interest in making this available sooner than 3.0. If I'm already doing the
>>>>>>>>>>>>> work, then I'd be happy to share that with the community.
>>>>>>>>>>>>>
>>>>>>>>>>>>> I don't see why 2.5 and 3.0 are mutually exclusive. We can
>>>>>>>>>>>>> work on 2.5 while preparing the 3.0 preview and fixing bugs. For DSv2, the
>>>>>>>>>>>>> work is about complete so we can easily release the same set of features
>>>>>>>>>>>>> and API in 2.5 and 3.0.
>>>>>>>>>>>>>
>>>>>>>>>>>>> If we decide for some reason to wait until after 3.0 is
>>>>>>>>>>>>> released, I don't know that there is much value in a 2.5. The purpose is to
>>>>>>>>>>>>> be a step toward 3.0, and releasing that step after 3.0 doesn't seem
>>>>>>>>>>>>> helpful to me. It also wouldn't get these features out any sooner than 3.0,
>>>>>>>>>>>>> as a 2.5 release probably would, given the work needed to validate the
>>>>>>>>>>>>> incompatible changes in 3.0.
>>>>>>>>>>>>>
>>>>>>>>>>>>> > DSv2 change would be the major backward incompatibility
>>>>>>>>>>>>> which Spark 2.x users may hesitate to upgrade
>>>>>>>>>>>>>
>>>>>>>>>>>>> As I pointed out, DSv2 has been changing in the 2.x line, so
>>>>>>>>>>>>> this is expected. I don't think it will need incompatible changes in the
>>>>>>>>>>>>> 3.x line.
>>>>>>>>>>>>>
>>>>>>>>>>>>> On Fri, Sep 20, 2019 at 9:25 PM Jungtaek Lim <
>>>>>>>>>>>>> kabhwan@gmail.com> wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>>> Just 2 cents: I haven't tracked the changes in DSv2 (though I
>>>>>>>>>>>>>> needed to deal with them, as the changes caused confusion on my PRs...), but my
>>>>>>>>>>>>>> bet is that DSv2 has already changed in incompatible ways, at least for those
>>>>>>>>>>>>>> who work on custom DataSources. Making downstream projects diverge their
>>>>>>>>>>>>>> implementations heavily between minor versions (say, 2.4 vs 2.5) wouldn't be
>>>>>>>>>>>>>> a good experience - especially since we haven't completely closed off the
>>>>>>>>>>>>>> chance to further modify DSv2, and such a change could be backward incompatible.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> If we really want to bring the DSv2 change to the 2.x version
>>>>>>>>>>>>>> line so end users can avoid being forced to upgrade to Spark 3.x to enjoy the
>>>>>>>>>>>>>> new DSv2, I'd rather say preparation of Spark 2.5 should start after Spark 3.0
>>>>>>>>>>>>>> is officially released - honestly even later than that, say, after getting some
>>>>>>>>>>>>>> reports about DSv2 from Spark 3.0 so that we feel DSv2 is OK. I hope we
>>>>>>>>>>>>>> don't make Spark 2.5 a kind of "tech-preview" that Spark 2.4 users are
>>>>>>>>>>>>>> frustrated to upgrade to as the next minor version.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Btw, do we have any specific target users for this?
>>>>>>>>>>>>>> Personally, the DSv2 change would be the major backward incompatibility that
>>>>>>>>>>>>>> makes Spark 2.x users hesitate to upgrade, so if they are prepared to migrate
>>>>>>>>>>>>>> to the new DSv2 they might already be prepared to migrate to Spark 3.0.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> On Sat, Sep 21, 2019 at 12:46 PM Dongjoon Hyun <
>>>>>>>>>>>>>> dongjoon.hyun@gmail.com> wrote:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Do you mean you want to have a breaking API change between
>>>>>>>>>>>>>>> 3.0 and 3.1?
>>>>>>>>>>>>>>> I believe we follow Semantic Versioning (
>>>>>>>>>>>>>>> https://spark.apache.org/versioning-policy.html ).
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> > We just won’t add any breaking changes before 3.1.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Bests,
>>>>>>>>>>>>>>> Dongjoon.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> On Fri, Sep 20, 2019 at 11:48 AM Ryan Blue <
>>>>>>>>>>>>>>> rblue@netflix.com.invalid> wrote:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> I don’t think we need to gate a 3.0 release on making a
>>>>>>>>>>>>>>>> more stable version of InternalRow
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Sounds like we agree, then. We will use it for 3.0, but
>>>>>>>>>>>>>>>> there are known problems with it.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Thinking we’d have dsv2 working in both 3.x (which will
>>>>>>>>>>>>>>>> change and progress towards more stable, but will have to break certain
>>>>>>>>>>>>>>>> APIs) and 2.x seems like a false premise.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Why do you think we will need to break certain APIs before
>>>>>>>>>>>>>>>> 3.0?
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> I’m only suggesting that we release the same support in a
>>>>>>>>>>>>>>>> 2.5 release that we do in 3.0. Since we are nearly finished with the 3.0
>>>>>>>>>>>>>>>> goals, it seems like we can certainly do that. We just won’t add any
>>>>>>>>>>>>>>>> breaking changes before 3.1.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> On Fri, Sep 20, 2019 at 11:39 AM Reynold Xin <
>>>>>>>>>>>>>>>> rxin@databricks.com> wrote:
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> I don't think we need to gate a 3.0 release on making a
>>>>>>>>>>>>>>>>> more stable version of InternalRow, but thinking we'd have dsv2 working in
>>>>>>>>>>>>>>>>> both 3.x (which will change and progress towards more stable, but will have
>>>>>>>>>>>>>>>>> to break certain APIs) and 2.x seems like a false premise.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> To point out some problems with InternalRow that you think
>>>>>>>>>>>>>>>>> are already pragmatic and stable:
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> The class is in catalyst, which states:
>>>>>>>>>>>>>>>>> https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/package.scala
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> /**
>>>>>>>>>>>>>>>>> * Catalyst is a library for manipulating relational query
>>>>>>>>>>>>>>>>> plans.  All classes in catalyst are
>>>>>>>>>>>>>>>>> * considered an internal API to Spark SQL and are subject
>>>>>>>>>>>>>>>>> to change between minor releases.
>>>>>>>>>>>>>>>>> */
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> There is not even an annotation on the interface.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> The entire dependency chain was created to be private,
>>>>>>>>>>>>>>>>> and tightly coupled with internal implementations. For example,
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> https://github.com/apache/spark/blob/master/common/unsafe/src/main/java/org/apache/spark/unsafe/types/UTF8String.java
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> /**
>>>>>>>>>>>>>>>>> * A UTF-8 String for internal Spark use.
>>>>>>>>>>>>>>>>> * <p>
>>>>>>>>>>>>>>>>> * A String encoded in UTF-8 as an Array[Byte], which can
>>>>>>>>>>>>>>>>> be used for comparison,
>>>>>>>>>>>>>>>>> * search, see http://en.wikipedia.org/wiki/UTF-8 for
>>>>>>>>>>>>>>>>> details.
>>>>>>>>>>>>>>>>> * <p>
>>>>>>>>>>>>>>>>> * Note: This is not designed for general use cases, should
>>>>>>>>>>>>>>>>> not be used outside SQL.
>>>>>>>>>>>>>>>>> */
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/ArrayData.scala
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> (which again is in catalyst package)
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> If you want to argue this way, you might as well argue we
>>>>>>>>>>>>>>>>> should make the entire catalyst package public to be pragmatic and not
>>>>>>>>>>>>>>>>> allow any changes.
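
To make the coupling concrete: a v2 source that hands rows back to Spark has to
build them from these internal types. Below is a minimal sketch in Scala (the
package layout is the one on current master; the schema and values are
illustrative, not from this thread):

import org.apache.spark.sql.catalyst.InternalRow
import org.apache.spark.sql.catalyst.expressions.GenericInternalRow
import org.apache.spark.sql.catalyst.util.ArrayData
import org.apache.spark.unsafe.types.UTF8String

// A row for the schema (name STRING, scores ARRAY<INT>). Strings must be
// UTF8String and arrays must be ArrayData; plain java.lang.String or Seq
// values would fail at runtime, which is exactly the coupling described above.
val row: InternalRow = new GenericInternalRow(Array[Any](
  UTF8String.fromString("example"),
  ArrayData.toArrayData(Array(1, 2, 3))
))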
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> On Fri, Sep 20, 2019 at 11:32 AM, Ryan Blue <
>>>>>>>>>>>>>>>>> rblue@netflix.com> wrote:
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> When you created the PR to make InternalRow public
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> This isn’t quite accurate. The change I made was to use
>>>>>>>>>>>>>>>>>> InternalRow instead of UnsafeRow, which is a specific
>>>>>>>>>>>>>>>>>> implementation of InternalRow. Exposing this API has
>>>>>>>>>>>>>>>>>> always been a part of DSv2 and while both you and I did some work to avoid
>>>>>>>>>>>>>>>>>> this, we are still in the phase of starting with that API.
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> Note that any change to InternalRow would be very costly
>>>>>>>>>>>>>>>>>> to implement because this interface is widely used. That is why I think we
>>>>>>>>>>>>>>>>>> can certainly consider it stable enough to use here, and that’s probably
>>>>>>>>>>>>>>>>>> why UnsafeRow was part of the original proposal.
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> In any case, the goal for 3.0 was not to replace the use
>>>>>>>>>>>>>>>>>> of InternalRow, it was to get the majority of SQL
>>>>>>>>>>>>>>>>>> working on top of the interface added after 2.4. That’s done and stable, so
>>>>>>>>>>>>>>>>>> I think a 2.5 release with it is also reasonable.
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> On Fri, Sep 20, 2019 at 11:23 AM Reynold Xin <
>>>>>>>>>>>>>>>>>> rxin@databricks.com> wrote:
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> To push back, while I agree we should not drastically
>>>>>>>>>>>>>>>>>>> change "InternalRow", there are a lot of changes that need to happen to
>>>>>>>>>>>>>>>>>>> make it stable. For example, none of the publicly exposed interfaces should
>>>>>>>>>>>>>>>>>>> be in the Catalyst package or the unsafe package. External implementations
>>>>>>>>>>>>>>>>>>> should be decoupled from the internal implementations, with cheap ways to
>>>>>>>>>>>>>>>>>>> convert back and forth.
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> When you created the PR to make InternalRow public, the
>>>>>>>>>>>>>>>>>>> understanding was to work towards making it stable in the future, assuming
>>>>>>>>>>>>>>>>>>> we will start with an unstable API temporarily. You can't just make a bunch of
>>>>>>>>>>>>>>>>>>> internal APIs tightly coupled with other internal pieces public and stable
>>>>>>>>>>>>>>>>>>> and call it a day, just because it happens to satisfy some use cases
>>>>>>>>>>>>>>>>>>> temporarily assuming the rest of Spark doesn't change.
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> On Fri, Sep 20, 2019 at 11:19 AM, Ryan Blue <
>>>>>>>>>>>>>>>>>>> rblue@netflix.com> wrote:
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> > DSv2 is far from stable right?
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> No, I think it is reasonably stable and very close to
>>>>>>>>>>>>>>>>>>>> being ready for a release.
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> > All the actual data types are unstable and you guys
>>>>>>>>>>>>>>>>>>>> have completely ignored that.
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> I think what you're referring to is the use of
>>>>>>>>>>>>>>>>>>>> `InternalRow`. That's a stable API and there has been no work to avoid
>>>>>>>>>>>>>>>>>>>> using it. In any case, I don't think that anyone is suggesting that we
>>>>>>>>>>>>>>>>>>>> delay 3.0 until a replacement for `InternalRow` is added, right?
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> While I understand the motivation for a better solution
>>>>>>>>>>>>>>>>>>>> here, I think the pragmatic solution is to continue using `InternalRow`.
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> > If the goal is to make DSv2 work across 3.x and 2.x,
>>>>>>>>>>>>>>>>>>>> that seems too invasive of a change to backport once you consider the parts
>>>>>>>>>>>>>>>>>>>> needed to make dsv2 stable.
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> I believe that those of us working on DSv2 are
>>>>>>>>>>>>>>>>>>>> confident about the current stability. We set goals for what to get into
>>>>>>>>>>>>>>>>>>>> the 3.0 release months ago and have very nearly reached the point where we
>>>>>>>>>>>>>>>>>>>> are ready for that release.
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> I don't think instability would be a problem in
>>>>>>>>>>>>>>>>>>>> maintaining compatibility between the 2.5 version and the 3.0 version. If
>>>>>>>>>>>>>>>>>>>> we find that we need to make API changes (other than additions) then we can
>>>>>>>>>>>>>>>>>>>> make those in the 3.1 release. Because the goals we set for the 3.0 release
>>>>>>>>>>>>>>>>>>>> have been reached with the current API and if we are ready to release 3.0,
>>>>>>>>>>>>>>>>>>>> we can release a 2.5 with the same API.
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> On Fri, Sep 20, 2019 at 11:05 AM Reynold Xin <
>>>>>>>>>>>>>>>>>>>> rxin@databricks.com> wrote:
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> DSv2 is far from stable right? All the actual data
>>>>>>>>>>>>>>>>>>>>> types are unstable and you guys have completely ignored that. We'd need to
>>>>>>>>>>>>>>>>>>>>> work on that and that will be a breaking change. If the goal is to make
>>>>>>>>>>>>>>>>>>>>> DSv2 work across 3.x and 2.x, that seems too invasive of a change to
>>>>>>>>>>>>>>>>>>>>> backport once you consider the parts needed to make dsv2 stable.
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> On Fri, Sep 20, 2019 at 10:47 AM, Ryan Blue <
>>>>>>>>>>>>>>>>>>>>> rblue@netflix.com.invalid> wrote:
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>> Hi everyone,
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>> In the DSv2 sync this week, we talked about a
>>>>>>>>>>>>>>>>>>>>>> possible Spark 2.5 release based on the latest Spark 2.4, but with DSv2 and
>>>>>>>>>>>>>>>>>>>>>> Java 11 support added.
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>> A Spark 2.5 release with these two additions will
>>>>>>>>>>>>>>>>>>>>>> help people migrate to Spark 3.0 when it is released because they will be
>>>>>>>>>>>>>>>>>>>>>> able to use a single implementation for DSv2 sources that works in both 2.5
>>>>>>>>>>>>>>>>>>>>>> and 3.0. Similarly, upgrading to 3.0 won't also require updating to
>>>>>>>>>>>>>>>>>>>>>> Java 11 because users could update to Java 11 with the 2.5 release and have
>>>>>>>>>>>>>>>>>>>>>> fewer major changes.
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>> Another reason to consider a 2.5 release is that many
>>>>>>>>>>>>>>>>>>>>>> people are interested in a release with the latest DSv2 API and support for
>>>>>>>>>>>>>>>>>>>>>> DSv2 SQL. I'm already going to be backporting DSv2 support to the Spark 2.4
>>>>>>>>>>>>>>>>>>>>>> line, so it makes sense to share this work with the community.
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>> This release line would just consist of backports
>>>>>>>>>>>>>>>>>>>>>> like DSv2 and Java 11 that assist compatibility, to keep the scope of the
>>>>>>>>>>>>>>>>>>>>>> release small. The purpose is to assist people moving to 3.0 and not
>>>>>>>>>>>>>>>>>>>>>> distract from the 3.0 release.
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>> Would a Spark 2.5 release help anyone else? Are there
>>>>>>>>>>>>>>>>>>>>>> any concerns about this plan?
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>> rb
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>> --
>>>>>>>>>>>>>>>>>>>>>> Ryan Blue
>>>>>>>>>>>>>>>>>>>>>> Software Engineer
>>>>>>>>>>>>>>>>>>>>>> Netflix
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> --
>>>>>>>>>>>>>>>>>>>> Ryan Blue
>>>>>>>>>>>>>>>>>>>> Software Engineer
>>>>>>>>>>>>>>>>>>>> Netflix
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> --
>>>>>>>>>>>>>>>>>> Ryan Blue
>>>>>>>>>>>>>>>>>> Software Engineer
>>>>>>>>>>>>>>>>>> Netflix
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> --
>>>>>>>>>>>>>>>> Ryan Blue
>>>>>>>>>>>>>>>> Software Engineer
>>>>>>>>>>>>>>>> Netflix
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> --
>>>>>>>>>>>>>> Name : Jungtaek Lim
>>>>>>>>>>>>>> Blog : http://medium.com/@heartsavior
>>>>>>>>>>>>>> Twitter : http://twitter.com/heartsavior
>>>>>>>>>>>>>> LinkedIn : http://www.linkedin.com/in/heartsavior
>>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>> --
>>>>>>>>>>>>> Ryan Blue
>>>>>>>>>>>>> Software Engineer
>>>>>>>>>>>>> Netflix
>>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> --
>>>>>>>>>>> Ryan Blue
>>>>>>>>>>>
>>>>>>>>>> --
>>>> Twitter: https://twitter.com/holdenkarau
>>>> Books (Learning Spark, High Performance Spark, etc.):
>>>> https://amzn.to/2MaRAG9  <https://amzn.to/2MaRAG9>
>>>> YouTube Live Streams: https://www.youtube.com/user/holdenkarau
>>>>
>>>
>>
>> --
>> Ryan Blue
>> Software Engineer
>> Netflix
>>
>
>
> --
> Name : Jungtaek Lim
> Blog : http://medium.com/@heartsavior
> Twitter : http://twitter.com/heartsavior
> LinkedIn : http://www.linkedin.com/in/heartsavior
>


-- 
Ryan Blue
Software Engineer
Netflix

Re: [DISCUSS] Spark 2.5 release

Posted by Jungtaek Lim <ka...@gmail.com>.
>> Apache Spark 2.4.x and 2.5.x DSv2 should be compatible.

> This has not been a requirement for DSv2 development so far. If this is a
new requirement, then we should not do a 2.5 release.

My 2 cents: the target version of the new DSv2 has been only 3.0, so we never
had a chance to think about such a requirement - that's why there's no
restriction on breaking compatibility in the codebase. That's not a new
requirement; it's an "implicit" requirement via semantic versioning. I
agree that some of the APIs have changed between Spark 2.x versions, but I
guess the changes in the "new" DSv2 would be bigger than the sum of the
changes to the "old" DSv2 introduced across multiple minor versions.

Suppose we're developers in the Spark ecosystem maintaining a custom data
source (forget about developing Spark itself): I would see an official
announcement of the next minor version, and I would want to try it out
quickly to check that my code still supports the new version. When I change
the dependency version, everything will break. My hopeful expectation would
be a painless upgrade, but it turns out not to be, and it even requires new
learning (not only fixing compilation failures). That would just make me
give up supporting Spark 2.5, or at least not follow up on such a change
quickly. IMHO a 3.0 tech preview has an advantage here (assuming we provide
Maven artifacts as well as an official announcement), as it sets the
expectation that there are a bunch of changes, given it's a new major
version. It also provides plenty of time to try adopting it before the
version is officially released.
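
To make that concrete, a downstream source typically pins Spark the way the
hypothetical build.sbt below does, so a source-compatible 2.5 would be a
one-line version bump while an incompatible one forces code changes (the
version numbers are illustrative):

// build.sbt - hypothetical downstream data source build
val sparkVersion = sys.props.getOrElse("spark.version", "2.4.4")

libraryDependencies ++= Seq(
  // "provided" because the cluster supplies Spark at runtime; bumping
  // sparkVersion is painless only if the APIs we compile against held still.
  "org.apache.spark" %% "spark-sql" % sparkVersion % "provided"
)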


On Wed, Sep 25, 2019 at 4:56 AM Ryan Blue <rb...@netflix.com> wrote:

> From those questions, I can see that there is significant confusion about
> what I'm proposing, so let me try to clear it up.
>
> > 1. Is DSv2 stable in `master`?
>
> DSv2 has reached a stable API that is capable of supporting all of the
> features we intend to deliver for Spark 3.0. The proposal is to backport
> the same API and features for Spark 2.5.
>
> I am not saying that this API won't change after 3.0. Notably, Reynold
> wants to change the use of InternalRow. But, these changes are after 3.0
> and don't affect the compatibility I'm proposing, between the 2.5 and 3.0
> releases. I also doubt that breaking changes would happen by 3.1.
>
> > 2. If so, which subset of DSv2 patches is Ryan suggesting
> backporting?
>
> I am proposing backporting what we intend to deliver for 3.0: the API
> currently in master, SQL support, and multi-catalog support.
>
> > 3. How different do those backported DSv2 patches look in
> `branch-2.4`?
>
> DSv2 is mostly an addition located in the `connector` package. It also
> changes some parts of the SQL parser and adds parsed plans, as well as new
> rules to convert from parsed plans. This is not an invasive change because
> we kept most of DSv2 separate. DSv2 should be nearly identical between the
> two branches.
>
> > 4. What does he mean by `without breaking changes`? Is it technically
> feasible?
>
> DSv2 is marked unstable in the 2.x line and changes between releases. The
> API changed between 2.3 and 2.4, so this would be no different. But, we
> would keep the API the same between 2.5 and 3.0 to assist migration.
>
> This is technically feasible because what we are planning to deliver for
> 3.0 is nearly ready, and the API has not needed to change recently.
>
> > Apache Spark 2.4.x and 2.5.x DSv2 should be compatible.
>
> This has not been a requirement for DSv2 development so far. If this is a
> new requirement, then we should not do a 2.5 release.
>
> > 5. How long does it take? Is it possible before 3.0.0-preview? Who will
> work on that backporting?
>
> As I said, I'm already going to do this work, so I'm offering to release
> it to the community. I don't know how long it will take, but this work and
> 3.0-preview are not mutually exclusive.
>
> > 6. Is this meaningful if 2.5 and 3.1 become different again too soon (in
> 2020 Summer)?
>
> It is useful to me, so I assume it is useful to others.
>
> I also think it is unlikely that 3.1 will need to make API changes to
> DSv2. There may be some bugs found, but I don't think we will break API
> compatibility so quickly. Most of the changes to the API will require only
> additions.
>
> > If you have a working branch, please share with us.
>
> I don't have a branch to share.
>
>
> On Mon, Sep 23, 2019 at 6:47 PM Dongjoon Hyun <do...@gmail.com>
> wrote:
>
>> Hi, Ryan.
>>
>> This thread has had many replies, as you can see. That is evidence that the
>> community is quite interested in your suggestion.
>>
>> > I'm offering to help build a stable release without breaking changes.
>> But if there is no community interest in it, I'm happy to drop this.
>>
>> In this thread, the root cause of the disagreement is the lack of
>> supporting evidence for your claims.
>>
>> 1. Is DSv2 stable in `master`?
>> 2. If so, which subset of DSv2 patches is Ryan suggesting
>> backporting?
>> 3. How different do those backported DSv2 patches look in
>> `branch-2.4`?
>> 4. What does he mean by `without breaking changes`? Is it technically
>> feasible?
>>     Apache Spark 2.4.x and 2.5.x DSv2 should be compatible. (Not between
>> 2.5.x DSv2 and 3.0.0 DSv2)
>> 5. How long does it take? Is it possible before 3.0.0-preview? Who will
>> work on that backporting?
>> 6. Is this meaningful if 2.5 and 3.1 become different again too soon (in
>> 2020 Summer)?
>>
>> We are SW engineers.
>> If you have a working branch, please share with us.
>> It will help us understand your suggestion and this discussion.
>> We can help you verify that branch achieves your goal.
>> The branch is tested already, isn't it?
>>
>> Bests,
>> Dongjoon.
>>
>>
>>
>>
>> On Mon, Sep 23, 2019 at 10:44 AM Holden Karau <ho...@pigscanfly.ca>
>> wrote:
>>
>>> I would personally love to see us provide a gentle migration path to
>>> Spark 3, especially if much of the work is already going to happen anyway.
>>>
>>> Maybe giving it a different name (eg something like
>>> Spark-2-to-3-transitional) would make its intended purpose clearer and
>>> encourage folks to move to 3 when they can?
>>>
>>> On Mon, Sep 23, 2019 at 9:17 AM Ryan Blue <rb...@netflix.com.invalid>
>>> wrote:
>>>
>>>> My understanding is that 3.0-preview is not going to be a
>>>> production-ready release. For those of us that have been using backports of
>>>> DSv2 in production, that doesn't help.
>>>>
>>>> It also doesn't help as a stepping stone because users would need to
>>>> handle all of the incompatible changes in 3.0. Using 3.0-preview would be
>>>> an unstable release with breaking changes instead of a stable release
>>>> without the breaking changes.
>>>>
>>>> I'm offering to help build a stable release without breaking changes.
>>>> But if there is no community interest in it, I'm happy to drop this.
>>>>
>>>> On Sun, Sep 22, 2019 at 6:39 PM Hyukjin Kwon <gu...@gmail.com>
>>>> wrote:
>>>>
>>>>> +1 for Matei's as well.
>>>>>
>>>>> On Sun, 22 Sep 2019, 14:59 Marco Gaido, <ma...@gmail.com>
>>>>> wrote:
>>>>>
>>>>>> I agree with Matei too.
>>>>>>
>>>>>> Thanks,
>>>>>> Marco
>>>>>>
>>>>>> Il giorno dom 22 set 2019 alle ore 03:44 Dongjoon Hyun <
>>>>>> dongjoon.hyun@gmail.com> ha scritto:
>>>>>>
>>>>>>> +1 for Matei's suggestion!
>>>>>>>
>>>>>>> Bests,
>>>>>>> Dongjoon.
>>>>>>>
>>>>>>> On Sat, Sep 21, 2019 at 5:44 PM Matei Zaharia <
>>>>>>> matei.zaharia@gmail.com> wrote:
>>>>>>>
>>>>>>>> If the goal is to get people to try the DSv2 API and build DSv2
>>>>>>>> data sources, can we recommend the 3.0-preview release for this? That would
>>>>>>>> get people shifting to 3.0 faster, which is probably better overall
>>>>>>>> compared to maintaining two major versions. There’s not that much else
>>>>>>>> changing in 3.0 if you already want to update your Java version.
>>>>>>>>
>>>>>>>> On Sep 21, 2019, at 2:45 PM, Ryan Blue <rb...@netflix.com.INVALID>
>>>>>>>> wrote:
>>>>>>>>
>>>>>>>> > If you insist we shouldn't change the unstable temporary API in
>>>>>>>> 3.x . . .
>>>>>>>>
>>>>>>>> Not what I'm saying at all. I said we should carefully
>>>>>>>> consider whether a breaking change is the right decision in the 3.x line.
>>>>>>>>
>>>>>>>> All I'm suggesting is that we can make a 2.5 release with the
>>>>>>>> feature and an API that is the same as the one in 3.0.
>>>>>>>>
>>>>>>>> > I also don't get this backporting a giant feature to 2.x line
>>>>>>>>
>>>>>>>> I am planning to do this so we can use DSv2 before 3.0 is released.
>>>>>>>> Then we can have a source implementation that works in both 2.x and 3.0 to
>>>>>>>> make the transition easier. Since I'm already doing the work, I'm offering
>>>>>>>> to share it with the community.
>>>>>>>>
>>>>>>>>
>>>>>>>> On Sat, Sep 21, 2019 at 2:36 PM Reynold Xin <rx...@databricks.com>
>>>>>>>> wrote:
>>>>>>>>
>>>>>>>>> Because for example we'd need to move the location of InternalRow,
>>>>>>>>> breaking the package name. If you insist we shouldn't change the unstable
>>>>>>>>> temporary API in 3.x to maintain compatibility with 3.0, which is totally
>>>>>>>>> different from my understanding of the situation when you exposed it, then
>>>>>>>>> I'd say we should gate 3.0 on having a stable row interface.
>>>>>>>>>
>>>>>>>>> I also don't get this backporting a giant feature to 2.x line ...
>>>>>>>>> as suggested by others in the thread, DSv2 would be one of the main reasons
>>>>>>>>> people upgrade to 3.0. What's so special about DSv2 that we are doing this?
>>>>>>>>> Why not abandon 3.0 entirely and backport all the features to 2.x?
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On Sat, Sep 21, 2019 at 2:31 PM, Ryan Blue <rb...@netflix.com>
>>>>>>>>> wrote:
>>>>>>>>>
>>>>>>>>>> Why would that require an incompatible change?
>>>>>>>>>>
>>>>>>>>>> We *could* make an incompatible change and remove support for
>>>>>>>>>> InternalRow, but I think we would want to carefully consider whether that
>>>>>>>>>> is the right decision. And in any case, we would be able to keep 2.5 and
>>>>>>>>>> 3.0 compatible, which is the main goal.
>>>>>>>>>>
>>>>>>>>>> On Sat, Sep 21, 2019 at 2:28 PM Reynold Xin <rx...@databricks.com>
>>>>>>>>>> wrote:
>>>>>>>>>>
>>>>>>>>>>> How would you not make incompatible changes in 3.x? As discussed
>>>>>>>>>>> the InternalRow API is not stable and needs to change.
>>>>>>>>>>>
>>>>>>>>>>> On Sat, Sep 21, 2019 at 2:27 PM Ryan Blue <rb...@netflix.com>
>>>>>>>>>>> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> > Making downstream projects diverge their implementations heavily
>>>>>>>>>>>> between minor versions (say, 2.4 vs 2.5) wouldn't be a good experience
>>>>>>>>>>>>
>>>>>>>>>>>> You're right that the API has been evolving in the 2.x
>>>>>>>>>>>> line. But, it is now reasonably stable with respect to the current feature
>>>>>>>>>>>> set and we should not need to break compatibility in the 3.x line. Because
>>>>>>>>>>>> we have reached our goals for the 3.0 release, we can backport at least
>>>>>>>>>>>> those features to 2.x and confidently have an API that works in both a 2.x
>>>>>>>>>>>> release and is compatible with 3.0, if not 3.1 and later releases as well.
>>>>>>>>>>>>
>>>>>>>>>>>> > I'd rather say preparation of Spark 2.5 should be started
>>>>>>>>>>>> after Spark 3.0 is officially released
>>>>>>>>>>>>
>>>>>>>>>>>> The reason I'm suggesting this is that I'm already going to do
>>>>>>>>>>>> the work to backport the 3.0 release features to 2.4. I've been asked by
>>>>>>>>>>>> several people when DSv2 will be released, so I know there is a lot of
>>>>>>>>>>>> interest in making this available sooner than 3.0. If I'm already doing the
>>>>>>>>>>>> work, then I'd be happy to share that with the community.
>>>>>>>>>>>>
>>>>>>>>>>>> I don't see why 2.5 and 3.0 are mutually exclusive. We can work
>>>>>>>>>>>> on 2.5 while preparing the 3.0 preview and fixing bugs. For DSv2, the work
>>>>>>>>>>>> is about complete so we can easily release the same set of features and API
>>>>>>>>>>>> in 2.5 and 3.0.
>>>>>>>>>>>>
>>>>>>>>>>>> If we decide for some reason to wait until after 3.0 is
>>>>>>>>>>>> released, I don't know that there is much value in a 2.5. The purpose is to
>>>>>>>>>>>> be a step toward 3.0, and releasing that step after 3.0 doesn't seem
>>>>>>>>>>>> helpful to me. It also wouldn't get these features out any sooner than 3.0,
>>>>>>>>>>>> as a 2.5 release probably would, given the work needed to validate the
>>>>>>>>>>>> incompatible changes in 3.0.
>>>>>>>>>>>>
>>>>>>>>>>>> > DSv2 change would be the major backward incompatibility which
>>>>>>>>>>>> Spark 2.x users may hesitate to upgrade
>>>>>>>>>>>>
>>>>>>>>>>>> As I pointed out, DSv2 has been changing in the 2.x line, so
>>>>>>>>>>>> this is expected. I don't think it will need incompatible changes in the
>>>>>>>>>>>> 3.x line.
>>>>>>>>>>>>
>>>>>>>>>>>> On Fri, Sep 20, 2019 at 9:25 PM Jungtaek Lim <ka...@gmail.com>
>>>>>>>>>>>> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> Just 2 cents: I haven't tracked the changes in DSv2 (though I
>>>>>>>>>>>>> needed to deal with them, as the changes caused confusion on my PRs...), but my
>>>>>>>>>>>>> bet is that DSv2 has already changed in incompatible ways, at least for those
>>>>>>>>>>>>> who work on custom DataSources. Making downstream projects diverge their
>>>>>>>>>>>>> implementations heavily between minor versions (say, 2.4 vs 2.5) wouldn't be
>>>>>>>>>>>>> a good experience - especially since we haven't completely closed off the
>>>>>>>>>>>>> chance to further modify DSv2, and such a change could be backward incompatible.
>>>>>>>>>>>>>
>>>>>>>>>>>>> If we really want to bring the DSv2 change to the 2.x version
>>>>>>>>>>>>> line so end users can avoid being forced to upgrade to Spark 3.x to enjoy the
>>>>>>>>>>>>> new DSv2, I'd rather say preparation of Spark 2.5 should start after Spark 3.0
>>>>>>>>>>>>> is officially released - honestly even later than that, say, after getting some
>>>>>>>>>>>>> reports about DSv2 from Spark 3.0 so that we feel DSv2 is OK. I hope we
>>>>>>>>>>>>> don't make Spark 2.5 a kind of "tech-preview" that Spark 2.4 users are
>>>>>>>>>>>>> frustrated to upgrade to as the next minor version.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Btw, do we have any specific target users for this?
>>>>>>>>>>>>> Personally, the DSv2 change would be the major backward incompatibility that
>>>>>>>>>>>>> makes Spark 2.x users hesitate to upgrade, so if they are prepared to migrate
>>>>>>>>>>>>> to the new DSv2 they might already be prepared to migrate to Spark 3.0.
>>>>>>>>>>>>>
>>>>>>>>>>>>> On Sat, Sep 21, 2019 at 12:46 PM Dongjoon Hyun <
>>>>>>>>>>>>> dongjoon.hyun@gmail.com> wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>>> Do you mean you want to have a breaking API change between
>>>>>>>>>>>>>> 3.0 and 3.1?
>>>>>>>>>>>>>> I believe we follow Semantic Versioning (
>>>>>>>>>>>>>> https://spark.apache.org/versioning-policy.html ).
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> > We just won’t add any breaking changes before 3.1.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Bests,
>>>>>>>>>>>>>> Dongjoon.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> On Fri, Sep 20, 2019 at 11:48 AM Ryan Blue <
>>>>>>>>>>>>>> rblue@netflix.com.invalid> wrote:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> I don’t think we need to gate a 3.0 release on making a more
>>>>>>>>>>>>>>> stable version of InternalRow
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Sounds like we agree, then. We will use it for 3.0, but
>>>>>>>>>>>>>>> there are known problems with it.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Thinking we’d have dsv2 working in both 3.x (which will
>>>>>>>>>>>>>>> change and progress towards more stable, but will have to break certain
>>>>>>>>>>>>>>> APIs) and 2.x seems like a false premise.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Why do you think we will need to break certain APIs before
>>>>>>>>>>>>>>> 3.0?
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> I’m only suggesting that we release the same support in a
>>>>>>>>>>>>>>> 2.5 release that we do in 3.0. Since we are nearly finished with the 3.0
>>>>>>>>>>>>>>> goals, it seems like we can certainly do that. We just won’t add any
>>>>>>>>>>>>>>> breaking changes before 3.1.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> On Fri, Sep 20, 2019 at 11:39 AM Reynold Xin <
>>>>>>>>>>>>>>> rxin@databricks.com> wrote:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> I don't think we need to gate a 3.0 release on making a
>>>>>>>>>>>>>>>> more stable version of InternalRow, but thinking we'd have dsv2 working in
>>>>>>>>>>>>>>>> both 3.x (which will change and progress towards more stable, but will have
>>>>>>>>>>>>>>>> to break certain APIs) and 2.x seems like a false premise.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> To point out some problems with InternalRow that you think
>>>>>>>>>>>>>>>> are already pragmatic and stable:
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> The class is in catalyst, which states:
>>>>>>>>>>>>>>>> https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/package.scala
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> /**
>>>>>>>>>>>>>>>> * Catalyst is a library for manipulating relational query
>>>>>>>>>>>>>>>> plans.  All classes in catalyst are
>>>>>>>>>>>>>>>> * considered an internal API to Spark SQL and are subject
>>>>>>>>>>>>>>>> to change between minor releases.
>>>>>>>>>>>>>>>> */
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> There is not even an annotation on the interface.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> The entire dependency chain was created to be private, and
>>>>>>>>>>>>>>>> tightly coupled with internal implementations. For example,
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> https://github.com/apache/spark/blob/master/common/unsafe/src/main/java/org/apache/spark/unsafe/types/UTF8String.java
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> /**
>>>>>>>>>>>>>>>> * A UTF-8 String for internal Spark use.
>>>>>>>>>>>>>>>> * <p>
>>>>>>>>>>>>>>>> * A String encoded in UTF-8 as an Array[Byte], which can be
>>>>>>>>>>>>>>>> used for comparison,
>>>>>>>>>>>>>>>> * search, see http://en.wikipedia.org/wiki/UTF-8 for
>>>>>>>>>>>>>>>> details.
>>>>>>>>>>>>>>>> * <p>
>>>>>>>>>>>>>>>> * Note: This is not designed for general use cases, should
>>>>>>>>>>>>>>>> not be used outside SQL.
>>>>>>>>>>>>>>>> */
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/ArrayData.scala
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> (which again is in catalyst package)
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> If you want to argue this way, you might as well argue we
>>>>>>>>>>>>>>>> should make the entire catalyst package public to be pragmatic and not
>>>>>>>>>>>>>>>> allow any changes.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> On Fri, Sep 20, 2019 at 11:32 AM, Ryan Blue <
>>>>>>>>>>>>>>>> rblue@netflix.com> wrote:
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> When you created the PR to make InternalRow public
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> This isn’t quite accurate. The change I made was to use
>>>>>>>>>>>>>>>>> InternalRow instead of UnsafeRow, which is a specific
>>>>>>>>>>>>>>>>> implementation of InternalRow. Exposing this API has
>>>>>>>>>>>>>>>>> always been a part of DSv2 and while both you and I did some work to avoid
>>>>>>>>>>>>>>>>> this, we are still in the phase of starting with that API.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Note that any change to InternalRow would be very costly
>>>>>>>>>>>>>>>>> to implement because this interface is widely used. That is why I think we
>>>>>>>>>>>>>>>>> can certainly consider it stable enough to use here, and that’s probably
>>>>>>>>>>>>>>>>> why UnsafeRow was part of the original proposal.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> In any case, the goal for 3.0 was not to replace the use
>>>>>>>>>>>>>>>>> of InternalRow, it was to get the majority of SQL working
>>>>>>>>>>>>>>>>> on top of the interface added after 2.4. That’s done and stable, so I think
>>>>>>>>>>>>>>>>> a 2.5 release with it is also reasonable.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> On Fri, Sep 20, 2019 at 11:23 AM Reynold Xin <
>>>>>>>>>>>>>>>>> rxin@databricks.com> wrote:
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> To push back, while I agree we should not drastically
>>>>>>>>>>>>>>>>>> change "InternalRow", there are a lot of changes that need to happen to
>>>>>>>>>>>>>>>>>> make it stable. For example, none of the publicly exposed interfaces should
>>>>>>>>>>>>>>>>>> be in the Catalyst package or the unsafe package. External implementations
>>>>>>>>>>>>>>>>>> should be decoupled from the internal implementations, with cheap ways to
>>>>>>>>>>>>>>>>>> convert back and forth.
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> When you created the PR to make InternalRow public, the
>>>>>>>>>>>>>>>>>> understanding was to work towards making it stable in the future, assuming
>>>>>>>>>>>>>>>>>> we will start with an unstable API temporarily. You can't just make a bunch of
>>>>>>>>>>>>>>>>>> internal APIs tightly coupled with other internal pieces public and stable
>>>>>>>>>>>>>>>>>> and call it a day, just because it happens to satisfy some use cases
>>>>>>>>>>>>>>>>>> temporarily assuming the rest of Spark doesn't change.
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> On Fri, Sep 20, 2019 at 11:19 AM, Ryan Blue <
>>>>>>>>>>>>>>>>>> rblue@netflix.com> wrote:
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> > DSv2 is far from stable right?
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> No, I think it is reasonably stable and very close to
>>>>>>>>>>>>>>>>>>> being ready for a release.
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> > All the actual data types are unstable and you guys
>>>>>>>>>>>>>>>>>>> have completely ignored that.
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> I think what you're referring to is the use of
>>>>>>>>>>>>>>>>>>> `InternalRow`. That's a stable API and there has been no work to avoid
>>>>>>>>>>>>>>>>>>> using it. In any case, I don't think that anyone is suggesting that we
>>>>>>>>>>>>>>>>>>> delay 3.0 until a replacement for `InternalRow` is added, right?
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> While I understand the motivation for a better solution
>>>>>>>>>>>>>>>>>>> here, I think the pragmatic solution is to continue using `InternalRow`.
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> > If the goal is to make DSv2 work across 3.x and 2.x,
>>>>>>>>>>>>>>>>>>> that seems too invasive of a change to backport once you consider the parts
>>>>>>>>>>>>>>>>>>> needed to make dsv2 stable.
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> I believe that those of us working on DSv2 are confident
>>>>>>>>>>>>>>>>>>> about the current stability. We set goals for what to get into the 3.0
>>>>>>>>>>>>>>>>>>> release months ago and have very nearly reached the point where we are
>>>>>>>>>>>>>>>>>>> ready for that release.
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> I don't think instability would be a problem in
>>>>>>>>>>>>>>>>>>> maintaining compatibility between the 2.5 version and the 3.0 version. If
>>>>>>>>>>>>>>>>>>> we find that we need to make API changes (other than additions) then we can
>>>>>>>>>>>>>>>>>>> make those in the 3.1 release. Because the goals we set for the 3.0 release
>>>>>>>>>>>>>>>>>>> have been reached with the current API and if we are ready to release 3.0,
>>>>>>>>>>>>>>>>>>> we can release a 2.5 with the same API.
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> On Fri, Sep 20, 2019 at 11:05 AM Reynold Xin <
>>>>>>>>>>>>>>>>>>> rxin@databricks.com> wrote:
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> DSv2 is far from stable right? All the actual data
>>>>>>>>>>>>>>>>>>>> types are unstable and you guys have completely ignored that. We'd need to
>>>>>>>>>>>>>>>>>>>> work on that and that will be a breaking change. If the goal is to make
>>>>>>>>>>>>>>>>>>>> DSv2 work across 3.x and 2.x, that seems too invasive of a change to
>>>>>>>>>>>>>>>>>>>> backport once you consider the parts needed to make dsv2 stable.
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> On Fri, Sep 20, 2019 at 10:47 AM, Ryan Blue <
>>>>>>>>>>>>>>>>>>>> rblue@netflix.com.invalid> wrote:
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> Hi everyone,
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> In the DSv2 sync this week, we talked about a possible
>>>>>>>>>>>>>>>>>>>>> Spark 2.5 release based on the latest Spark 2.4, but with DSv2 and Java 11
>>>>>>>>>>>>>>>>>>>>> support added.
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> A Spark 2.5 release with these two additions will help
>>>>>>>>>>>>>>>>>>>>> people migrate to Spark 3.0 when it is released because they will be able
>>>>>>>>>>>>>>>>>>>>> to use a single implementation for DSv2 sources that works in both 2.5 and
>>>>>>>>>>>>>>>>>>>>> 3.0. Similarly, upgrading to 3.0 won't also require updating to Java
>>>>>>>>>>>>>>>>>>>>> 11 because users could update to Java 11 with the 2.5 release and have
>>>>>>>>>>>>>>>>>>>>> fewer major changes.
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> Another reason to consider a 2.5 release is that many
>>>>>>>>>>>>>>>>>>>>> people are interested in a release with the latest DSv2 API and support for
>>>>>>>>>>>>>>>>>>>>> DSv2 SQL. I'm already going to be backporting DSv2 support to the Spark 2.4
>>>>>>>>>>>>>>>>>>>>> line, so it makes sense to share this work with the community.
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> This release line would just consist of backports like
>>>>>>>>>>>>>>>>>>>>> DSv2 and Java 11 that assist compatibility, to keep the scope of the
>>>>>>>>>>>>>>>>>>>>> release small. The purpose is to assist people moving to 3.0 and not
>>>>>>>>>>>>>>>>>>>>> distract from the 3.0 release.
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> Would a Spark 2.5 release help anyone else? Are there
>>>>>>>>>>>>>>>>>>>>> any concerns about this plan?
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> rb
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> --
>>>>>>>>>>>>>>>>>>>>> Ryan Blue
>>>>>>>>>>>>>>>>>>>>> Software Engineer
>>>>>>>>>>>>>>>>>>>>> Netflix
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> --
>>>>>>>>>>>>>>>>>>> Ryan Blue
>>>>>>>>>>>>>>>>>>> Software Engineer
>>>>>>>>>>>>>>>>>>> Netflix
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> --
>>>>>>>>>>>>>>>>> Ryan Blue
>>>>>>>>>>>>>>>>> Software Engineer
>>>>>>>>>>>>>>>>> Netflix
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> --
>>>>>>>>>>>>>>> Ryan Blue
>>>>>>>>>>>>>>> Software Engineer
>>>>>>>>>>>>>>> Netflix
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>> --
>>>>>>>>>>>>> Name : Jungtaek Lim
>>>>>>>>>>>>> Blog : http://medium.com/@heartsavior
>>>>>>>>>>>>> Twitter : http://twitter.com/heartsavior
>>>>>>>>>>>>> LinkedIn : http://www.linkedin.com/in/heartsavior
>>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> --
>>>>>>>>>>>> Ryan Blue
>>>>>>>>>>>> Software Engineer
>>>>>>>>>>>> Netflix
>>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> --
>>>>>>>>>> Ryan Blue
>>>>>>>>>>
>>>>>>>>> --
>>> Twitter: https://twitter.com/holdenkarau
>>> Books (Learning Spark, High Performance Spark, etc.):
>>> https://amzn.to/2MaRAG9  <https://amzn.to/2MaRAG9>
>>> YouTube Live Streams: https://www.youtube.com/user/holdenkarau
>>>
>>
>
> --
> Ryan Blue
> Software Engineer
> Netflix
>


-- 
Name : Jungtaek Lim
Blog : http://medium.com/@heartsavior
Twitter : http://twitter.com/heartsavior
LinkedIn : http://www.linkedin.com/in/heartsavior

Re: [DISCUSS] Spark 2.5 release

Posted by Ryan Blue <rb...@netflix.com.INVALID>.
From those questions, I can see that there is significant confusion about
what I'm proposing, so let me try to clear it up.

> 1. Is DSv2 stable in `master`?

DSv2 has reached a stable API that is capable of supporting all of the
features we intend to deliver for Spark 3.0. The proposal is to backport
the same API and features for Spark 2.5.

I am not saying that this API won't change after 3.0. Notably, Reynold
wants to change the use of InternalRow. But, these changes are after 3.0
and don't affect the compatibility I'm proposing, between the 2.5 and 3.0
releases. I also doubt that breaking changes would happen by 3.1.

> 2. If so, which subset of DSv2 patches is Ryan suggesting
backporting?

I am proposing backporting what we intend to deliver for 3.0: the API
currently in master, SQL support, and multi-catalog support.

> 3. How different do those backported DSv2 patches look in
`branch-2.4`?

DSv2 is mostly an addition located in the `connector` package. It also
changes some parts of the SQL parser and adds parsed plans, as well as new
rules to convert from parsed plans. This is not an invasive change because
we kept most of DSv2 separate. DSv2 should be nearly identical between the
two branches.

> 4. What does he mean by `without breaking changes`? Is it technically
feasible?

DSv2 is marked unstable in the 2.x line and changes between releases. The
API changed between 2.3 and 2.4, so this would be no different. But, we
would keep the API the same between 2.5 and 3.0 to assist migration.

This is technically feasible because what we are planning to deliver for
3.0 is nearly ready, and the API has not needed to change recently.

> Apache Spark 2.4.x and 2.5.x DSv2 should be compatible.

This has not been a requirement for DSv2 development so far. If this is a
new requirement, then we should not do a 2.5 release.

> 5. How long will it take? Is it possible before 3.0.0-preview? Who will
work on that backporting?

As I said, I'm already going to do this work, so I'm offering to release it
to the community. I don't know how long it will take, but this work and
3.0-preview are not mutually exclusive.

> 6. Is this meaningful if 2.5 and 3.1 diverge again too soon (in summer
2020)?

It is useful to me, so I assume it is useful to others.

I also think it is unlikely that 3.1 will need to make API changes to DSv2.
There may be some bugs found, but I don't think we will break API
compatibility so quickly. Most of the changes to the API will require only
additions.
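
To illustrate what I mean by additions, here is a hypothetical connector
mix-in (not an actual Spark interface): a new method can show up with a
default implementation, so sources compiled against the old version keep
working unchanged.

trait SupportsTinyPushdown {
  def pushFilters(filters: Array[String]): Unit

  // Added in a later release. Because it has a default, existing
  // implementations do not need to change.
  def pushLimit(limit: Int): Boolean = false
}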

> If you have a working branch, please share it with us.

I don't have a branch to share.


On Mon, Sep 23, 2019 at 6:47 PM Dongjoon Hyun <do...@gmail.com>
wrote:

> Hi, Ryan.
>
> This thread has many replies, as you can see. That is evidence that the
> community is quite interested in your suggestion.
>
> > I'm offering to help build a stable release without breaking changes.
> But if there is no community interest in it, I'm happy to drop this.
>
> In this thread, the root cause of the disagreement is the lack of
> supporting evidence for your claims.
>
> 1. Is DSv2 stable in `master`?
> 2. If so, what subset of DSv2 patches is Ryan suggesting backporting?
> 3. How different would those backported DSv2 patches look in `branch-2.4`?
> 4. What does he mean by `without breaking changes`? Is it technically
> feasible?
>     Apache Spark 2.4.x and 2.5.x DSv2 should be compatible. (Not between
> 2.5.x DSv2 and 3.0.0 DSv2)
> 5. How long will it take? Is it possible before 3.0.0-preview? Who will
> work on that backporting?
> 6. Is this meaningful if 2.5 and 3.1 diverge again too soon (in summer
> 2020)?
>
> We are SW engineers.
> If you have a working branch, please share it with us.
> It will help us understand your suggestion and this discussion.
> We can help you verify that branch achieves your goal.
> The branch is tested already, isn't it?
>
> Bests,
> Dongjoon.
>
>
>
>
> On Mon, Sep 23, 2019 at 10:44 AM Holden Karau <ho...@pigscanfly.ca>
> wrote:
>
>> I would personally love to see us provide a gentle migration path to
>> Spark 3, especially if much of the work is already going to happen anyway.
>>
>> Maybe giving it a different name (e.g., something like
>> Spark-2-to-3-transitional) would make its intended purpose clearer and
>> encourage folks to move to 3 when they can?
>>
>> On Mon, Sep 23, 2019 at 9:17 AM Ryan Blue <rb...@netflix.com.invalid>
>> wrote:
>>
>>> My understanding is that 3.0-preview is not going to be a
>>> production-ready release. For those of us that have been using backports of
>>> DSv2 in production, that doesn't help.
>>>
>>> It also doesn't help as a stepping stone because users would need to
>>> handle all of the incompatible changes in 3.0. Using 3.0-preview would be
>>> an unstable release with breaking changes instead of a stable release
>>> without the breaking changes.
>>>
>>> I'm offering to help build a stable release without breaking changes.
>>> But if there is no community interest in it, I'm happy to drop this.
>>>
>>> On Sun, Sep 22, 2019 at 6:39 PM Hyukjin Kwon <gu...@gmail.com>
>>> wrote:
>>>
>>>> +1 for Matei's as well.
>>>>
>>>> On Sun, 22 Sep 2019, 14:59 Marco Gaido, <ma...@gmail.com> wrote:
>>>>
>>>>> I agree with Matei too.
>>>>>
>>>>> Thanks,
>>>>> Marco
>>>>>
>>>>> On Sun, 22 Sep 2019 at 03:44, Dongjoon Hyun <
>>>>> dongjoon.hyun@gmail.com> wrote:
>>>>>
>>>>>> +1 for Matei's suggestion!
>>>>>>
>>>>>> Bests,
>>>>>> Dongjoon.
>>>>>>
>>>>>> On Sat, Sep 21, 2019 at 5:44 PM Matei Zaharia <
>>>>>> matei.zaharia@gmail.com> wrote:
>>>>>>
>>>>>>> If the goal is to get people to try the DSv2 API and build DSv2 data
>>>>>>> sources, can we recommend the 3.0-preview release for this? That would get
>>>>>>> people shifting to 3.0 faster, which is probably better overall compared to
>>>>>>> maintaining two major versions. There’s not that much else changing in 3.0
>>>>>>> if you already want to update your Java version.
>>>>>>>
>>>>>>> On Sep 21, 2019, at 2:45 PM, Ryan Blue <rb...@netflix.com.INVALID>
>>>>>>> wrote:
>>>>>>>
>>>>>>> > If you insist we shouldn't change the unstable temporary API in
>>>>>>> 3.x . . .
>>>>>>>
>>>>>>> Not what I'm saying at all. I said we should carefully
>>>>>>> consider whether a breaking change is the right decision in the 3.x line.
>>>>>>>
>>>>>>> All I'm suggesting is that we can make a 2.5 release with the
>>>>>>> feature and an API that is the same as the one in 3.0.
>>>>>>>
>>>>>>> > I also don't get this backporting a giant feature to 2.x line
>>>>>>>
>>>>>>> I am planning to do this so we can use DSv2 before 3.0 is released.
>>>>>>> Then we can have a source implementation that works in both 2.x and 3.0 to
>>>>>>> make the transition easier. Since I'm already doing the work, I'm offering
>>>>>>> to share it with the community.
>>>>>>>
>>>>>>>
>>>>>>> On Sat, Sep 21, 2019 at 2:36 PM Reynold Xin <rx...@databricks.com>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> Because for example we'd need to move the location of InternalRow,
>>>>>>>> breaking the package name. If you insist we shouldn't change the unstable
>>>>>>>> temporary API in 3.x to maintain compatibility with 3.0, which is totally
>>>>>>>> different from my understanding of the situation when you exposed it, then
>>>>>>>> I'd say we should gate 3.0 on having a stable row interface.
>>>>>>>>
>>>>>>>> I also don't get this backporting a giant feature to 2.x line ...
>>>>>>>> as suggested by others in the thread, DSv2 would be one of the main reasons
>>>>>>>> people upgrade to 3.0. What's so special about DSv2 that we are doing this?
>>>>>>>> Why not abandon 3.0 entirely and backport all the features to 2.x?
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> On Sat, Sep 21, 2019 at 2:31 PM, Ryan Blue <rb...@netflix.com>
>>>>>>>> wrote:
>>>>>>>>
>>>>>>>>> Why would that require an incompatible change?
>>>>>>>>>
>>>>>>>>> We *could* make an incompatible change and remove support for
>>>>>>>>> InternalRow, but I think we would want to carefully consider whether that
>>>>>>>>> is the right decision. And in any case, we would be able to keep 2.5 and
>>>>>>>>> 3.0 compatible, which is the main goal.
>>>>>>>>>
>>>>>>>>> On Sat, Sep 21, 2019 at 2:28 PM Reynold Xin <rx...@databricks.com>
>>>>>>>>> wrote:
>>>>>>>>>
>>>>>>>>>> How would you not make incompatible changes in 3.x? As discussed
>>>>>>>>>> the InternalRow API is not stable and needs to change.
>>>>>>>>>>
>>>>>>>>>> On Sat, Sep 21, 2019 at 2:27 PM Ryan Blue <rb...@netflix.com>
>>>>>>>>>> wrote:
>>>>>>>>>>
>>>>>>>>>>> > Making downstream projects diverge their implementations heavily
>>>>>>>>>>> between minor versions (say, 2.4 vs 2.5) wouldn't be a good experience
>>>>>>>>>>>
>>>>>>>>>>> You're right that the API has been evolving in the 2.x
>>>>>>>>>>> line. But, it is now reasonably stable with respect to the current feature
>>>>>>>>>>> set and we should not need to break compatibility in the 3.x line. Because
>>>>>>>>>>> we have reached our goals for the 3.0 release, we can backport at least
>>>>>>>>>>> those features to 2.x and confidently have an API that works in both a 2.x
>>>>>>>>>>> release and is compatible with 3.0, if not 3.1 and later releases as well.
>>>>>>>>>>>
>>>>>>>>>>> > I'd rather say preparation of Spark 2.5 should be started
>>>>>>>>>>> after Spark 3.0 is officially released
>>>>>>>>>>>
>>>>>>>>>>> The reason I'm suggesting this is that I'm already going to do
>>>>>>>>>>> the work to backport the 3.0 release features to 2.4. I've been asked by
>>>>>>>>>>> several people when DSv2 will be released, so I know there is a lot of
>>>>>>>>>>> interest in making this available sooner than 3.0. If I'm already doing the
>>>>>>>>>>> work, then I'd be happy to share that with the community.
>>>>>>>>>>>
>>>>>>>>>>> I don't see why 2.5 and 3.0 are mutually exclusive. We can work
>>>>>>>>>>> on 2.5 while preparing the 3.0 preview and fixing bugs. For DSv2, the work
>>>>>>>>>>> is about complete so we can easily release the same set of features and API
>>>>>>>>>>> in 2.5 and 3.0.
>>>>>>>>>>>
>>>>>>>>>>> If we decide for some reason to wait until after 3.0 is
>>>>>>>>>>> released, I don't know that there is much value in a 2.5. The purpose is to
>>>>>>>>>>> be a step toward 3.0, and releasing that step after 3.0 doesn't seem
>>>>>>>>>>> helpful to me. It also wouldn't get these features out any sooner than 3.0,
>>>>>>>>>>> as a 2.5 release probably would, given the work needed to validate the
>>>>>>>>>>> incompatible changes in 3.0.
>>>>>>>>>>>
>>>>>>>>>>> > DSv2 change would be the major backward incompatibility that makes
>>>>>>>>>>> Spark 2.x users hesitate to upgrade
>>>>>>>>>>>
>>>>>>>>>>> As I pointed out, DSv2 has been changing in the 2.x line, so
>>>>>>>>>>> this is expected. I don't think it will need incompatible changes in the
>>>>>>>>>>> 3.x line.
>>>>>>>>>>>
>>>>>>>>>>> On Fri, Sep 20, 2019 at 9:25 PM Jungtaek Lim <ka...@gmail.com>
>>>>>>>>>>> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> Just my 2 cents: I haven't tracked the changes to DSv2 (though I
>>>>>>>>>>>> needed to deal with them, as they caused confusion on my PRs...), but my
>>>>>>>>>>>> bet is that DSv2 has already changed in incompatible ways, at least for
>>>>>>>>>>>> anyone who maintains a custom DataSource. Making downstream projects
>>>>>>>>>>>> diverge their implementations heavily between minor versions (say, 2.4 vs
>>>>>>>>>>>> 2.5) wouldn't be a good experience - especially since we haven't
>>>>>>>>>>>> completely closed off the chance to modify DSv2 further, and such changes
>>>>>>>>>>>> could be backward incompatible.
>>>>>>>>>>>>
>>>>>>>>>>>> If we really want to bring the DSv2 changes to the 2.x line so that end
>>>>>>>>>>>> users can enjoy the new DSv2 without being forced to upgrade to Spark
>>>>>>>>>>>> 3.x, I'd rather say preparation of Spark 2.5 should be started after
>>>>>>>>>>>> Spark 3.0 is officially released, honestly even later than that, say
>>>>>>>>>>>> after getting some reports about DSv2 from Spark 3.0 so that we feel DSv2
>>>>>>>>>>>> is OK. I hope we don't make Spark 2.5 a kind of "tech-preview" that
>>>>>>>>>>>> frustrates Spark 2.4 users who upgrade to the next minor version.
>>>>>>>>>>>>
>>>>>>>>>>>> Btw, do we have any specific target users for this? Personally, the DSv2
>>>>>>>>>>>> change would be the major backward incompatibility that makes Spark 2.x
>>>>>>>>>>>> users hesitate to upgrade, so they might already be prepared to migrate
>>>>>>>>>>>> to Spark 3.0 if they are prepared to migrate to the new DSv2.
>>>>>>>>>>>>
>>>>>>>>>>>> On Sat, Sep 21, 2019 at 12:46 PM Dongjoon Hyun <
>>>>>>>>>>>> dongjoon.hyun@gmail.com> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> Do you mean you want to have a breaking API change between 3.0
>>>>>>>>>>>>> and 3.1?
>>>>>>>>>>>>> I believe we follow Semantic Versioning (
>>>>>>>>>>>>> https://spark.apache.org/versioning-policy.html ).
>>>>>>>>>>>>>
>>>>>>>>>>>>> > We just won’t add any breaking changes before 3.1.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Bests,
>>>>>>>>>>>>> Dongjoon.
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>> On Fri, Sep 20, 2019 at 11:48 AM Ryan Blue <
>>>>>>>>>>>>> rblue@netflix.com.invalid> wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>>> I don’t think we need to gate a 3.0 release on making a more
>>>>>>>>>>>>>> stable version of InternalRow
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Sounds like we agree, then. We will use it for 3.0, but there
>>>>>>>>>>>>>> are known problems with it.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Thinking we’d have dsv2 working in both 3.x (which will
>>>>>>>>>>>>>> change and progress towards more stable, but will have to break certain
>>>>>>>>>>>>>> APIs) and 2.x seems like a false premise.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Why do you think we will need to break certain APIs before
>>>>>>>>>>>>>> 3.0?
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> I’m only suggesting that we release the same support in a 2.5
>>>>>>>>>>>>>> release that we do in 3.0. Since we are nearly finished with the 3.0 goals,
>>>>>>>>>>>>>> it seems like we can certainly do that. We just won’t add any breaking
>>>>>>>>>>>>>> changes before 3.1.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> On Fri, Sep 20, 2019 at 11:39 AM Reynold Xin <
>>>>>>>>>>>>>> rxin@databricks.com> wrote:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> I don't think we need to gate a 3.0 release on making a more
>>>>>>>>>>>>>>> stable version of InternalRow, but thinking we'd have dsv2 working in both
>>>>>>>>>>>>>>> 3.x (which will change and progress towards more stable, but will have to
>>>>>>>>>>>>>>> break certain APIs) and 2.x seems like a false premise.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> To point out some problems with InternalRow that you think
>>>>>>>>>>>>>>> are already pragmatic and stable:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> The class is in catalyst, which states:
>>>>>>>>>>>>>>> https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/package.scala
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> /**
>>>>>>>>>>>>>>> * Catalyst is a library for manipulating relational query
>>>>>>>>>>>>>>> plans.  All classes in catalyst are
>>>>>>>>>>>>>>> * considered an internal API to Spark SQL and are subject to
>>>>>>>>>>>>>>> change between minor releases.
>>>>>>>>>>>>>>> */
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> There is not even an annotation on the interface.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> The entire dependency chain was created to be private and is
>>>>>>>>>>>>>>> tightly coupled with internal implementations. For example,
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> https://github.com/apache/spark/blob/master/common/unsafe/src/main/java/org/apache/spark/unsafe/types/UTF8String.java
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> /**
>>>>>>>>>>>>>>> * A UTF-8 String for internal Spark use.
>>>>>>>>>>>>>>> * <p>
>>>>>>>>>>>>>>> * A String encoded in UTF-8 as an Array[Byte], which can be
>>>>>>>>>>>>>>> used for comparison,
>>>>>>>>>>>>>>> * search, see http://en.wikipedia.org/wiki/UTF-8 for
>>>>>>>>>>>>>>> details.
>>>>>>>>>>>>>>> * <p>
>>>>>>>>>>>>>>> * Note: This is not designed for general use cases, should
>>>>>>>>>>>>>>> not be used outside SQL.
>>>>>>>>>>>>>>> */
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/ArrayData.scala
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> (which again is in catalyst package)
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> If you want to argue this way, you might as well argue we
>>>>>>>>>>>>>>> should make the entire catalyst package public to be pragmatic and not
>>>>>>>>>>>>>>> allow any changes.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> On Fri, Sep 20, 2019 at 11:32 AM, Ryan Blue <
>>>>>>>>>>>>>>> rblue@netflix.com> wrote:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> When you created the PR to make InternalRow public
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> This isn’t quite accurate. The change I made was to use
>>>>>>>>>>>>>>>> InternalRow instead of UnsafeRow, which is a specific
>>>>>>>>>>>>>>>> implementation of InternalRow. Exposing this API has
>>>>>>>>>>>>>>>> always been a part of DSv2 and while both you and I did some work to avoid
>>>>>>>>>>>>>>>> this, we are still in the phase of starting with that API.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Note that any change to InternalRow would be very costly
>>>>>>>>>>>>>>>> to implement because this interface is widely used. That is why I think we
>>>>>>>>>>>>>>>> can certainly consider it stable enough to use here, and that’s probably
>>>>>>>>>>>>>>>> why UnsafeRow was part of the original proposal.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> In any case, the goal for 3.0 was not to replace the use of
>>>>>>>>>>>>>>>> InternalRow, it was to get the majority of SQL working on
>>>>>>>>>>>>>>>> top of the interface added after 2.4. That’s done and stable, so I think a
>>>>>>>>>>>>>>>> 2.5 release with it is also reasonable.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> On Fri, Sep 20, 2019 at 11:23 AM Reynold Xin <
>>>>>>>>>>>>>>>> rxin@databricks.com> wrote:
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> To push back, while I agree we should not drastically
>>>>>>>>>>>>>>>>> change "InternalRow", there are a lot of changes that need to happen to
>>>>>>>>>>>>>>>>> make it stable. For example, none of the publicly exposed interfaces should
>>>>>>>>>>>>>>>>> be in the Catalyst package or the unsafe package. External implementations
>>>>>>>>>>>>>>>>> should be decoupled from the internal implementations, with cheap ways to
>>>>>>>>>>>>>>>>> convert back and forth.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> When you created the PR to make InternalRow public, the
>>>>>>>>>>>>>>>>> understanding was to work towards making it stable in the future, assuming
>>>>>>>>>>>>>>>>> we will start with an unstable API temporarily. You can't just make a bunch
>>>>>>>>>>>>>>>>> of internal APIs tightly coupled with other internal pieces public and
>>>>>>>>>>>>>>>>> stable and call it a day, just because it happens to satisfy some use
>>>>>>>>>>>>>>>>> cases temporarily, assuming the rest of Spark doesn't change.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> On Fri, Sep 20, 2019 at 11:19 AM, Ryan Blue <
>>>>>>>>>>>>>>>>> rblue@netflix.com> wrote:
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> > DSv2 is far from stable right?
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> No, I think it is reasonably stable and very close to
>>>>>>>>>>>>>>>>>> being ready for a release.
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> > All the actual data types are unstable and you guys
>>>>>>>>>>>>>>>>>> have completely ignored that.
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> I think what you're referring to is the use of
>>>>>>>>>>>>>>>>>> `InternalRow`. That's a stable API and there has been no work to avoid
>>>>>>>>>>>>>>>>>> using it. In any case, I don't think that anyone is suggesting that we
>>>>>>>>>>>>>>>>>> delay 3.0 until a replacement for `InternalRow` is added, right?
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> While I understand the motivation for a better solution
>>>>>>>>>>>>>>>>>> here, I think the pragmatic solution is to continue using `InternalRow`.
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> > If the goal is to make DSv2 work across 3.x and 2.x,
>>>>>>>>>>>>>>>>>> that seems too invasive of a change to backport once you consider the parts
>>>>>>>>>>>>>>>>>> needed to make dsv2 stable.
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> I believe that those of us working on DSv2 are confident
>>>>>>>>>>>>>>>>>> about the current stability. We set goals for what to get into the 3.0
>>>>>>>>>>>>>>>>>> release months ago and have very nearly reached the point where we are
>>>>>>>>>>>>>>>>>> ready for that release.
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> I don't think instability would be a problem in
>>>>>>>>>>>>>>>>>> maintaining compatibility between the 2.5 version and the 3.0 version. If
>>>>>>>>>>>>>>>>>> we find that we need to make API changes (other than additions) then we can
>>>>>>>>>>>>>>>>>> make those in the 3.1 release. Because the goals we set for the 3.0 release
>>>>>>>>>>>>>>>>>> have been reached with the current API, if we are ready to release 3.0,
>>>>>>>>>>>>>>>>>> we can release a 2.5 with the same API.
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> On Fri, Sep 20, 2019 at 11:05 AM Reynold Xin <
>>>>>>>>>>>>>>>>>> rxin@databricks.com> wrote:
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> DSv2 is far from stable right? All the actual data types
>>>>>>>>>>>>>>>>>>> are unstable and you guys have completely ignored that. We'd need to work
>>>>>>>>>>>>>>>>>>> on that and that will be a breaking change. If the goal is to make DSv2
>>>>>>>>>>>>>>>>>>> work across 3.x and 2.x, that seems too invasive of a change to backport
>>>>>>>>>>>>>>>>>>> once you consider the parts needed to make dsv2 stable.
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> On Fri, Sep 20, 2019 at 10:47 AM, Ryan Blue <
>>>>>>>>>>>>>>>>>>> rblue@netflix.com.invalid> wrote:
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> Hi everyone,
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> In the DSv2 sync this week, we talked about a possible
>>>>>>>>>>>>>>>>>>>> Spark 2.5 release based on the latest Spark 2.4, but with DSv2 and Java 11
>>>>>>>>>>>>>>>>>>>> support added.
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> A Spark 2.5 release with these two additions will help
>>>>>>>>>>>>>>>>>>>> people migrate to Spark 3.0 when it is released because they will be able
>>>>>>>>>>>>>>>>>>>> to use a single implementation for DSv2 sources that works in both 2.5 and
>>>>>>>>>>>>>>>>>>>> 3.0. Similarly, upgrading to 3.0 won't also require updating to Java
>>>>>>>>>>>>>>>>>>>> 11 because users could update to Java 11 with the 2.5 release and have
>>>>>>>>>>>>>>>>>>>> fewer major changes.
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> Another reason to consider a 2.5 release is that many
>>>>>>>>>>>>>>>>>>>> people are interested in a release with the latest DSv2 API and support for
>>>>>>>>>>>>>>>>>>>> DSv2 SQL. I'm already going to be backporting DSv2 support to the Spark 2.4
>>>>>>>>>>>>>>>>>>>> line, so it makes sense to share this work with the community.
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> This release line would just consist of backports like
>>>>>>>>>>>>>>>>>>>> DSv2 and Java 11 that assist compatibility, to keep the scope of the
>>>>>>>>>>>>>>>>>>>> release small. The purpose is to assist people moving to 3.0 and not
>>>>>>>>>>>>>>>>>>>> distract from the 3.0 release.
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> Would a Spark 2.5 release help anyone else? Are there
>>>>>>>>>>>>>>>>>>>> any concerns about this plan?
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> rb
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> --
>>>>>>>>>>>>>>>>>>>> Ryan Blue
>>>>>>>>>>>>>>>>>>>> Software Engineer
>>>>>>>>>>>>>>>>>>>> Netflix
>>>>>>>>>>>>>>>>>>>>

-- 
Ryan Blue
Software Engineer
Netflix

Re: [DISCUSS] Spark 2.5 release

Posted by Dongjoon Hyun <do...@gmail.com>.
Hi, Ryan.

This thread has many replies, as you can see. That is evidence that the
community is quite interested in your suggestion.

> I'm offering to help build a stable release without breaking changes. But
if there is no community interest in it, I'm happy to drop this.

In this thread, the root cause of the disagreement is the lack of
supporting evidence for your claims.

1. Is DSv2 stable in `master`?
2. If so, what subset of DSv2 patches is Ryan suggesting backporting?
3. How different would those backported DSv2 patches look in `branch-2.4`?
4. What does he mean by `without breaking changes`? Is it technically
feasible?
    Apache Spark 2.4.x and 2.5.x DSv2 should be compatible. (Not between
2.5.x DSv2 and 3.0.0 DSv2)
5. How long will it take? Is it possible before 3.0.0-preview? Who will
work on that backporting?
6. Is this meaningful if 2.5 and 3.1 diverge again too soon (in summer
2020)?

We are SW engineers.
If you have a working branch, please share it with us.
It will help us understand your suggestion and this discussion.
We can help you verify that branch achieves your goal.
The branch is tested already, isn't it?

Bests,
Dongjoon.





Re: [DISCUSS] Spark 2.5 release

Posted by Holden Karau <ho...@pigscanfly.ca>.
I would personally love to see us provide a gentle migration path to Spark
3, especially if much of the work is already going to happen anyway.

Maybe giving it a different name (e.g., something like
Spark-2-to-3-transitional) would make its intended purpose clearer and
encourage folks to move to 3 when they can?

On Mon, Sep 23, 2019 at 9:17 AM Ryan Blue <rb...@netflix.com.invalid> wrote:

> My understanding is that 3.0-preview is not going to be a production-ready
> release. For those of us that have been using backports of DSv2 in
> production, that doesn't help.
>
> It also doesn't help as a stepping stone because users would need to
> handle all of the incompatible changes in 3.0. Using 3.0-preview would be
> an unstable release with breaking changes instead of a stable release
> without the breaking changes.
>
> I'm offering to help build a stable release without breaking changes. But
> if there is no community interest in it, I'm happy to drop this.
>
> On Sun, Sep 22, 2019 at 6:39 PM Hyukjin Kwon <gu...@gmail.com> wrote:
>
>> +1 for Matei's as well.
>>
>> On Sun, 22 Sep 2019, 14:59 Marco Gaido, <ma...@gmail.com> wrote:
>>
>>> I agree with Matei too.
>>>
>>> Thanks,
>>> Marco
>>>
>>> On Sun, 22 Sep 2019 at 03:44, Dongjoon Hyun <
>>> dongjoon.hyun@gmail.com> wrote:
>>>
>>>> +1 for Matei's suggestion!
>>>>
>>>> Bests,
>>>> Dongjoon.
>>>>
>>>> On Sat, Sep 21, 2019 at 5:44 PM Matei Zaharia <ma...@gmail.com>
>>>> wrote:
>>>>
>>>>> If the goal is to get people to try the DSv2 API and build DSv2 data
>>>>> sources, can we recommend the 3.0-preview release for this? That would get
>>>>> people shifting to 3.0 faster, which is probably better overall compared to
>>>>> maintaining two major versions. There’s not that much else changing in 3.0
>>>>> if you already want to update your Java version.
>>>>>
>>>>> On Sep 21, 2019, at 2:45 PM, Ryan Blue <rb...@netflix.com.INVALID>
>>>>> wrote:
>>>>>
>>>>> > If you insist we shouldn't change the unstable temporary API in 3.x
>>>>> . . .
>>>>>
>>>>> Not what I'm saying at all. I said we should carefully
>>>>> consider whether a breaking change is the right decision in the 3.x line.
>>>>>
>>>>> All I'm suggesting is that we can make a 2.5 release with the feature
>>>>> and an API that is the same as the one in 3.0.
>>>>>
>>>>> > I also don't get this backporting a giant feature to 2.x line
>>>>>
>>>>> I am planning to do this so we can use DSv2 before 3.0 is released.
>>>>> Then we can have a source implementation that works in both 2.x and 3.0 to
>>>>> make the transition easier. Since I'm already doing the work, I'm offering
>>>>> to share it with the community.
>>>>>
>>>>>
>>>>> On Sat, Sep 21, 2019 at 2:36 PM Reynold Xin <rx...@databricks.com>
>>>>> wrote:
>>>>>
>>>>>> Because for example we'd need to move the location of InternalRow,
>>>>>> breaking the package name. If you insist we shouldn't change the unstable
>>>>>> temporary API in 3.x to maintain compatibility with 3.0, which is totally
>>>>>> different from my understanding of the situation when you exposed it, then
>>>>>> I'd say we should gate 3.0 on having a stable row interface.
>>>>>>
>>>>>> I also don't get this backporting a giant feature to 2.x line ... as
>>>>>> suggested by others in the thread, DSv2 would be one of the main reasons
>>>>>> people upgrade to 3.0. What's so special about DSv2 that we are doing this?
>>>>>> Why not abandon 3.0 entirely and backport all the features to 2.x?
>>>>>>
>>>>>>
>>>>>>
>>>>>> On Sat, Sep 21, 2019 at 2:31 PM, Ryan Blue <rb...@netflix.com> wrote:
>>>>>>
>>>>>>> Why would that require an incompatible change?
>>>>>>>
>>>>>>> We *could* make an incompatible change and remove support for
>>>>>>> InternalRow, but I think we would want to carefully consider whether that
>>>>>>> is the right decision. And in any case, we would be able to keep 2.5 and
>>>>>>> 3.0 compatible, which is the main goal.
>>>>>>>
>>>>>>> On Sat, Sep 21, 2019 at 2:28 PM Reynold Xin <rx...@databricks.com>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> How would you not make incompatible changes in 3.x? As discussed
>>>>>>>> the InternalRow API is not stable and needs to change.
>>>>>>>>
>>>>>>>> On Sat, Sep 21, 2019 at 2:27 PM Ryan Blue <rb...@netflix.com>
>>>>>>>> wrote:
>>>>>>>>
>>>>>>>>> > Making downstream projects diverge their implementations heavily
>>>>>>>>> between minor versions (say, 2.4 vs 2.5) wouldn't be a good experience
>>>>>>>>>
>>>>>>>>> You're right that the API has been evolving in the 2.x line. But,
>>>>>>>>> it is now reasonably stable with respect to the current feature set and we
>>>>>>>>> should not need to break compatibility in the 3.x line. Because we have
>>>>>>>>> reached our goals for the 3.0 release, we can backport at least those
>>>>>>>>> features to 2.x and confidently have an API that works in both a 2.x
>>>>>>>>> release and is compatible with 3.0, if not 3.1 and later releases as well.
>>>>>>>>>
>>>>>>>>> > I'd rather say preparation of Spark 2.5 should be started after
>>>>>>>>> Spark 3.0 is officially released
>>>>>>>>>
>>>>>>>>> The reason I'm suggesting this is that I'm already going to do the
>>>>>>>>> work to backport the 3.0 release features to 2.4. I've been asked by
>>>>>>>>> several people when DSv2 will be released, so I know there is a lot of
>>>>>>>>> interest in making this available sooner than 3.0. If I'm already doing the
>>>>>>>>> work, then I'd be happy to share that with the community.
>>>>>>>>>
>>>>>>>>> I don't see why 2.5 and 3.0 are mutually exclusive. We can work on
>>>>>>>>> 2.5 while preparing the 3.0 preview and fixing bugs. For DSv2, the work is
>>>>>>>>> about complete so we can easily release the same set of features and API in
>>>>>>>>> 2.5 and 3.0.
>>>>>>>>>
>>>>>>>>> If we decide for some reason to wait until after 3.0 is released,
>>>>>>>>> I don't know that there is much value in a 2.5. The purpose is to be a step
>>>>>>>>> toward 3.0, and releasing that step after 3.0 doesn't seem helpful to me.
>>>>>>>>> It also wouldn't get these features out any sooner than 3.0, as a 2.5
>>>>>>>>> release probably would, given the work needed to validate the incompatible
>>>>>>>>> changes in 3.0.
>>>>>>>>>
>>>>>>>>> > DSv2 change would be the major backward incompatibility that makes
>>>>>>>>> Spark 2.x users hesitate to upgrade
>>>>>>>>>
>>>>>>>>> As I pointed out, DSv2 has been changing in the 2.x line, so this
>>>>>>>>> is expected. I don't think it will need incompatible changes in the 3.x
>>>>>>>>> line.
>>>>>>>>>
>>>>>>>>> On Fri, Sep 20, 2019 at 9:25 PM Jungtaek Lim <ka...@gmail.com>
>>>>>>>>> wrote:
>>>>>>>>>
>>>>>>>>>> Just my 2 cents: I haven't tracked the changes to DSv2 (though I
>>>>>>>>>> needed to deal with them, as they caused confusion on my PRs...), but my
>>>>>>>>>> bet is that DSv2 has already changed in incompatible ways, at least for
>>>>>>>>>> anyone who maintains a custom DataSource. Making downstream projects
>>>>>>>>>> diverge their implementations heavily between minor versions (say, 2.4 vs
>>>>>>>>>> 2.5) wouldn't be a good experience - especially since we haven't
>>>>>>>>>> completely closed off the chance to modify DSv2 further, and such changes
>>>>>>>>>> could be backward incompatible.
>>>>>>>>>>
>>>>>>>>>> If we really want to bring the DSv2 changes to the 2.x line so that end
>>>>>>>>>> users can enjoy the new DSv2 without being forced to upgrade to Spark
>>>>>>>>>> 3.x, I'd rather say preparation of Spark 2.5 should be started after
>>>>>>>>>> Spark 3.0 is officially released, honestly even later than that, say
>>>>>>>>>> after getting some reports about DSv2 from Spark 3.0 so that we feel DSv2
>>>>>>>>>> is OK. I hope we don't make Spark 2.5 a kind of "tech-preview" that
>>>>>>>>>> frustrates Spark 2.4 users who upgrade to the next minor version.
>>>>>>>>>>
>>>>>>>>>> Btw, do we have any specific target users for this? Personally, the DSv2
>>>>>>>>>> change would be the major backward incompatibility that makes Spark 2.x
>>>>>>>>>> users hesitate to upgrade, so they might already be prepared to migrate
>>>>>>>>>> to Spark 3.0 if they are prepared to migrate to the new DSv2.
>>>>>>>>>>
>>>>>>>>>> On Sat, Sep 21, 2019 at 12:46 PM Dongjoon Hyun <
>>>>>>>>>> dongjoon.hyun@gmail.com> wrote:
>>>>>>>>>>
>>>>>>>>>>> Do you mean you want to have a breaking API change between 3.0
>>>>>>>>>>> and 3.1?
>>>>>>>>>>> I believe we follow Semantic Versioning (
>>>>>>>>>>> https://spark.apache.org/versioning-policy.html ).
>>>>>>>>>>>
>>>>>>>>>>> > We just won’t add any breaking changes before 3.1.
>>>>>>>>>>>
>>>>>>>>>>> Bests,
>>>>>>>>>>> Dongjoon.
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> On Fri, Sep 20, 2019 at 11:48 AM Ryan Blue <
>>>>>>>>>>> rblue@netflix.com.invalid> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> I don’t think we need to gate a 3.0 release on making a more
>>>>>>>>>>>> stable version of InternalRow
>>>>>>>>>>>>
>>>>>>>>>>>> Sounds like we agree, then. We will use it for 3.0, but there
>>>>>>>>>>>> are known problems with it.
>>>>>>>>>>>>
>>>>>>>>>>>> Thinking we’d have dsv2 working in both 3.x (which will change
>>>>>>>>>>>> and progress towards more stable, but will have to break certain APIs) and
>>>>>>>>>>>> 2.x seems like a false premise.
>>>>>>>>>>>>
>>>>>>>>>>>> Why do you think we will need to break certain APIs before 3.0?
>>>>>>>>>>>>
>>>>>>>>>>>> I’m only suggesting that we release the same support in a 2.5
>>>>>>>>>>>> release that we do in 3.0. Since we are nearly finished with the 3.0 goals,
>>>>>>>>>>>> it seems like we can certainly do that. We just won’t add any breaking
>>>>>>>>>>>> changes before 3.1.
>>>>>>>>>>>>
>>>>>>>>>>>> On Fri, Sep 20, 2019 at 11:39 AM Reynold Xin <
>>>>>>>>>>>> rxin@databricks.com> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> I don't think we need to gate a 3.0 release on making a more
>>>>>>>>>>>>> stable version of InternalRow, but thinking we'd have dsv2 working in both
>>>>>>>>>>>>> 3.x (which will change and progress towards more stable, but will have to
>>>>>>>>>>>>> break certain APIs) and 2.x seems like a false premise.
>>>>>>>>>>>>>
>>>>>>>>>>>>> To point out some problems with InternalRow that you think are
>>>>>>>>>>>>> already pragmatic and stable:
>>>>>>>>>>>>>
>>>>>>>>>>>>> The class is in catalyst, which states:
>>>>>>>>>>>>> https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/package.scala
>>>>>>>>>>>>>
>>>>>>>>>>>>> /**
>>>>>>>>>>>>> * Catalyst is a library for manipulating relational query
>>>>>>>>>>>>> plans.  All classes in catalyst are
>>>>>>>>>>>>> * considered an internal API to Spark SQL and are subject to
>>>>>>>>>>>>> change between minor releases.
>>>>>>>>>>>>> */
>>>>>>>>>>>>>
>>>>>>>>>>>>> There is not even an annotation on the interface.
>>>>>>>>>>>>>
>>>>>>>>>>>>> The entire dependency chain was created to be private and is
>>>>>>>>>>>>> tightly coupled with internal implementations. For example,
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>> https://github.com/apache/spark/blob/master/common/unsafe/src/main/java/org/apache/spark/unsafe/types/UTF8String.java
>>>>>>>>>>>>>
>>>>>>>>>>>>> /**
>>>>>>>>>>>>> * A UTF-8 String for internal Spark use.
>>>>>>>>>>>>> * <p>
>>>>>>>>>>>>> * A String encoded in UTF-8 as an Array[Byte], which can be
>>>>>>>>>>>>> used for comparison,
>>>>>>>>>>>>> * search, see http://en.wikipedia.org/wiki/UTF-8 for details.
>>>>>>>>>>>>> * <p>
>>>>>>>>>>>>> * Note: This is not designed for general use cases, should not
>>>>>>>>>>>>> be used outside SQL.
>>>>>>>>>>>>> */
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>> https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/ArrayData.scala
>>>>>>>>>>>>>
>>>>>>>>>>>>> (which again is in catalyst package)
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>> If you want to argue this way, you might as well argue we
>>>>>>>>>>>>> should make the entire catalyst package public to be pragmatic and not
>>>>>>>>>>>>> allow any changes.
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>> On Fri, Sep 20, 2019 at 11:32 AM, Ryan Blue <rblue@netflix.com
>>>>>>>>>>>>> > wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>>> When you created the PR to make InternalRow public
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> This isn’t quite accurate. The change I made was to use
>>>>>>>>>>>>>> InternalRow instead of UnsafeRow, which is a specific
>>>>>>>>>>>>>> implementation of InternalRow. Exposing this API has always
>>>>>>>>>>>>>> been a part of DSv2 and while both you and I did some work to avoid this,
>>>>>>>>>>>>>> we are still in the phase of starting with that API.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Note that any change to InternalRow would be very costly to
>>>>>>>>>>>>>> implement because this interface is widely used. That is why I think we can
>>>>>>>>>>>>>> certainly consider it stable enough to use here, and that’s probably why
>>>>>>>>>>>>>> UnsafeRow was part of the original proposal.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> In any case, the goal for 3.0 was not to replace the use of
>>>>>>>>>>>>>> InternalRow, it was to get the majority of SQL working on
>>>>>>>>>>>>>> top of the interface added after 2.4. That’s done and stable, so I think a
>>>>>>>>>>>>>> 2.5 release with it is also reasonable.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> On Fri, Sep 20, 2019 at 11:23 AM Reynold Xin <
>>>>>>>>>>>>>> rxin@databricks.com> wrote:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> To push back, while I agree we should not drastically change
>>>>>>>>>>>>>>> "InternalRow", there are a lot of changes that need to happen to make it
>>>>>>>>>>>>>>> stable. For example, none of the publicly exposed interfaces should be in
>>>>>>>>>>>>>>> the Catalyst package or the unsafe package. External implementations should
>>>>>>>>>>>>>>> be decoupled from the internal implementations, with cheap ways to convert
>>>>>>>>>>>>>>> back and forth.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> When you created the PR to make InternalRow public, the
>>>>>>>>>>>>>>> understanding was to work towards making it stable in the future, assuming
>>>>>>>>>>>>>>> we will start with an unstable API temporarily. You can't just make a bunch
>>>>>>>>>>>>>>> of internal APIs that are tightly coupled with other internal pieces public
>>>>>>>>>>>>>>> and stable and call it a day, just because they happen to satisfy some use
>>>>>>>>>>>>>>> cases temporarily, assuming the rest of Spark doesn't change.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> On Fri, Sep 20, 2019 at 11:19 AM, Ryan Blue <
>>>>>>>>>>>>>>> rblue@netflix.com> wrote:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> > DSv2 is far from stable right?
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> No, I think it is reasonably stable and very close to being
>>>>>>>>>>>>>>>> ready for a release.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> > All the actual data types are unstable and you guys have
>>>>>>>>>>>>>>>> completely ignored that.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> I think what you're referring to is the use of
>>>>>>>>>>>>>>>> `InternalRow`. That's a stable API and there has been no work to avoid
>>>>>>>>>>>>>>>> using it. In any case, I don't think that anyone is suggesting that we
>>>>>>>>>>>>>>>> delay 3.0 until a replacement for `InternalRow` is added, right?
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> While I understand the motivation for a better solution
>>>>>>>>>>>>>>>> here, I think the pragmatic solution is to continue using `InternalRow`.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> > If the goal is to make DSv2 work across 3.x and 2.x, that
>>>>>>>>>>>>>>>> seems too invasive of a change to backport once you consider the parts
>>>>>>>>>>>>>>>> needed to make dsv2 stable.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> I believe that those of us working on DSv2 are confident
>>>>>>>>>>>>>>>> about the current stability. We set goals for what to get into the 3.0
>>>>>>>>>>>>>>>> release months ago and have very nearly reached the point where we are
>>>>>>>>>>>>>>>> ready for that release.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> I don't think instability would be a problem in maintaining
>>>>>>>>>>>>>>>> compatibility between the 2.5 version and the 3.0 version. If we find that
>>>>>>>>>>>>>>>> we need to make API changes (other than additions) then we can make those
>>>>>>>>>>>>>>>> in the 3.1 release. The goals we set for the 3.0 release have been
>>>>>>>>>>>>>>>> reached with the current API, so if we are ready to release 3.0, we can
>>>>>>>>>>>>>>>> release a 2.5 with the same API.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> On Fri, Sep 20, 2019 at 11:05 AM Reynold Xin <
>>>>>>>>>>>>>>>> rxin@databricks.com> wrote:
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> DSv2 is far from stable right? All the actual data types
>>>>>>>>>>>>>>>>> are unstable and you guys have completely ignored that. We'd need to work
>>>>>>>>>>>>>>>>> on that and that will be a breaking change. If the goal is to make DSv2
>>>>>>>>>>>>>>>>> work across 3.x and 2.x, that seems too invasive of a change to backport
>>>>>>>>>>>>>>>>> once you consider the parts needed to make dsv2 stable.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> On Fri, Sep 20, 2019 at 10:47 AM, Ryan Blue <
>>>>>>>>>>>>>>>>> rblue@netflix.com.invalid> wrote:
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> Hi everyone,
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> In the DSv2 sync this week, we talked about a possible
>>>>>>>>>>>>>>>>>> Spark 2.5 release based on the latest Spark 2.4, but with DSv2 and Java 11
>>>>>>>>>>>>>>>>>> support added.
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> A Spark 2.5 release with these two additions will help
>>>>>>>>>>>>>>>>>> people migrate to Spark 3.0 when it is released because they will be able
>>>>>>>>>>>>>>>>>> to use a single implementation for DSv2 sources that works in both 2.5 and
>>>>>>>>>>>>>>>>>> 3.0. Similarly, upgrading to 3.0 won't also require updating to Java
>>>>>>>>>>>>>>>>>> 11 because users could update to Java 11 with the 2.5 release and have
>>>>>>>>>>>>>>>>>> fewer major changes.
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> Another reason to consider a 2.5 release is that many
>>>>>>>>>>>>>>>>>> people are interested in a release with the latest DSv2 API and support for
>>>>>>>>>>>>>>>>>> DSv2 SQL. I'm already going to be backporting DSv2 support to the Spark 2.4
>>>>>>>>>>>>>>>>>> line, so it makes sense to share this work with the community.
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> This release line would just consist of backports like
>>>>>>>>>>>>>>>>>> DSv2 and Java 11 that assist compatibility, to keep the scope of the
>>>>>>>>>>>>>>>>>> release small. The purpose is to assist people moving to 3.0 and not
>>>>>>>>>>>>>>>>>> distract from the 3.0 release.
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> Would a Spark 2.5 release help anyone else? Are there any
>>>>>>>>>>>>>>>>>> concerns about this plan?
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> rb
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> --
>>>>>>>>>>>>>>>>>> Ryan Blue
>>>>>>>>>>>>>>>>>> Software Engineer
>>>>>>>>>>>>>>>>>> Netflix
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> --
>>>>>>>>>>>>>>>> Ryan Blue
>>>>>>>>>>>>>>>> Software Engineer
>>>>>>>>>>>>>>>> Netflix
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> --
>>>>>>>>>>>>>> Ryan Blue
>>>>>>>>>>>>>> Software Engineer
>>>>>>>>>>>>>> Netflix
>>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> --
>>>>>>>>>>>> Ryan Blue
>>>>>>>>>>>> Software Engineer
>>>>>>>>>>>> Netflix
>>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> --
>>>>>>>>>> Name : Jungtaek Lim
>>>>>>>>>> Blog : http://medium.com/@heartsavior
>>>>>>>>>> Twitter : http://twitter.com/heartsavior
>>>>>>>>>> LinkedIn : http://www.linkedin.com/in/heartsavior
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> --
>>>>>>>>> Ryan Blue
>>>>>>>>> Software Engineer
>>>>>>>>> Netflix
>>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>> --
>>>>>>> Ryan Blue
>>>>>>>
>>>>>> --
Twitter: https://twitter.com/holdenkarau
Books (Learning Spark, High Performance Spark, etc.):
https://amzn.to/2MaRAG9
YouTube Live Streams: https://www.youtube.com/user/holdenkarau

Re: [DISCUSS] Spark 2.5 release

Posted by Ryan Blue <rb...@netflix.com.INVALID>.
My understanding is that 3.0-preview is not going to be a production-ready
release. For those of us who have been using backports of DSv2 in
production, that doesn't help.

It also doesn't help as a stepping stone because users would need to handle
all of the incompatible changes in 3.0. Using 3.0-preview would mean taking
an unstable release with breaking changes instead of a stable release
without them.
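
To make the compatibility idea concrete, here is a rough sketch of the kind
of DSv2 source piece I have in mind, written against the 3.0 connector API.
The package and class names below are my assumption of what a 2.5 backport
would keep, not a committed layout:

import org.apache.spark.sql.catalyst.InternalRow
import org.apache.spark.sql.connector.read.{InputPartition, PartitionReader, PartitionReaderFactory}

// A minimal reader factory that produces a single row. Rows cross the
// API boundary as InternalRow, the interface discussed in this thread.
class SingleRowReaderFactory extends PartitionReaderFactory {
  override def createReader(partition: InputPartition): PartitionReader[InternalRow] =
    new PartitionReader[InternalRow] {
      private var consumed = false
      override def next(): Boolean = { val more = !consumed; consumed = true; more }
      override def get(): InternalRow = InternalRow(42L) // a single bigint column
      override def close(): Unit = ()
    }
}

If a 2.5 backport keeps these packages and signatures, the same jar could
run on both release lines, which is the whole point.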

I'm offering to help build a stable release without breaking changes. But
if there is no community interest in it, I'm happy to drop this.

On Sun, Sep 22, 2019 at 6:39 PM Hyukjin Kwon <gu...@gmail.com> wrote:

> +1 for Matei's as well.
>
> On Sun, 22 Sep 2019, 14:59 Marco Gaido, <ma...@gmail.com> wrote:
>
>> I agree with Matei too.
>>
>> Thanks,
>> Marco
>>
>> Il giorno dom 22 set 2019 alle ore 03:44 Dongjoon Hyun <
>> dongjoon.hyun@gmail.com> ha scritto:
>>
>>> +1 for Matei's suggestion!
>>>
>>> Bests,
>>> Dongjoon.
>>>
>>> On Sat, Sep 21, 2019 at 5:44 PM Matei Zaharia <ma...@gmail.com>
>>> wrote:
>>>
>>>> If the goal is to get people to try the DSv2 API and build DSv2 data
>>>> sources, can we recommend the 3.0-preview release for this? That would get
>>>> people shifting to 3.0 faster, which is probably better overall compared to
>>>> maintaining two major versions. There’s not that much else changing in 3.0
>>>> if you already want to update your Java version.
>>>>
>>>> On Sep 21, 2019, at 2:45 PM, Ryan Blue <rb...@netflix.com.INVALID>
>>>> wrote:
>>>>
>>>> > If you insist we shouldn't change the unstable temporary API in 3.x . . .
>>>>
>>>> Not what I'm saying at all. I said we should carefully consider whether
>>>> a breaking change is the right decision in the 3.x line.
>>>>
>>>> All I'm suggesting is that we can make a 2.5 release with the feature
>>>> and an API that is the same as the one in 3.0.
>>>>
>>>> > I also don't get backporting a giant feature to the 2.x line
>>>>
>>>> I am planning to do this so we can use DSv2 before 3.0 is released.
>>>> Then we can have a source implementation that works in both 2.x and 3.0 to
>>>> make the transition easier. Since I'm already doing the work, I'm offering
>>>> to share it with the community.
>>>>
>>>>
>>>> On Sat, Sep 21, 2019 at 2:36 PM Reynold Xin <rx...@databricks.com>
>>>> wrote:
>>>>
>>>>> Because, for example, we'd need to move the location of InternalRow,
>>>>> breaking the package name. If you insist we shouldn't change the unstable
>>>>> temporary API in 3.x to maintain compatibility with 3.0, which is totally
>>>>> different from my understanding of the situation when you exposed it, then
>>>>> I'd say we should gate 3.0 on having a stable row interface.
>>>>>
>>>>> I also don't get backporting a giant feature to the 2.x line ... as
>>>>> suggested by others in the thread, DSv2 would be one of the main reasons
>>>>> people upgrade to 3.0. What's so special about DSv2 that we are doing this?
>>>>> Why not abandon 3.0 entirely and backport all the features to 2.x?
>>>>>
>>>>>
>>>>>
>>>>> On Sat, Sep 21, 2019 at 2:31 PM, Ryan Blue <rb...@netflix.com> wrote:
>>>>>
>>>>>> Why would that require an incompatible change?
>>>>>>
>>>>>> We *could* make an incompatible change and remove support for
>>>>>> InternalRow, but I think we would want to carefully consider whether that
>>>>>> is the right decision. And in any case, we would be able to keep 2.5 and
>>>>>> 3.0 compatible, which is the main goal.
>>>>>>
>>>>>> On Sat, Sep 21, 2019 at 2:28 PM Reynold Xin <rx...@databricks.com>
>>>>>> wrote:
>>>>>>
>>>>>> How would you not make incompatible changes in 3.x? As discussed the
>>>>>> InternalRow API is not stable and needs to change.
>>>>>>
>>>>>> On Sat, Sep 21, 2019 at 2:27 PM Ryan Blue <rb...@netflix.com> wrote:
>>>>>>
>>>>>> > Making downstream projects diverge their implementations heavily between
>>>>>> minor versions (say, 2.4 vs 2.5) wouldn't be a good experience
>>>>>>
>>>>>> You're right that the API has been evolving in the 2.x line. But, it
>>>>>> is now reasonably stable with respect to the current feature set and we
>>>>>> should not need to break compatibility in the 3.x line. Because we have
>>>>>> reached our goals for the 3.0 release, we can backport at least those
>>>>>> features to 2.x and confidently have an API that works in both a 2.x
>>>>>> release and is compatible with 3.0, if not 3.1 and later releases as well.
>>>>>>
>>>>>> > I'd rather say preparation of Spark 2.5 should start after
>>>>>> Spark 3.0 is officially released
>>>>>>
>>>>>> The reason I'm suggesting this is that I'm already going to do the
>>>>>> work to backport the 3.0 release features to 2.4. I've been asked by
>>>>>> several people when DSv2 will be released, so I know there is a lot of
>>>>>> interest in making this available sooner than 3.0. If I'm already doing the
>>>>>> work, then I'd be happy to share that with the community.
>>>>>>
>>>>>> I don't see why 2.5 and 3.0 are mutually exclusive. We can work on
>>>>>> 2.5 while preparing the 3.0 preview and fixing bugs. For DSv2, the work is
>>>>>> about complete so we can easily release the same set of features and API in
>>>>>> 2.5 and 3.0.
>>>>>>
>>>>>> If we decide for some reason to wait until after 3.0 is released, I
>>>>>> don't know that there is much value in a 2.5. The purpose is to be a step
>>>>>> toward 3.0, and releasing that step after 3.0 doesn't seem helpful to me.
>>>>>> It also wouldn't get these features out any sooner than 3.0, as a 2.5
>>>>>> release probably would, given the work needed to validate the incompatible
>>>>>> changes in 3.0.
>>>>>>
>>>>>> > the DSv2 change is the major backward incompatibility that makes Spark
>>>>>> 2.x users hesitate to upgrade
>>>>>>
>>>>>> As I pointed out, DSv2 has been changing in the 2.x line, so this is
>>>>>> expected. I don't think it will need incompatible changes in the 3.x line.
>>>>>>
>>>>>> On Fri, Sep 20, 2019 at 9:25 PM Jungtaek Lim <ka...@gmail.com>
>>>>>> wrote:
>>>>>>
>>>>>> Just 2 cents: I haven't tracked the changes to DSv2 (though I needed
>>>>>> to deal with them, as the changes caused confusion on my PRs...), but my bet
>>>>>> is that DSv2 has already changed in incompatible ways, at least for those
>>>>>> who work on custom DataSources. Making downstream projects diverge their
>>>>>> implementations heavily between minor versions (say, 2.4 vs 2.5) wouldn't be
>>>>>> a good experience - especially since we have not completely closed off the
>>>>>> chance to further modify DSv2, and such changes could be backward incompatible.
>>>>>>
>>>>>> If we really want to bring the DSv2 change to the 2.x version line, to let
>>>>>> end users enjoy the new DSv2 without being forced to upgrade to Spark 3.x,
>>>>>> I'd rather say preparation of Spark 2.5 should start after Spark 3.0 is
>>>>>> officially released - honestly, even later than that: after getting some
>>>>>> reports from Spark 3.0 users about DSv2, so that we feel DSv2 is OK. I hope
>>>>>> we don't make Spark 2.5 a kind of "tech-preview" that frustrates Spark 2.4
>>>>>> users when they upgrade to the next minor version.
>>>>>>
>>>>>> Btw, do we have any specific target users for this? Personally, I think
>>>>>> the DSv2 change is the major backward incompatibility that makes Spark 2.x
>>>>>> users hesitate to upgrade, so if they are prepared to migrate to the new
>>>>>> DSv2, they are probably already prepared to migrate to Spark 3.0.
>>>>>>
>>>>>> On Sat, Sep 21, 2019 at 12:46 PM Dongjoon Hyun <
>>>>>> dongjoon.hyun@gmail.com> wrote:
>>>>>>
>>>>>> Do you mean you want to have a breaking API change between 3.0 and
>>>>>> 3.1?
>>>>>> I believe we follow Semantic Versioning (
>>>>>> https://spark.apache.org/versioning-policy.html ).
>>>>>>
>>>>>> > We just won’t add any breaking changes before 3.1.
>>>>>>
>>>>>> Bests,
>>>>>> Dongjoon.
>>>>>>
>>>>>>
>>>>>> On Fri, Sep 20, 2019 at 11:48 AM Ryan Blue <rb...@netflix.com.invalid>
>>>>>> wrote:
>>>>>>
>>>>>> I don’t think we need to gate a 3.0 release on making a more stable
>>>>>> version of InternalRow
>>>>>>
>>>>>> Sounds like we agree, then. We will use it for 3.0, but there are
>>>>>> known problems with it.
>>>>>>
>>>>>> Thinking we’d have dsv2 working in both 3.x (which will change and
>>>>>> progress towards more stable, but will have to break certain APIs) and 2.x
>>>>>> seems like a false premise.
>>>>>>
>>>>>> Why do you think we will need to break certain APIs before 3.0?
>>>>>>
>>>>>> I’m only suggesting that we release the same support in a 2.5 release
>>>>>> that we do in 3.0. Since we are nearly finished with the 3.0 goals, it
>>>>>> seems like we can certainly do that. We just won’t add any breaking changes
>>>>>> before 3.1.
>>>>>>
>>>>>> On Fri, Sep 20, 2019 at 11:39 AM Reynold Xin <rx...@databricks.com>
>>>>>> wrote:
>>>>>>
>>>>>> I don't think we need to gate a 3.0 release on making a more stable
>>>>>> version of InternalRow, but thinking we'd have dsv2 working in both 3.x
>>>>>> (which will change and progress towards more stable, but will have to break
>>>>>> certain APIs) and 2.x seems like a false premise.
>>>>>>
>>>>>> To point out some problems with InternalRow, which you think is
>>>>>> already pragmatic and stable:
>>>>>>
>>>>>> The class is in catalyst, which states:
>>>>>> https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/package.scala
>>>>>>
>>>>>> /**
>>>>>> * Catalyst is a library for manipulating relational query plans.  All
>>>>>> classes in catalyst are
>>>>>> * considered an internal API to Spark SQL and are subject to change
>>>>>> between minor releases.
>>>>>> */
>>>>>>
>>>>>> There is not even an annotation on the interface.
>>>>>>
>>>>>> The entire dependency chain was created to be private, and tightly
>>>>>> coupled with internal implementations. For example,
>>>>>>
>>>>>>
>>>>>> https://github.com/apache/spark/blob/master/common/unsafe/src/main/java/org/apache/spark/unsafe/types/UTF8String.java
>>>>>>
>>>>>> /**
>>>>>> * A UTF-8 String for internal Spark use.
>>>>>> * <p>
>>>>>> * A String encoded in UTF-8 as an Array[Byte], which can be used for
>>>>>> comparison,
>>>>>> * search, see http://en.wikipedia.org/wiki/UTF-8 for details.
>>>>>> * <p>
>>>>>> * Note: This is not designed for general use cases, should not be
>>>>>> used outside SQL.
>>>>>> */
>>>>>>
>>>>>>
>>>>>>
>>>>>> https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/ArrayData.scala
>>>>>>
>>>>>> (which again is in catalyst package)
>>>>>>
>>>>>>
>>>>>> If you want to argue this way, you might as well argue we should make
>>>>>> the entire catalyst package public to be pragmatic and not allow any
>>>>>> changes.
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>> On Fri, Sep 20, 2019 at 11:32 AM, Ryan Blue <rb...@netflix.com>
>>>>>> wrote:
>>>>>>
>>>>>> When you created the PR to make InternalRow public
>>>>>>
>>>>>> This isn’t quite accurate. The change I made was to use InternalRow
>>>>>> instead of UnsafeRow, which is a specific implementation of
>>>>>> InternalRow. Exposing this API has always been a part of DSv2 and
>>>>>> while both you and I did some work to avoid this, we are still in the phase
>>>>>> of starting with that API.
>>>>>>
>>>>>> Note that any change to InternalRow would be very costly to
>>>>>> implement because this interface is widely used. That is why I think we can
>>>>>> certainly consider it stable enough to use here, and that’s probably why
>>>>>> UnsafeRow was part of the original proposal.
>>>>>>
>>>>>> In any case, the goal for 3.0 was not to replace the use of
>>>>>> InternalRow, it was to get the majority of SQL working on top of the
>>>>>> interface added after 2.4. That’s done and stable, so I think a 2.5 release
>>>>>> with it is also reasonable.
>>>>>>
>>>>>> On Fri, Sep 20, 2019 at 11:23 AM Reynold Xin <rx...@databricks.com>
>>>>>> wrote:
>>>>>>
>>>>>> To push back, while I agree we should not drastically change
>>>>>> "InternalRow", there are a lot of changes that need to happen to make it
>>>>>> stable. For example, none of the publicly exposed interfaces should be in
>>>>>> the Catalyst package or the unsafe package. External implementations should
>>>>>> be decoupled from the internal implementations, with cheap ways to convert
>>>>>> back and forth.
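>>>>>>
>>>>>> As a rough illustration of the decoupling I mean (a sketch only; the
>>>>>> names here are invented and nothing like this exists yet):
>>>>>>
>>>>>> import org.apache.spark.sql.catalyst.InternalRow
>>>>>>
>>>>>> // Hypothetically in a stable, public package, e.g.
>>>>>> // org.apache.spark.sql.connector.data, outside catalyst and unsafe.
>>>>>> trait ConnectorRow {
>>>>>>   def numFields: Int
>>>>>>   def getLong(ordinal: Int): Long
>>>>>> }
>>>>>>
>>>>>> object ConnectorRow {
>>>>>>   // Cheap boundary conversion: a zero-copy wrapper over the internal row.
>>>>>>   def wrap(internal: InternalRow): ConnectorRow = new ConnectorRow {
>>>>>>     override def numFields: Int = internal.numFields
>>>>>>     override def getLong(ordinal: Int): Long = internal.getLong(ordinal)
>>>>>>   }
>>>>>> }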
>>>>>>
>>>>>> When you created the PR to make InternalRow public, the understanding
>>>>>> was to work towards making it stable in the future, assuming we will start
>>>>>> with an unstable API temporarily. You can't just make a bunch of internal
>>>>>> APIs that are tightly coupled with other internal pieces public and stable
>>>>>> and call it a day, just because they happen to satisfy some use cases
>>>>>> temporarily, assuming the rest of Spark doesn't change.
>>>>>>
>>>>>>
>>>>>>
>>>>>> On Fri, Sep 20, 2019 at 11:19 AM, Ryan Blue <rb...@netflix.com>
>>>>>> wrote:
>>>>>>
>>>>>> > DSv2 is far from stable right?
>>>>>>
>>>>>> No, I think it is reasonably stable and very close to being ready for
>>>>>> a release.
>>>>>>
>>>>>> > All the actual data types are unstable and you guys have completely
>>>>>> ignored that.
>>>>>>
>>>>>> I think what you're referring to is the use of `InternalRow`. That's
>>>>>> a stable API and there has been no work to avoid using it. In any case, I
>>>>>> don't think that anyone is suggesting that we delay 3.0 until a replacement
>>>>>> for `InternalRow` is added, right?
>>>>>>
>>>>>> While I understand the motivation for a better solution here, I think
>>>>>> the pragmatic solution is to continue using `InternalRow`.
>>>>>>
>>>>>> > If the goal is to make DSv2 work across 3.x and 2.x, that seems too
>>>>>> invasive of a change to backport once you consider the parts needed to make
>>>>>> dsv2 stable.
>>>>>>
>>>>>> I believe that those of us working on DSv2 are confident about the
>>>>>> current stability. We set goals for what to get into the 3.0 release months
>>>>>> ago and have very nearly reached the point where we are ready for that
>>>>>> release.
>>>>>>
>>>>>> I don't think instability would be a problem in maintaining
>>>>>> compatibility between the 2.5 version and the 3.0 version. If we find that
>>>>>> we need to make API changes (other than additions) then we can make those
>>>>>> in the 3.1 release. The goals we set for the 3.0 release have been
>>>>>> reached with the current API, so if we are ready to release 3.0, we can
>>>>>> release a 2.5 with the same API.
>>>>>>
>>>>>> On Fri, Sep 20, 2019 at 11:05 AM Reynold Xin <rx...@databricks.com>
>>>>>> wrote:
>>>>>>
>>>>>> DSv2 is far from stable right? All the actual data types are unstable
>>>>>> and you guys have completely ignored that. We'd need to work on that and
>>>>>> that will be a breaking change. If the goal is to make DSv2 work across 3.x
>>>>>> and 2.x, that seems too invasive of a change to backport once you consider
>>>>>> the parts needed to make dsv2 stable.
>>>>>>
>>>>>>
>>>>>>
>>>>>> On Fri, Sep 20, 2019 at 10:47 AM, Ryan Blue <
>>>>>> rblue@netflix.com.invalid> wrote:
>>>>>>
>>>>>> Hi everyone,
>>>>>>
>>>>>> In the DSv2 sync this week, we talked about a possible Spark 2.5
>>>>>> release based on the latest Spark 2.4, but with DSv2 and Java 11 support
>>>>>> added.
>>>>>>
>>>>>> A Spark 2.5 release with these two additions will help people migrate
>>>>>> to Spark 3.0 when it is released because they will be able to use a single
>>>>>> implementation for DSv2 sources that works in both 2.5 and 3.0. Similarly,
>>>>>> upgrading to 3.0 won't also require updating to Java 11 because users
>>>>>> could update to Java 11 with the 2.5 release and have fewer major changes.
>>>>>>
>>>>>> Another reason to consider a 2.5 release is that many people are
>>>>>> interested in a release with the latest DSv2 API and support for DSv2 SQL.
>>>>>> I'm already going to be backporting DSv2 support to the Spark 2.4 line, so
>>>>>> it makes sense to share this work with the community.
>>>>>>
>>>>>> This release line would just consist of backports like DSv2 and Java
>>>>>> 11 that assist compatibility, to keep the scope of the release small. The
>>>>>> purpose is to assist people moving to 3.0 and not distract from the 3.0
>>>>>> release.
>>>>>>
>>>>>> Would a Spark 2.5 release help anyone else? Are there any concerns
>>>>>> about this plan?
>>>>>>
>>>>>>
>>>>>> rb
>>>>>>
>>>>>>
>>>>>> --
>>>>>> Ryan Blue
>>>>>> Software Engineer
>>>>>> Netflix
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>> --
>>>>>> Ryan Blue
>>>>>> Software Engineer
>>>>>> Netflix
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>> --
>>>>>> Ryan Blue
>>>>>> Software Engineer
>>>>>> Netflix
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>> --
>>>>>> Ryan Blue
>>>>>> Software Engineer
>>>>>> Netflix
>>>>>>
>>>>>>
>>>>>>
>>>>>> --
>>>>>> Name : Jungtaek Lim
>>>>>> Blog : http://medium.com/@heartsavior
>>>>>> Twitter : http://twitter.com/heartsavior
>>>>>> LinkedIn : http://www.linkedin.com/in/heartsavior
>>>>>>
>>>>>>
>>>>>>
>>>>>> --
>>>>>> Ryan Blue
>>>>>> Software Engineer
>>>>>> Netflix
>>>>>>
>>>>>>
>>>>>>
>>>>>> --
>>>>>> Ryan Blue
>>>>>> Software Engineer
>>>>>> Netflix
>>>>>>
>>>>>
>>>>>
>>>>
>>>> --
>>>> Ryan Blue
>>>> Software Engineer
>>>> Netflix
>>>>
>>>>
>>>>

-- 
Ryan Blue
Software Engineer
Netflix

Re: [DISCUSS] Spark 2.5 release

Posted by Hyukjin Kwon <gu...@gmail.com>.
+1 for Matei's as well.

On Sun, 22 Sep 2019, 14:59 Marco Gaido, <ma...@gmail.com> wrote:

> I agree with Matei too.
>
> Thanks,
> Marco
>
> Il giorno dom 22 set 2019 alle ore 03:44 Dongjoon Hyun <
> dongjoon.hyun@gmail.com> ha scritto:
>
>> +1 for Matei's suggestion!
>>
>> Bests,
>> Dongjoon.
>>
>> On Sat, Sep 21, 2019 at 5:44 PM Matei Zaharia <ma...@gmail.com>
>> wrote:
>>
>>> If the goal is to get people to try the DSv2 API and build DSv2 data
>>> sources, can we recommend the 3.0-preview release for this? That would get
>>> people shifting to 3.0 faster, which is probably better overall compared to
>>> maintaining two major versions. There’s not that much else changing in 3.0
>>> if you already want to update your Java version.
>>>
>>> On Sep 21, 2019, at 2:45 PM, Ryan Blue <rb...@netflix.com.INVALID>
>>> wrote:
>>>
>>> > If you insist we shouldn't change the unstable temporary API in 3.x . . .
>>>
>>> Not what I'm saying at all. I said we should carefully consider whether
>>> a breaking change is the right decision in the 3.x line.
>>>
>>> All I'm suggesting is that we can make a 2.5 release with the feature
>>> and an API that is the same as the one in 3.0.
>>>
>>> > I also don't get backporting a giant feature to the 2.x line
>>>
>>> I am planning to do this so we can use DSv2 before 3.0 is released. Then
>>> we can have a source implementation that works in both 2.x and 3.0 to make
>>> the transition easier. Since I'm already doing the work, I'm offering to
>>> share it with the community.
>>>
>>>
>>> On Sat, Sep 21, 2019 at 2:36 PM Reynold Xin <rx...@databricks.com> wrote:
>>>
>>>> Because, for example, we'd need to move the location of InternalRow,
>>>> breaking the package name. If you insist we shouldn't change the unstable
>>>> temporary API in 3.x to maintain compatibility with 3.0, which is totally
>>>> different from my understanding of the situation when you exposed it, then
>>>> I'd say we should gate 3.0 on having a stable row interface.
>>>>
>>>> I also don't get backporting a giant feature to the 2.x line ... as
>>>> suggested by others in the thread, DSv2 would be one of the main reasons
>>>> people upgrade to 3.0. What's so special about DSv2 that we are doing this?
>>>> Why not abandon 3.0 entirely and backport all the features to 2.x?
>>>>
>>>>
>>>>
>>>> On Sat, Sep 21, 2019 at 2:31 PM, Ryan Blue <rb...@netflix.com> wrote:
>>>>
>>>>> Why would that require an incompatible change?
>>>>>
>>>>> We *could* make an incompatible change and remove support for
>>>>> InternalRow, but I think we would want to carefully consider whether that
>>>>> is the right decision. And in any case, we would be able to keep 2.5 and
>>>>> 3.0 compatible, which is the main goal.
>>>>>
>>>>> On Sat, Sep 21, 2019 at 2:28 PM Reynold Xin <rx...@databricks.com>
>>>>> wrote:
>>>>>
>>>>> How would you not make incompatible changes in 3.x? As discussed the
>>>>> InternalRow API is not stable and needs to change.
>>>>>
>>>>> On Sat, Sep 21, 2019 at 2:27 PM Ryan Blue <rb...@netflix.com> wrote:
>>>>>
>>>>> > Making downstream projects diverge their implementations heavily between
>>>>> minor versions (say, 2.4 vs 2.5) wouldn't be a good experience
>>>>>
>>>>> You're right that the API has been evolving in the 2.x line. But, it
>>>>> is now reasonably stable with respect to the current feature set and we
>>>>> should not need to break compatibility in the 3.x line. Because we have
>>>>> reached our goals for the 3.0 release, we can backport at least those
>>>>> features to 2.x and confidently have an API that works in both a 2.x
>>>>> release and is compatible with 3.0, if not 3.1 and later releases as well.
>>>>>
>>>>> > I'd rather say preparation of Spark 2.5 should start after
>>>>> Spark 3.0 is officially released
>>>>>
>>>>> The reason I'm suggesting this is that I'm already going to do the
>>>>> work to backport the 3.0 release features to 2.4. I've been asked by
>>>>> several people when DSv2 will be released, so I know there is a lot of
>>>>> interest in making this available sooner than 3.0. If I'm already doing the
>>>>> work, then I'd be happy to share that with the community.
>>>>>
>>>>> I don't see why 2.5 and 3.0 are mutually exclusive. We can work on 2.5
>>>>> while preparing the 3.0 preview and fixing bugs. For DSv2, the work is
>>>>> about complete so we can easily release the same set of features and API in
>>>>> 2.5 and 3.0.
>>>>>
>>>>> If we decide for some reason to wait until after 3.0 is released, I
>>>>> don't know that there is much value in a 2.5. The purpose is to be a step
>>>>> toward 3.0, and releasing that step after 3.0 doesn't seem helpful to me.
>>>>> It also wouldn't get these features out any sooner than 3.0, as a 2.5
>>>>> release probably would, given the work needed to validate the incompatible
>>>>> changes in 3.0.
>>>>>
>>>>> > the DSv2 change is the major backward incompatibility that makes Spark
>>>>> 2.x users hesitate to upgrade
>>>>>
>>>>> As I pointed out, DSv2 has been changing in the 2.x line, so this is
>>>>> expected. I don't think it will need incompatible changes in the 3.x line.
>>>>>
>>>>> On Fri, Sep 20, 2019 at 9:25 PM Jungtaek Lim <ka...@gmail.com>
>>>>> wrote:
>>>>>
>>>>> Just 2 cents: I haven't tracked the changes to DSv2 (though I needed to
>>>>> deal with them, as the changes caused confusion on my PRs...), but my bet is
>>>>> that DSv2 has already changed in incompatible ways, at least for those who
>>>>> work on custom DataSources. Making downstream projects diverge their
>>>>> implementations heavily between minor versions (say, 2.4 vs 2.5) wouldn't be
>>>>> a good experience - especially since we have not completely closed off the
>>>>> chance to further modify DSv2, and such changes could be backward incompatible.
>>>>>
>>>>> If we really want to bring the DSv2 change to the 2.x version line, to let
>>>>> end users enjoy the new DSv2 without being forced to upgrade to Spark 3.x,
>>>>> I'd rather say preparation of Spark 2.5 should start after Spark 3.0 is
>>>>> officially released - honestly, even later than that: after getting some
>>>>> reports from Spark 3.0 users about DSv2, so that we feel DSv2 is OK. I hope
>>>>> we don't make Spark 2.5 a kind of "tech-preview" that frustrates Spark 2.4
>>>>> users when they upgrade to the next minor version.
>>>>>
>>>>> Btw, do we have any specific target users for this? Personally, I think the
>>>>> DSv2 change is the major backward incompatibility that makes Spark 2.x users
>>>>> hesitate to upgrade, so if they are prepared to migrate to the new DSv2, they
>>>>> are probably already prepared to migrate to Spark 3.0.
>>>>>
>>>>> On Sat, Sep 21, 2019 at 12:46 PM Dongjoon Hyun <
>>>>> dongjoon.hyun@gmail.com> wrote:
>>>>>
>>>>> Do you mean you want to have a breaking API change between 3.0 and 3.1?
>>>>> I believe we follow Semantic Versioning (
>>>>> https://spark.apache.org/versioning-policy.html ).
>>>>>
>>>>> > We just won’t add any breaking changes before 3.1.
>>>>>
>>>>> Bests,
>>>>> Dongjoon.
>>>>>
>>>>>
>>>>> On Fri, Sep 20, 2019 at 11:48 AM Ryan Blue <rb...@netflix.com.invalid>
>>>>> wrote:
>>>>>
>>>>> I don’t think we need to gate a 3.0 release on making a more stable
>>>>> version of InternalRow
>>>>>
>>>>> Sounds like we agree, then. We will use it for 3.0, but there are
>>>>> known problems with it.
>>>>>
>>>>> Thinking we’d have dsv2 working in both 3.x (which will change and
>>>>> progress towards more stable, but will have to break certain APIs) and 2.x
>>>>> seems like a false premise.
>>>>>
>>>>> Why do you think we will need to break certain APIs before 3.0?
>>>>>
>>>>> I’m only suggesting that we release the same support in a 2.5 release
>>>>> that we do in 3.0. Since we are nearly finished with the 3.0 goals, it
>>>>> seems like we can certainly do that. We just won’t add any breaking changes
>>>>> before 3.1.
>>>>>
>>>>> On Fri, Sep 20, 2019 at 11:39 AM Reynold Xin <rx...@databricks.com>
>>>>> wrote:
>>>>>
>>>>> I don't think we need to gate a 3.0 release on making a more stable
>>>>> version of InternalRow, but thinking we'd have dsv2 working in both 3.x
>>>>> (which will change and progress towards more stable, but will have to break
>>>>> certain APIs) and 2.x seems like a false premise.
>>>>>
>>>>> To point out some problems with InternalRow, which you think is already
>>>>> pragmatic and stable:
>>>>>
>>>>> The class is in catalyst, which states:
>>>>> https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/package.scala
>>>>>
>>>>> /**
>>>>> * Catalyst is a library for manipulating relational query plans.  All
>>>>> classes in catalyst are
>>>>> * considered an internal API to Spark SQL and are subject to change
>>>>> between minor releases.
>>>>> */
>>>>>
>>>>> There is not even an annotation on the interface.
>>>>>
>>>>> The entire dependency chain was created to be private, and tightly
>>>>> coupled with internal implementations. For example,
>>>>>
>>>>>
>>>>> https://github.com/apache/spark/blob/master/common/unsafe/src/main/java/org/apache/spark/unsafe/types/UTF8String.java
>>>>>
>>>>> /**
>>>>> * A UTF-8 String for internal Spark use.
>>>>> * <p>
>>>>> * A String encoded in UTF-8 as an Array[Byte], which can be used for
>>>>> comparison,
>>>>> * search, see http://en.wikipedia.org/wiki/UTF-8 for details.
>>>>> * <p>
>>>>> * Note: This is not designed for general use cases, should not be used
>>>>> outside SQL.
>>>>> */
>>>>>
>>>>>
>>>>>
>>>>> https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/ArrayData.scala
>>>>>
>>>>> (which again is in catalyst package)
>>>>>
>>>>>
>>>>> If you want to argue this way, you might as well argue we should make
>>>>> the entire catalyst package public to be pragmatic and not allow any
>>>>> changes.
>>>>>
>>>>>
>>>>>
>>>>>
>>>>> On Fri, Sep 20, 2019 at 11:32 AM, Ryan Blue <rb...@netflix.com> wrote:
>>>>>
>>>>> When you created the PR to make InternalRow public
>>>>>
>>>>> This isn’t quite accurate. The change I made was to use InternalRow
>>>>> instead of UnsafeRow, which is a specific implementation of
>>>>> InternalRow. Exposing this API has always been a part of DSv2 and
>>>>> while both you and I did some work to avoid this, we are still in the phase
>>>>> of starting with that API.
>>>>>
>>>>> Note that any change to InternalRow would be very costly to implement
>>>>> because this interface is widely used. That is why I think we can certainly
>>>>> consider it stable enough to use here, and that’s probably why
>>>>> UnsafeRow was part of the original proposal.
>>>>>
>>>>> In any case, the goal for 3.0 was not to replace the use of
>>>>> InternalRow, it was to get the majority of SQL working on top of the
>>>>> interface added after 2.4. That’s done and stable, so I think a 2.5 release
>>>>> with it is also reasonable.
>>>>>
>>>>> On Fri, Sep 20, 2019 at 11:23 AM Reynold Xin <rx...@databricks.com>
>>>>> wrote:
>>>>>
>>>>> To push back, while I agree we should not drastically change
>>>>> "InternalRow", there are a lot of changes that need to happen to make it
>>>>> stable. For example, none of the publicly exposed interfaces should be in
>>>>> the Catalyst package or the unsafe package. External implementations should
>>>>> be decoupled from the internal implementations, with cheap ways to convert
>>>>> back and forth.
>>>>>
>>>>> When you created the PR to make InternalRow public, the understanding
>>>>> was to work towards making it stable in the future, assuming we will start
>>>>> with an unstable API temporarily. You can't just make a bunch of internal
>>>>> APIs that are tightly coupled with other internal pieces public and stable
>>>>> and call it a day, just because they happen to satisfy some use cases
>>>>> temporarily, assuming the rest of Spark doesn't change.
>>>>>
>>>>>
>>>>>
>>>>> On Fri, Sep 20, 2019 at 11:19 AM, Ryan Blue <rb...@netflix.com> wrote:
>>>>>
>>>>> > DSv2 is far from stable right?
>>>>>
>>>>> No, I think it is reasonably stable and very close to being ready for
>>>>> a release.
>>>>>
>>>>> > All the actual data types are unstable and you guys have completely
>>>>> ignored that.
>>>>>
>>>>> I think what you're referring to is the use of `InternalRow`. That's a
>>>>> stable API and there has been no work to avoid using it. In any case, I
>>>>> don't think that anyone is suggesting that we delay 3.0 until a replacement
>>>>> for `InternalRow` is added, right?
>>>>>
>>>>> While I understand the motivation for a better solution here, I think
>>>>> the pragmatic solution is to continue using `InternalRow`.
>>>>>
>>>>> > If the goal is to make DSv2 work across 3.x and 2.x, that seems too
>>>>> invasive of a change to backport once you consider the parts needed to make
>>>>> dsv2 stable.
>>>>>
>>>>> I believe that those of us working on DSv2 are confident about the
>>>>> current stability. We set goals for what to get into the 3.0 release months
>>>>> ago and have very nearly reached the point where we are ready for that
>>>>> release.
>>>>>
>>>>> I don't think instability would be a problem in maintaining
>>>>> compatibility between the 2.5 version and the 3.0 version. If we find that
>>>>> we need to make API changes (other than additions) then we can make those
>>>>> in the 3.1 release. The goals we set for the 3.0 release have been
>>>>> reached with the current API, so if we are ready to release 3.0, we can
>>>>> release a 2.5 with the same API.
>>>>>
>>>>> On Fri, Sep 20, 2019 at 11:05 AM Reynold Xin <rx...@databricks.com>
>>>>> wrote:
>>>>>
>>>>> DSv2 is far from stable right? All the actual data types are unstable
>>>>> and you guys have completely ignored that. We'd need to work on that and
>>>>> that will be a breaking change. If the goal is to make DSv2 work across 3.x
>>>>> and 2.x, that seems too invasive of a change to backport once you consider
>>>>> the parts needed to make dsv2 stable.
>>>>>
>>>>>
>>>>>
>>>>> On Fri, Sep 20, 2019 at 10:47 AM, Ryan Blue <rblue@netflix.com.invalid
>>>>> > wrote:
>>>>>
>>>>> Hi everyone,
>>>>>
>>>>> In the DSv2 sync this week, we talked about a possible Spark 2.5
>>>>> release based on the latest Spark 2.4, but with DSv2 and Java 11 support
>>>>> added.
>>>>>
>>>>> A Spark 2.5 release with these two additions will help people migrate
>>>>> to Spark 3.0 when it is released because they will be able to use a single
>>>>> implementation for DSv2 sources that works in both 2.5 and 3.0. Similarly,
>>>>> upgrading to 3.0 won't also require updating to Java 11 because users
>>>>> could update to Java 11 with the 2.5 release and have fewer major changes.
>>>>>
>>>>> Another reason to consider a 2.5 release is that many people are
>>>>> interested in a release with the latest DSv2 API and support for DSv2 SQL.
>>>>> I'm already going to be backporting DSv2 support to the Spark 2.4 line, so
>>>>> it makes sense to share this work with the community.
>>>>>
>>>>> This release line would just consist of backports like DSv2 and Java
>>>>> 11 that assist compatibility, to keep the scope of the release small. The
>>>>> purpose is to assist people moving to 3.0 and not distract from the 3.0
>>>>> release.
>>>>>
>>>>> Would a Spark 2.5 release help anyone else? Are there any concerns
>>>>> about this plan?
>>>>>
>>>>>
>>>>> rb
>>>>>
>>>>>
>>>>> --
>>>>> Ryan Blue
>>>>> Software Engineer
>>>>> Netflix
>>>>>
>>>>>
>>>>>
>>>>>
>>>>> --
>>>>> Ryan Blue
>>>>> Software Engineer
>>>>> Netflix
>>>>>
>>>>>
>>>>>
>>>>>
>>>>> --
>>>>> Ryan Blue
>>>>> Software Engineer
>>>>> Netflix
>>>>>
>>>>>
>>>>>
>>>>>
>>>>> --
>>>>> Ryan Blue
>>>>> Software Engineer
>>>>> Netflix
>>>>>
>>>>>
>>>>>
>>>>> --
>>>>> Name : Jungtaek Lim
>>>>> Blog : http://medium.com/@heartsavior
>>>>> Twitter : http://twitter.com/heartsavior
>>>>> LinkedIn : http://www.linkedin.com/in/heartsavior
>>>>>
>>>>>
>>>>>
>>>>> --
>>>>> Ryan Blue
>>>>> Software Engineer
>>>>> Netflix
>>>>>
>>>>>
>>>>>
>>>>> --
>>>>> Ryan Blue
>>>>> Software Engineer
>>>>> Netflix
>>>>>
>>>>
>>>>
>>>
>>> --
>>> Ryan Blue
>>> Software Engineer
>>> Netflix
>>>
>>>
>>>

Re: [DISCUSS] Spark 2.5 release

Posted by Marco Gaido <ma...@gmail.com>.
I agree with Matei too.

Thanks,
Marco

Il giorno dom 22 set 2019 alle ore 03:44 Dongjoon Hyun <
dongjoon.hyun@gmail.com> ha scritto:

> +1 for Matei's suggestion!
>
> Bests,
> Dongjoon.
>
> On Sat, Sep 21, 2019 at 5:44 PM Matei Zaharia <ma...@gmail.com>
> wrote:
>
>> If the goal is to get people to try the DSv2 API and build DSv2 data
>> sources, can we recommend the 3.0-preview release for this? That would get
>> people shifting to 3.0 faster, which is probably better overall compared to
>> maintaining two major versions. There’s not that much else changing in 3.0
>> if you already want to update your Java version.
>>
>> On Sep 21, 2019, at 2:45 PM, Ryan Blue <rb...@netflix.com.INVALID> wrote:
>>
>> > If you insist we shouldn't change the unstable temporary API in 3.x . . .
>>
>> Not what I'm saying at all. I said we should carefully consider whether a
>> breaking change is the right decision in the 3.x line.
>>
>> All I'm suggesting is that we can make a 2.5 release with the feature and
>> an API that is the same as the one in 3.0.
>>
>> > I also don't get backporting a giant feature to the 2.x line
>>
>> I am planning to do this so we can use DSv2 before 3.0 is released. Then
>> we can have a source implementation that works in both 2.x and 3.0 to make
>> the transition easier. Since I'm already doing the work, I'm offering to
>> share it with the community.
>>
>>
>> On Sat, Sep 21, 2019 at 2:36 PM Reynold Xin <rx...@databricks.com> wrote:
>>
>>> Because, for example, we'd need to move the location of InternalRow,
>>> breaking the package name. If you insist we shouldn't change the unstable
>>> temporary API in 3.x to maintain compatibility with 3.0, which is totally
>>> different from my understanding of the situation when you exposed it, then
>>> I'd say we should gate 3.0 on having a stable row interface.
>>>
>>> I also don't get backporting a giant feature to the 2.x line ... as
>>> suggested by others in the thread, DSv2 would be one of the main reasons
>>> people upgrade to 3.0. What's so special about DSv2 that we are doing this?
>>> Why not abandon 3.0 entirely and backport all the features to 2.x?
>>>
>>>
>>>
>>> On Sat, Sep 21, 2019 at 2:31 PM, Ryan Blue <rb...@netflix.com> wrote:
>>>
>>>> Why would that require an incompatible change?
>>>>
>>>> We *could* make an incompatible change and remove support for
>>>> InternalRow, but I think we would want to carefully consider whether that
>>>> is the right decision. And in any case, we would be able to keep 2.5 and
>>>> 3.0 compatible, which is the main goal.
>>>>
>>>> On Sat, Sep 21, 2019 at 2:28 PM Reynold Xin <rx...@databricks.com>
>>>> wrote:
>>>>
>>>> How would you not make incompatible changes in 3.x? As discussed the
>>>> InternalRow API is not stable and needs to change.
>>>>
>>>> On Sat, Sep 21, 2019 at 2:27 PM Ryan Blue <rb...@netflix.com> wrote:
>>>>
>>>> > Making downstream projects diverge their implementations heavily between
>>>> minor versions (say, 2.4 vs 2.5) wouldn't be a good experience
>>>>
>>>> You're right that the API has been evolving in the 2.x line. But, it is
>>>> now reasonably stable with respect to the current feature set and we should
>>>> not need to break compatibility in the 3.x line. Because we have reached
>>>> our goals for the 3.0 release, we can backport at least those features to
>>>> 2.x and confidently have an API that works in both a 2.x release and is
>>>> compatible with 3.0, if not 3.1 and later releases as well.
>>>>
>>>> > I'd rather say preparation of Spark 2.5 should start after Spark
>>>> 3.0 is officially released
>>>>
>>>> The reason I'm suggesting this is that I'm already going to do the work
>>>> to backport the 3.0 release features to 2.4. I've been asked by several
>>>> people when DSv2 will be released, so I know there is a lot of interest in
>>>> making this available sooner than 3.0. If I'm already doing the work, then
>>>> I'd be happy to share that with the community.
>>>>
>>>> I don't see why 2.5 and 3.0 are mutually exclusive. We can work on 2.5
>>>> while preparing the 3.0 preview and fixing bugs. For DSv2, the work is
>>>> about complete so we can easily release the same set of features and API in
>>>> 2.5 and 3.0.
>>>>
>>>> If we decide for some reason to wait until after 3.0 is released, I
>>>> don't know that there is much value in a 2.5. The purpose is to be a step
>>>> toward 3.0, and releasing that step after 3.0 doesn't seem helpful to me.
>>>> It also wouldn't get these features out any sooner than 3.0, as a 2.5
>>>> release probably would, given the work needed to validate the incompatible
>>>> changes in 3.0.
>>>>
>>>> > the DSv2 change is the major backward incompatibility that makes Spark
>>>> 2.x users hesitate to upgrade
>>>>
>>>> As I pointed out, DSv2 has been changing in the 2.x line, so this is
>>>> expected. I don't think it will need incompatible changes in the 3.x line.
>>>>
>>>> On Fri, Sep 20, 2019 at 9:25 PM Jungtaek Lim <ka...@gmail.com> wrote:
>>>>
>>>> Just 2 cents: I haven't tracked the changes to DSv2 (though I needed to
>>>> deal with them, as the changes caused confusion on my PRs...), but my bet is
>>>> that DSv2 has already changed in incompatible ways, at least for those who
>>>> work on custom DataSources. Making downstream projects diverge their
>>>> implementations heavily between minor versions (say, 2.4 vs 2.5) wouldn't be
>>>> a good experience - especially since we have not completely closed off the
>>>> chance to further modify DSv2, and such changes could be backward incompatible.
>>>>
>>>> If we really want to bring the DSv2 change to the 2.x version line, to let
>>>> end users enjoy the new DSv2 without being forced to upgrade to Spark 3.x,
>>>> I'd rather say preparation of Spark 2.5 should start after Spark 3.0 is
>>>> officially released - honestly, even later than that: after getting some
>>>> reports from Spark 3.0 users about DSv2, so that we feel DSv2 is OK. I hope
>>>> we don't make Spark 2.5 a kind of "tech-preview" that frustrates Spark 2.4
>>>> users when they upgrade to the next minor version.
>>>>
>>>> Btw, do we have any specific target users for this? Personally, I think the
>>>> DSv2 change is the major backward incompatibility that makes Spark 2.x users
>>>> hesitate to upgrade, so if they are prepared to migrate to the new DSv2, they
>>>> are probably already prepared to migrate to Spark 3.0.
>>>>
>>>> On Sat, Sep 21, 2019 at 12:46 PM Dongjoon Hyun <do...@gmail.com>
>>>> wrote:
>>>>
>>>> Do you mean you want to have a breaking API change between 3.0 and 3.1?
>>>> I believe we follow Semantic Versioning (
>>>> https://spark.apache.org/versioning-policy.html ).
>>>>
>>>> > We just won’t add any breaking changes before 3.1.
>>>>
>>>> Bests,
>>>> Dongjoon.
>>>>
>>>>
>>>> On Fri, Sep 20, 2019 at 11:48 AM Ryan Blue <rb...@netflix.com.invalid>
>>>> wrote:
>>>>
>>>> I don’t think we need to gate a 3.0 release on making a more stable
>>>> version of InternalRow
>>>>
>>>> Sounds like we agree, then. We will use it for 3.0, but there are known
>>>> problems with it.
>>>>
>>>> Thinking we’d have dsv2 working in both 3.x (which will change and
>>>> progress towards more stable, but will have to break certain APIs) and 2.x
>>>> seems like a false premise.
>>>>
>>>> Why do you think we will need to break certain APIs before 3.0?
>>>>
>>>> I’m only suggesting that we release the same support in a 2.5 release
>>>> that we do in 3.0. Since we are nearly finished with the 3.0 goals, it
>>>> seems like we can certainly do that. We just won’t add any breaking changes
>>>> before 3.1.
>>>>
>>>> On Fri, Sep 20, 2019 at 11:39 AM Reynold Xin <rx...@databricks.com>
>>>> wrote:
>>>>
>>>> I don't think we need to gate a 3.0 release on making a more stable
>>>> version of InternalRow, but thinking we'd have dsv2 working in both 3.x
>>>> (which will change and progress towards more stable, but will have to break
>>>> certain APIs) and 2.x seems like a false premise.
>>>>
>>>> To point out some problems with InternalRow, which you think is already
>>>> pragmatic and stable:
>>>>
>>>> The class is in catalyst, which states:
>>>> https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/package.scala
>>>>
>>>> /**
>>>> * Catalyst is a library for manipulating relational query plans.  All
>>>> classes in catalyst are
>>>> * considered an internal API to Spark SQL and are subject to change
>>>> between minor releases.
>>>> */
>>>>
>>>> There is not even an annotation on the interface.
>>>>
>>>> The entire dependency chain was created to be private, and tightly
>>>> coupled with internal implementations. For example,
>>>>
>>>>
>>>> https://github.com/apache/spark/blob/master/common/unsafe/src/main/java/org/apache/spark/unsafe/types/UTF8String.java
>>>>
>>>> /**
>>>> * A UTF-8 String for internal Spark use.
>>>> * <p>
>>>> * A String encoded in UTF-8 as an Array[Byte], which can be used for
>>>> comparison,
>>>> * search, see http://en.wikipedia.org/wiki/UTF-8 for details.
>>>> * <p>
>>>> * Note: This is not designed for general use cases, should not be used
>>>> outside SQL.
>>>> */
>>>>
>>>>
>>>>
>>>> https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/ArrayData.scala
>>>>
>>>> (which again is in catalyst package)
>>>>
>>>>
>>>> If you want to argue this way, you might as well argue we should make
>>>> the entire catalyst package public to be pragmatic and not allow any
>>>> changes.
>>>>
>>>>
>>>>
>>>>
>>>> On Fri, Sep 20, 2019 at 11:32 AM, Ryan Blue <rb...@netflix.com> wrote:
>>>>
>>>> When you created the PR to make InternalRow public
>>>>
>>>> This isn’t quite accurate. The change I made was to use InternalRow
>>>> instead of UnsafeRow, which is a specific implementation of InternalRow.
>>>> Exposing this API has always been a part of DSv2 and while both you and I
>>>> did some work to avoid this, we are still in the phase of starting with
>>>> that API.
>>>>
>>>> Note that any change to InternalRow would be very costly to implement
>>>> because this interface is widely used. That is why I think we can certainly
>>>> consider it stable enough to use here, and that’s probably why
>>>> UnsafeRow was part of the original proposal.
>>>>
>>>> In any case, the goal for 3.0 was not to replace the use of InternalRow,
>>>> it was to get the majority of SQL working on top of the interface added
>>>> after 2.4. That’s done and stable, so I think a 2.5 release with it is also
>>>> reasonable.
>>>>
>>>> On Fri, Sep 20, 2019 at 11:23 AM Reynold Xin <rx...@databricks.com>
>>>> wrote:
>>>>
>>>> To push back, while I agree we should not drastically change
>>>> "InternalRow", there are a lot of changes that need to happen to make it
>>>> stable. For example, none of the publicly exposed interfaces should be in
>>>> the Catalyst package or the unsafe package. External implementations should
>>>> be decoupled from the internal implementations, with cheap ways to convert
>>>> back and forth.
>>>>
>>>> When you created the PR to make InternalRow public, the understanding
>>>> was to work towards making it stable in the future, assuming we will start
>>>> with an unstable API temporarily. You can't just make a bunch of internal
>>>> APIs that are tightly coupled with other internal pieces public and stable
>>>> and call it a day, just because they happen to satisfy some use cases
>>>> temporarily, assuming the rest of Spark doesn't change.
>>>>
>>>>
>>>>
>>>> On Fri, Sep 20, 2019 at 11:19 AM, Ryan Blue <rb...@netflix.com> wrote:
>>>>
>>>> > DSv2 is far from stable right?
>>>>
>>>> No, I think it is reasonably stable and very close to being ready for a
>>>> release.
>>>>
>>>> > All the actual data types are unstable and you guys have completely
>>>> ignored that.
>>>>
>>>> I think what you're referring to is the use of `InternalRow`. That's a
>>>> stable API and there has been no work to avoid using it. In any case, I
>>>> don't think that anyone is suggesting that we delay 3.0 until a replacement
>>>> for `InternalRow` is added, right?
>>>>
>>>> While I understand the motivation for a better solution here, I think
>>>> the pragmatic solution is to continue using `InternalRow`.
>>>>
>>>> > If the goal is to make DSv2 work across 3.x and 2.x, that seems too
>>>> invasive of a change to backport once you consider the parts needed to make
>>>> dsv2 stable.
>>>>
>>>> I believe that those of us working on DSv2 are confident about the
>>>> current stability. We set goals for what to get into the 3.0 release months
>>>> ago and have very nearly reached the point where we are ready for that
>>>> release.
>>>>
>>>> I don't think instability would be a problem in maintaining
>>>> compatibility between the 2.5 version and the 3.0 version. If we find that
>>>> we need to make API changes (other than additions) then we can make those
>>>> in the 3.1 release. The goals we set for the 3.0 release have been
>>>> reached with the current API, so if we are ready to release 3.0, we can
>>>> release a 2.5 with the same API.
>>>>
>>>> On Fri, Sep 20, 2019 at 11:05 AM Reynold Xin <rx...@databricks.com>
>>>> wrote:
>>>>
>>>> DSv2 is far from stable right? All the actual data types are unstable
>>>> and you guys have completely ignored that. We'd need to work on that and
>>>> that will be a breaking change. If the goal is to make DSv2 work across 3.x
>>>> and 2.x, that seems too invasive of a change to backport once you consider
>>>> the parts needed to make dsv2 stable.
>>>>
>>>>
>>>>
>>>> On Fri, Sep 20, 2019 at 10:47 AM, Ryan Blue <rb...@netflix.com.invalid>
>>>> wrote:
>>>>
>>>> Hi everyone,
>>>>
>>>> In the DSv2 sync this week, we talked about a possible Spark 2.5
>>>> release based on the latest Spark 2.4, but with DSv2 and Java 11 support
>>>> added.
>>>>
>>>> A Spark 2.5 release with these two additions will help people migrate
>>>> to Spark 3.0 when it is released because they will be able to use a single
>>>> upgrading to 3.0 won't also require updating to Java 11 because users
>>>> upgrading to 3.0 won't also require also updating to Java 11 because users
>>>> could update to Java 11 with the 2.5 release and have fewer major changes.
>>>>
>>>> Another reason to consider a 2.5 release is that many people are
>>>> interested in a release with the latest DSv2 API and support for DSv2 SQL.
>>>> I'm already going to be backporting DSv2 support to the Spark 2.4 line, so
>>>> it makes sense to share this work with the community.
>>>>
>>>> This release line would just consist of backports like DSv2 and Java 11
>>>> that assist compatibility, to keep the scope of the release small. The
>>>> purpose is to assist people moving to 3.0 and not distract from the 3.0
>>>> release.
>>>>
>>>> Would a Spark 2.5 release help anyone else? Are there any concerns
>>>> about this plan?
>>>>
>>>>
>>>> rb
>>>>
>>>>
>>>> --
>>>> Ryan Blue
>>>> Software Engineer
>>>> Netflix
>>>>
>>>>
>>>>
>>>>
>>>> --
>>>> Ryan Blue
>>>> Software Engineer
>>>> Netflix
>>>>
>>>>
>>>>
>>>>
>>>> --
>>>> Ryan Blue
>>>> Software Engineer
>>>> Netflix
>>>>
>>>>
>>>>
>>>>
>>>> --
>>>> Ryan Blue
>>>> Software Engineer
>>>> Netflix
>>>>
>>>>
>>>>
>>>> --
>>>> Name : Jungtaek Lim
>>>> Blog : http://medium.com/@heartsavior
>>>> Twitter : http://twitter.com/heartsavior
>>>> LinkedIn : http://www.linkedin.com/in/heartsavior
>>>>
>>>>
>>>>
>>>> --
>>>> Ryan Blue
>>>> Software Engineer
>>>> Netflix
>>>>
>>>>
>>>>
>>>> --
>>>> Ryan Blue
>>>> Software Engineer
>>>> Netflix
>>>>
>>>
>>>
>>
>> --
>> Ryan Blue
>> Software Engineer
>> Netflix
>>
>>
>>

Re: [DISCUSS] Spark 2.5 release

Posted by Dongjoon Hyun <do...@gmail.com>.
+1 for Matei's suggestion!

Bests,
Dongjoon.

On Sat, Sep 21, 2019 at 5:44 PM Matei Zaharia <ma...@gmail.com>
wrote:

> If the goal is to get people to try the DSv2 API and build DSv2 data
> sources, can we recommend the 3.0-preview release for this? That would get
> people shifting to 3.0 faster, which is probably better overall compared to
> maintaining two major versions. There’s not that much else changing in 3.0
> if you already want to update your Java version.

Re: [DISCUSS] Spark 2.5 release

Posted by Matei Zaharia <ma...@gmail.com>.
If the goal is to get people to try the DSv2 API and build DSv2 data sources, can we recommend the 3.0-preview release for this? That would get people shifting to 3.0 faster, which is probably better overall compared to maintaining two major versions. There’s not that much else changing in 3.0 if you already want to update your Java version.


Re: [DISCUSS] Spark 2.5 release

Posted by Ryan Blue <rb...@netflix.com.INVALID>.
> If you insist we shouldn't change the unstable temporary API in 3.x . . .

Not what I'm saying at all. I said we should carefully consider whether a
breaking change is the right decision in the 3.x line.

All I'm suggesting is that we can make a 2.5 release with the feature and
an API that is the same as the one in 3.0.

> I also don't get this backporting a giant feature to 2.x line

I am planning to do this so we can use DSv2 before 3.0 is released. Then we
can have a source implementation that works in both 2.x and 3.0 to make the
transition easier. Since I'm already doing the work, I'm offering to share
it with the community.
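
To make the compatibility claim concrete, here is a minimal sketch of a
reader that would compile unchanged against both lines, assuming the 2.5
backport ships the same connector interfaces as 3.0 (the package names below
follow the 3.0 API; the class itself is invented for illustration):

    import org.apache.spark.sql.catalyst.InternalRow
    import org.apache.spark.sql.connector.read.PartitionReader

    // Emits exactly one row. Only the imported interfaces matter: if 2.5
    // and 3.0 expose the same PartitionReader, this compiles for both.
    class SingleRowReader(row: InternalRow)
        extends PartitionReader[InternalRow] {
      private var consumed = false
      override def next(): Boolean = { val more = !consumed; consumed = true; more }
      override def get(): InternalRow = row
      override def close(): Unit = ()
    }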


-- 
Ryan Blue
Software Engineer
Netflix

Re: [DISCUSS] Spark 2.5 release

Posted by Reynold Xin <rx...@databricks.com>.
Because, for example, we'd need to move the location of InternalRow, breaking the package name. If you insist we shouldn't change the unstable temporary API in 3.x to maintain compatibility with 3.0, which is totally different from my understanding of the situation when you exposed it, then I'd say we should gate 3.0 on having a stable row interface.

I also don't get this backporting of a giant feature to the 2.x line ... as suggested by others in the thread, DSv2 would be one of the main reasons people upgrade to 3.0. What's so special about DSv2 that we are doing this? Why not abandon 3.0 entirely and backport all the features to 2.x?
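
(A sketch of why a package move would be source-breaking for existing
sources; the second location below is invented purely for illustration and
does not exist:)

    // Every DSv2 implementation today imports the catalyst location:
    import org.apache.spark.sql.catalyst.InternalRow
    // A hypothetical move to a public, non-internal package, e.g.
    //   import org.apache.spark.sql.connector.data.InternalRow  // invented
    // would break that import in every existing source, even if the class
    // itself were otherwise unchanged.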

>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>> --
>>>>>>>>>> Ryan Blue
>>>>>>>>>> Software Engineer
>>>>>>>>>> Netflix
>>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>> 
>>>>>>>> 
>>>>>>>> 
>>>>>>>> 
>>>>>>>> --
>>>>>>>> Ryan Blue
>>>>>>>> Software Engineer
>>>>>>>> Netflix
>>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>> 
>>>>>> 
>>>>>> 
>>>>>> 
>>>>>> --
>>>>>> Ryan Blue
>>>>>> Software Engineer
>>>>>> Netflix
>>>>>> 
>>>>> 
>>>>> 
>>>> 
>>>> 
>>>> 
>>>> 
>>>> --
>>>> Name : Jungtaek Lim
>>>> Blog : http://medium.com/@heartsavior
>>>> Twitter : http://twitter.com/heartsavior
>>>> LinkedIn : http://www.linkedin.com/in/heartsavior
>>>> 
>>> 
>>> 
>>> 
>>> 
>>> --
>>> Ryan Blue
>>> Software Engineer
>>> Netflix
>>> 
>> 
>> 
> 
> 
> 
> 
> --
> Ryan Blue
> Software Engineer
> Netflix
>

Re: [DISCUSS] Spark 2.5 release

Posted by Ryan Blue <rb...@netflix.com.INVALID>.
Why would that require an incompatible change?

We *could* make an incompatible change and remove support for InternalRow,
but I think we would want to carefully consider whether that is the right
decision. And in any case, we would be able to keep 2.5 and 3.0 compatible,
which is the main goal.
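
For context, here is the surface in question: a minimal sketch of a DSv2
reader compiled against the interfaces on the current master (the package
and class names are my assumptions and have moved during development, so
treat this as illustrative rather than authoritative):

import org.apache.spark.sql.catalyst.InternalRow
import org.apache.spark.sql.connector.read.PartitionReader
import org.apache.spark.unsafe.types.UTF8String

// A reader that produces a single row. InternalRow, not UnsafeRow, is the
// exchange type at the DSv2 boundary, so a source hands back Spark's
// internal value representations (e.g. UTF8String for string columns).
class SingleValueReader(value: String) extends PartitionReader[InternalRow] {
  private var consumed = false

  override def next(): Boolean = {
    val hasNext = !consumed
    consumed = true
    hasNext
  }

  override def get(): InternalRow = InternalRow(UTF8String.fromString(value))

  override def close(): Unit = ()
}

Dropping InternalRow here would break every source built this way, which
is why I think the cost of that change would outweigh the benefit.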

On Sat, Sep 21, 2019 at 2:28 PM Reynold Xin <rx...@databricks.com> wrote:

> How would you not make incompatible changes in 3.x? As discussed, the
> InternalRow API is not stable and needs to change.

-- 
Ryan Blue
Software Engineer
Netflix

Re: [DISCUSS] Spark 2.5 release

Posted by Reynold Xin <rx...@databricks.com>.
How would you not make incompatible changes in 3.x? As discussed, the
InternalRow API is not stable and needs to change.
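
To make the coupling concrete (a sketch in spark-shell style, not code from
any proposal; the class names are real, but the usage is my illustration):
merely constructing one row ties an external source to the catalyst and
unsafe internals.

import org.apache.spark.sql.catalyst.InternalRow
import org.apache.spark.sql.catalyst.expressions.{GenericInternalRow, UnsafeProjection}
import org.apache.spark.sql.types.{IntegerType, StringType, StructField, StructType}
import org.apache.spark.unsafe.types.UTF8String

val schema = StructType(Seq(
  StructField("id", IntegerType),
  StructField("name", StringType)))

// String values must already be UTF8String (unsafe package), not
// java.lang.String.
val row: InternalRow =
  new GenericInternalRow(Array[Any](1, UTF8String.fromString("a")))

// UnsafeRow is just one InternalRow implementation; converting between
// the two is a projection over the whole row.
val toUnsafe = UnsafeProjection.create(schema)
val unsafeRow = toUnsafe(row)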


Re: [DISCUSS] Spark 2.5 release

Posted by Ryan Blue <rb...@netflix.com.INVALID>.
> Forcing downstream projects to diverge their implementations heavily
> between minor versions (say, 2.4 vs 2.5) wouldn't be a good experience

You're right that the API has been evolving in the 2.x line. But it is now
reasonably stable with respect to the current feature set, and we should
not need to break compatibility in the 3.x line. Because we have reached
our goals for the 3.0 release, we can backport at least those features to
2.x and confidently have an API that works in a 2.x release and is
compatible with 3.0, if not 3.1 and later releases as well.
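
Concretely, "works in both" means a connector keeps a single source tree
and only swaps the Spark dependency it builds against. A hypothetical sbt
fragment (the version numbers are placeholders, since neither release
exists yet):

// build.sbt: compile the same connector sources against either line
val sparkVersion = sys.props.getOrElse("spark.version", "3.0.0") // or "2.5.0"

libraryDependencies +=
  "org.apache.spark" %% "spark-sql" % sparkVersion % Provided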

> I'd rather say preparation of Spark 2.5 should start after Spark 3.0 is
> officially released

The reason I'm suggesting this is that I'm already going to do the work to
backport the 3.0 release features to 2.4. I've been asked by several people
when DSv2 will be released, so I know there is a lot of interest in making
this available sooner than 3.0. If I'm already doing the work, then I'd be
happy to share that with the community.

I don't see why 2.5 and 3.0 are mutually exclusive. We can work on 2.5
while preparing the 3.0 preview and fixing bugs. For DSv2, the work is
about complete so we can easily release the same set of features and API in
2.5 and 3.0.

If we decide for some reason to wait until after 3.0 is released, I don't
know that there is much value in a 2.5. The purpose is to be a step toward
3.0, and releasing that step after 3.0 doesn't seem helpful to me. It also
wouldn't get these features out any sooner than 3.0, as a 2.5 release
probably would, given the work needed to validate the incompatible changes
in 3.0.

> the DSv2 change would be the major backward incompatibility making Spark
> 2.x users hesitate to upgrade

As I pointed out, DSv2 has been changing in the 2.x line, so this is
expected. I don't think it will need incompatible changes in the 3.x line.



-- 
Ryan Blue
Software Engineer
Netflix

Re: [DISCUSS] Spark 2.5 release

Posted by Jungtaek Lim <ka...@gmail.com>.
Just my 2 cents: I haven't tracked the changes to DSv2 (though I've had to
deal with them, as the changes caused confusion on my PRs...), but my bet
is that DSv2 has already changed in incompatible ways, at least for anyone
who works on a custom DataSource. Forcing downstream projects to diverge
their implementations heavily between minor versions (say, 2.4 vs 2.5)
wouldn't be a good experience - especially since we haven't completely
closed off the chance to modify DSv2 further, and those changes could be
backward incompatible.

If we really want to bring the DSv2 changes to the 2.x line so that end
users aren't forced to upgrade to Spark 3.x to enjoy the new DSv2, I'd
rather say preparation of Spark 2.5 should start after Spark 3.0 is
officially released - honestly even later than that, say, after getting
some reports about DSv2 from Spark 3.0 users so that we feel DSv2 is OK. I
hope we don't make Spark 2.5 a kind of "tech preview" that Spark 2.4 users
would be frustrated to upgrade to as their next minor version.

Btw, do we have any specific target users for this? Personally, I think
the DSv2 change would be the major backward incompatibility making Spark
2.x users hesitate to upgrade, so anyone prepared to migrate to the new
DSv2 is probably already prepared to migrate to Spark 3.0.

On Sat, Sep 21, 2019 at 12:46 PM Dongjoon Hyun <do...@gmail.com>
wrote:

> Do you mean you want to have a breaking API change between 3.0 and 3.1?
> I believe we follow Semantic Versioning (
> https://spark.apache.org/versioning-policy.html ).
>
> > We just won’t add any breaking changes before 3.1.
>
> Bests,
> Dongjoon.

-- 
Name : Jungtaek Lim
Blog : http://medium.com/@heartsavior
Twitter : http://twitter.com/heartsavior
LinkedIn : http://www.linkedin.com/in/heartsavior

Re: [DISCUSS] Spark 2.5 release

Posted by Ryan Blue <rb...@netflix.com.INVALID>.
Thanks for pointing this out, Dongjoon.

To clarify, I’m not suggesting that we can break compatibility. I’m
suggesting that we make a 2.5 release that uses the same DSv2 API as 3.0.

These APIs are marked unstable, so we could make changes to them if we
needed to — as we have done in the 2.x line — but I don’t see a reason why
we would break compatibility in the 3.x line.
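
For reference, "marked unstable" refers to Spark's API stability
annotations. A sketch using the annotation classes from
org.apache.spark.annotation as they exist on master (the two traits are
made-up examples):

import org.apache.spark.annotation.{Evolving, Unstable}

// Exposed on purpose, but may change between minor releases.
@Evolving
trait SomeConnectorInterface

// Exposed with no compatibility guarantee at all.
@Unstable
trait SomeExperimentalInterface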

-- 
Ryan Blue
Software Engineer
Netflix

Re: [DISCUSS] Spark 2.5 release

Posted by Dongjoon Hyun <do...@gmail.com>.
Do you mean you want to have a breaking API change between 3.0 and 3.1?
I believe we follow Semantic Versioning (
https://spark.apache.org/versioning-policy.html ).

> We just won’t add any breaking changes before 3.1.

Bests,
Dongjoon.


Re: [DISCUSS] Spark 2.5 release

Posted by Ryan Blue <rb...@netflix.com.INVALID>.
> I don’t think we need to gate a 3.0 release on making a more stable version
> of InternalRow

Sounds like we agree, then. We will use it for 3.0, but there are known
problems with it.

> Thinking we’d have dsv2 working in both 3.x (which will change and progress
> towards more stable, but will have to break certain APIs) and 2.x seems
> like a false premise.

Why do you think we will need to break certain APIs before 3.0?

I’m only suggesting that we release the same support in a 2.5 release that
we do in 3.0. Since we are nearly finished with the 3.0 goals, it seems
like we can certainly do that. We just won’t add any breaking changes
before 3.1.

-- 
Ryan Blue
Software Engineer
Netflix

Re: [DISCUSS] Spark 2.5 release

Posted by Reynold Xin <rx...@databricks.com>.
I don't think we need to gate a 3.0 release on making a more stable version of InternalRow, but thinking we'd have dsv2 working in both 3.x (which will change and progress towards more stable, but will have to break certain APIs) and 2.x seems like a false premise.

To point out some problems with InternalRow, which you think is already pragmatic and stable:

The class is in catalyst, which states: https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/package.scala

/**
 * Catalyst is a library for manipulating relational query plans.  All classes in catalyst are
 * considered an internal API to Spark SQL and are subject to change between minor releases.
 */

There is not even an annotation on the interface.

The entire dependency chain was created to be private and tightly coupled with internal implementations. For example,

https://github.com/apache/spark/blob/master/common/unsafe/src/main/java/org/apache/spark/unsafe/types/UTF8String.java

/**
 * A UTF-8 String for internal Spark use.
 * <p>
 * A String encoded in UTF-8 as an Array[Byte], which can be used for comparison,
 * search, see http://en.wikipedia.org/wiki/UTF-8 for details.
 * <p>
 * Note: This is not designed for general use cases, should not be used outside SQL.
 */

https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/ArrayData.scala

(which again is in catalyst package)

If you want to argue this way, you might as well argue we should make the entire catalyst package public to be pragmatic and not allow any changes.
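
To make the coupling concrete, a minimal sketch (hypothetical schema and
values) of what populating a single InternalRow already requires; every type
involved lives in the catalyst or unsafe packages:

import org.apache.spark.sql.catalyst.InternalRow
import org.apache.spark.sql.catalyst.util.GenericArrayData
import org.apache.spark.unsafe.types.UTF8String

object CouplingSketch {
  // One row for a hypothetical schema (name: string, scores: array<int>):
  // even this pulls in catalyst (InternalRow, GenericArrayData) and
  // unsafe (UTF8String) types.
  val row: InternalRow = InternalRow(
    UTF8String.fromString("a"),
    new GenericArrayData(Array[Any](1, 2, 3)))
}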


Re: [DISCUSS] Spark 2.5 release

Posted by Ryan Blue <rb...@netflix.com.INVALID>.
> When you created the PR to make InternalRow public

This isn’t quite accurate. The change I made was to use InternalRow instead
of UnsafeRow, which is a specific implementation of InternalRow. Exposing
this API has always been a part of DSv2 and while both you and I did some
work to avoid this, we are still in the phase of starting with that API.

Note that any change to InternalRow would be very costly to implement
because this interface is widely used. That is why I think we can certainly
consider it stable enough to use here, and that’s probably why UnsafeRow
was part of the original proposal.

In any case, the goal for 3.0 was not to replace the use of InternalRow, it
was to get the majority of SQL working on top of the interface added after
2.4. That’s done and stable, so I think a 2.5 release with it is also
reasonable.
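
As a rough sketch of the distinction, here is what a reader can look like
against the post-2.4 interface (org.apache.spark.sql.connector.read); the
class and data are hypothetical. Any InternalRow implementation is
acceptable, so UnsafeRow is not required:

import org.apache.spark.sql.catalyst.InternalRow
import org.apache.spark.sql.connector.read.PartitionReader
import org.apache.spark.unsafe.types.UTF8String

// Hypothetical reader over an in-memory iterator. InternalRow.apply builds
// a GenericInternalRow, which is sufficient for the API; UnsafeRow is just
// one implementation among others.
class StringsReader(values: Iterator[String]) extends PartitionReader[InternalRow] {
  private var current: InternalRow = _

  override def next(): Boolean = {
    val hasNext = values.hasNext
    if (hasNext) {
      current = InternalRow(UTF8String.fromString(values.next()))
    }
    hasNext
  }

  override def get(): InternalRow = current

  override def close(): Unit = ()
}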


-- 
Ryan Blue
Software Engineer
Netflix

Re: [DISCUSS] Spark 2.5 release

Posted by Reynold Xin <rx...@databricks.com>.
To push back, while I agree we should not drastically change "InternalRow", there are a lot of changes that need to happen to make it stable. For example, none of the publicly exposed interfaces should be in the Catalyst package or the unsafe package. External implementations should be decoupled from the internal implementations, with cheap ways to convert back and forth.

When you created the PR to make InternalRow public, the understanding was to work towards making it stable in the future, assuming we will start with an unstable API temporarily. You can't just make a bunch of internal APIs that are tightly coupled with other internal pieces public and stable and call it a day, just because they happen to satisfy some use cases temporarily, assuming the rest of Spark doesn't change.
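
A purely hypothetical sketch of that decoupled shape (none of these names
exist in Spark): a public row abstraction outside the catalyst and unsafe
packages, with a converter at the boundary. A real version would be
schema-aware in both directions:

import org.apache.spark.sql.catalyst.InternalRow

// Hypothetical public-facing row, deliberately outside catalyst/unsafe.
trait PublicRow {
  def numFields: Int
  def get(ordinal: Int): Any  // values already in Spark's internal form
}

// Hypothetical boundary converter: external implementations code against
// PublicRow, and Spark converts at the edge.
object RowConverters {
  def toInternal(row: PublicRow): InternalRow =
    InternalRow.fromSeq((0 until row.numFields).map(row.get))
}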


Re: [DISCUSS] Spark 2.5 release

Posted by Ryan Blue <rb...@netflix.com.INVALID>.
> DSv2 is far from stable right?

No, I think it is reasonably stable and very close to being ready for a
release.

> All the actual data types are unstable and you guys have completely
ignored that.

I think what you're referring to is the use of `InternalRow`. That's a
stable API and there has been no work to avoid using it. In any case, I
don't think that anyone is suggesting that we delay 3.0 until a replacement
for `InternalRow` is added, right?

While I understand the motivation for a better solution here, I think the
pragmatic solution is to continue using `InternalRow`.
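
Concretely, coding against `InternalRow` today means positional, typed
accessors; a minimal sketch for a hypothetical (id: int, name: string)
schema:

import org.apache.spark.sql.catalyst.InternalRow

object InternalRowAccess {
  // The accessors are positional and typed, which is the contract
  // sources already rely on.
  def idAndName(row: InternalRow): (Int, String) =
    (row.getInt(0), row.getUTF8String(1).toString)
}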

> If the goal is to make DSv2 work across 3.x and 2.x, that seems too
invasive of a change to backport once you consider the parts needed to make
dsv2 stable.

I believe that those of us working on DSv2 are confident about the current
stability. We set goals for what to get into the 3.0 release months ago and
have very nearly reached the point where we are ready for that release.

I don't think instability would be a problem in maintaining compatibility
between the 2.5 version and the 3.0 version. If we find that we need to
make API changes (other than additions) then we can make those in the 3.1
release. The goals we set for the 3.0 release have been reached with the
current API, so if we are ready to release 3.0, we can release a 2.5 with
the same API.


-- 
Ryan Blue
Software Engineer
Netflix

Re: [DISCUSS] Spark 2.5 release

Posted by Reynold Xin <rx...@databricks.com>.
DSv2 is far from stable right? All the actual data types are unstable and you guys have completely ignored that. We'd need to work on that and that will be a breaking change. If the goal is to make DSv2 work across 3.x and 2.x, that seems too invasive of a change to backport once you consider the parts needed to make dsv2 stable.
