Posted to dev@spark.apache.org by Holden Karau <ho...@pigscanfly.ca> on 2020/06/12 21:37:49 UTC

Revisiting the idea of a Spark 2.5 transitional release

Hi Folks,

As we're getting closer to Spark 3 I'd like to revisit a Spark 2.5 release.
Spark 3 brings a number of important changes, and by its nature is not
backward compatible. I think we'd all like to have as smooth an upgrade
experience to Spark 3 as possible, and I believe that having a Spark 2
release with some of the new functionality while continuing to support the older
APIs and current Scala version would make the upgrade path smoother.

This pattern is not uncommon in other Hadoop ecosystem projects, like
Hadoop itself and HBase.

I know that Ryan Blue has indicated he is already going to be maintaining
something like that internally at Netflix, and we'll be doing the same
thing at Apple. It seems like having a transitional release could benefit
the community with easy migrations and help avoid duplicated work.

I want to be clear I'm volunteering to do the work of managing a 2.5
release, so hopefully, this wouldn't create any substantial burdens on the
community.

Cheers,

Holden
-- 
Twitter: https://twitter.com/holdenkarau
Books (Learning Spark, High Performance Spark, etc.):
https://amzn.to/2MaRAG9
YouTube Live Streams: https://www.youtube.com/user/holdenkarau

Re: Revisiting the idea of a Spark 2.5 transitional release

Posted by Sean Owen <sr...@gmail.com>.
These two goals are coupled, and in tension: we don't want to take much
change, but we do want changes that will unfortunately be somewhat breaking. A
2.5 release with these items would be different enough as to strain the general
level of compatibility implied by a minor release. Sure, it's not 'just' a
maintenance release, but de facto it becomes the maintenance branch of all
of 2.x, so it kind of is one. 2.4.x users then need to move to 2.5 too, as
eventually it's the only 2.x maintenance branch. Alternatively, you can
maintain 2.4.x and 2.5.x until 2.x is EOL, which does increase the complexity:
everything backported goes to two branches and has to work with both.

I don't know if there's a reason to cut 2.5.0 just on principle; it had
seemed pretty clear to me with 3.0 that 2.4.x was simply the last 2.x
release. We normally maintain version x and x+1, and will expand to
maintain 2.x + 3.0.x + 3.1.x soon. So it does depend on what would go in it.

One person's breaking change is another person's just-fine enhancement,
though. People wouldn't suggest it here unless they were in the latter
group (though are we all talking about the same two major items?).
What I don't know is how that looks across the wider user base. Obviously,
there are a few important votes in favor here. On the other hand, I haven't
heard of significant issues in updating to 3.0 during the preview releases,
which could suggest that users who want DSv2 et al. can just move to 3.0.

On the items: I don't know enough about DSv2 to say, but that seems like a
big change to backport.
On JDK11: I understand Java 8 is EOL w.r.t. Oracle, but OpenJDK 8 is still
being updated, and even Oracle supports it (for $). I have not perceived
this to be a significant issue inside or outside Spark, anecdotally.

Yes, this can also be where downstream vendors supply support for a
specialized hybrid build.

I'm not sure there's an objectively right call here, certainly without more
than anecdotal or personal perspective on the tradeoffs. It still seems
like the current plan is fine to me though, to leave these items in 3.0.

We can also wait-and-see. If after 3.0 is GA there is clearly wide demand
for a transitional release, that could change the calculation.


On Fri, Jun 12, 2020 at 11:40 PM DB Tsai <db...@dbtsai.com> wrote:

> +1 for a 2.x release with DSv2, JDK11, and Scala 2.11 support
>
> We had an internal preview version of Spark 3.0 for our customers to try
> out for a while, and then we realized that it's very challenging for
> enterprise applications in production to move to Spark 3.0. For example,
> many of our customers' Spark applications depend on some internal projects
> that may not be owned by ETL teams; it requires much coordination with
> other teams to cross-build the dependencies that Spark applications depend
> on with Scala 2.12 in order to use Spark 3.0. Now that we have removed the
> support of Scala 2.11 in Spark 3.0, there is a really big gap to migrate
> from the 2.x line to 3.0, based on my observation working with our customers.
>
> Also, JDK8 is already EOL; in some companies, using JDK8 is not supported
> by the infra team, and an exception is required to use an unsupported JDK. Of
> course, those companies can use a vendor's Spark distribution such as CDH
> Spark 2.4, which supports JDK11, or they can maintain their own Spark
> release, which is possible but not trivial.
>
> As a result, having a 2.5 release with DSv2, JDK11, and Scala 2.11 support
> can definitely narrow the gap, and users can still move forward using new
> features. After all, the reason we work on OSS is that we want people
> to use our code, isn't it?
>

Re: Revisiting the idea of a Spark 2.5 transitional release

Posted by DB Tsai <d_...@apple.com.INVALID>.
For example, JDK11 requires dependency changes which cannot go into 2.4.7. Recent Kubernetes development, such as Spark 3.0's support for dynamic allocation without the shuffle service, will also be hard to get into 2.4.7.
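
For context, the Spark 3.0 feature mentioned here enables dynamic allocation on Kubernetes by tracking shuffle files on executors instead of relying on an external shuffle service. A minimal sketch of the configuration involved, with a placeholder application name and illustrative executor bounds:

    import org.apache.spark.sql.SparkSession

    // Sketch: dynamic allocation on Kubernetes without an external shuffle
    // service, as added in Spark 3.0. Executors holding shuffle data are
    // tracked and kept alive rather than released immediately.
    val spark = SparkSession.builder()
      .appName("dynalloc-k8s-sketch") // placeholder name
      .config("spark.dynamicAllocation.enabled", "true")
      // Spark 3.0: track shuffle files so no external shuffle service is needed
      .config("spark.dynamicAllocation.shuffleTracking.enabled", "true")
      .config("spark.dynamicAllocation.minExecutors", "1")
      .config("spark.dynamicAllocation.maxExecutors", "10")
      .getOrCreate()

None of this machinery exists in the 2.4 line, which is the gap being described.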

Sent from my iPhone

> On Jun 12, 2020, at 11:50 PM, Reynold Xin <rx...@databricks.com> wrote:
> 
> 
> Echoing Sean's earlier comment … What is the functionality that would go into a 2.5.0 release, that can't be in a 2.4.7 release? 

Re: Revisiting the idea of a Spark 2.5 transitional release

Posted by Reynold Xin <rx...@databricks.com>.
Echoing Sean's earlier comment … What is the functionality that would go into a 2.5.0 release, that can't be in a 2.4.7 release?

On Fri, Jun 12, 2020 at 11:14 PM, Holden Karau <holden@pigscanfly.ca> wrote:

> 
> Can I suggest we maybe decouple this conversation a bit? First, see if there
> is agreement on making a transitional release in principle, and then
> folks who feel strongly about specific backports can have their respective
> discussions. It's not like we normally know or
> have agreement on everything going into a release at the time we cut the
> branch.
> 
> On Fri, Jun 12, 2020 at 10:28 PM Reynold Xin <rxin@databricks.com> wrote:
> 
> 
>> I understand the argument to add JDK 11 support just to extend the EOL,
>> but the other things seem kind of arbitrary and are not supported by your
>> arguments, especially DSv2, which is a massive change. DSv2 IIUC is not API
>> stable yet and will continue to evolve in the 3.x line.
>> 
>> 
>> Spark is designed in a way that’s decoupled from storage, and as a result
>> one can run multiple versions of Spark in parallel during migration.
>> 
> 
> At the job level sure, but upgrading large jobs, possibly written in Scala
> 2.11, whole-hog as it currently stands is not a small matter.
> 
> --
> Twitter: https://twitter.com/holdenkarau
>
> Books (Learning Spark, High Performance Spark, etc.): https://amzn.to/2MaRAG9
> YouTube Live Streams: https://www.youtube.com/user/holdenkarau
>

Re: Revisiting the idea of a Spark 2.5 transitional release

Posted by Holden Karau <ho...@pigscanfly.ca>.
Can I suggest we maybe decouple this conversation a bit? First, see if there
is agreement on making a transitional release in principle, and then folks
who feel strongly about specific backports can have their respective
discussions. It's not like we normally know or have agreement on everything
going into a release at the time we cut the branch.

On Fri, Jun 12, 2020 at 10:28 PM Reynold Xin <rx...@databricks.com> wrote:

> I understand the argument to add JDK 11 support just to extend the EOL,
> but the other things seem kind of arbitrary and are not supported by your
> arguments, especially DSv2, which is a massive change. DSv2 IIUC is not API
> stable yet and will continue to evolve in the 3.x line.
>
> Spark is designed in a way that’s decoupled from storage, and as a result
> one can run multiple versions of Spark in parallel during migration.
>
At the job level sure, but upgrading large jobs, possibly written in Scala
2.11, whole-hog as it currently stands is not a small matter.


-- 
Twitter: https://twitter.com/holdenkarau
Books (Learning Spark, High Performance Spark, etc.):
https://amzn.to/2MaRAG9
YouTube Live Streams: https://www.youtube.com/user/holdenkarau

Re: Revisiting the idea of a Spark 2.5 transitional release

Posted by Reynold Xin <rx...@databricks.com>.
I understand the argument to add JDK 11 support just to extend the EOL, but
the other things seem kind of arbitrary and are not supported by your
arguments, especially DSv2, which is a massive change. DSv2 IIUC is not API
stable yet and will continue to evolve in the 3.x line.

Spark is designed in a way that’s decoupled from storage, and as a result
one can run multiple versions of Spark in parallel during migration.
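
As an illustration of that decoupling, the same job logic can be compiled once per Spark version and run side by side against shared storage during a migration. A minimal sketch, where the storage path and column name are placeholders:

    import org.apache.spark.sql.SparkSession

    // Sketch: this job can be built against Spark 2.4 (Scala 2.11) and
    // Spark 3.0 (Scala 2.12) and submitted as two separate applications,
    // since both versions read and write the same storage layer.
    object MigrationCheckJob {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder().appName("migration-check").getOrCreate()
        val events = spark.read.parquet("s3://warehouse/events") // placeholder path
        events.groupBy("event_type").count()
          .write.mode("overwrite")
          // write per-version output so the two runs can be compared
          .parquet(s"s3://warehouse/counts-spark-${spark.version}")
      }
    }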

On Fri, Jun 12, 2020 at 9:40 PM DB Tsai <db...@dbtsai.com> wrote:

> +1 for a 2.x release with DSv2, JDK11, and Scala 2.11 support
>
> We had an internal preview version of Spark 3.0 for our customers to try
> out for a while, and then we realized that it's very challenging for
> enterprise applications in production to move to Spark 3.0. For example,
> many of our customers' Spark applications depend on some internal projects
> that may not be owned by ETL teams; it requires much coordination with
> other teams to cross-build the dependencies that Spark applications depend
> on with Scala 2.12 in order to use Spark 3.0. Now that we have removed the
> support of Scala 2.11 in Spark 3.0, there is a really big gap to migrate
> from the 2.x line to 3.0, based on my observation working with our customers.
>
> Also, JDK8 is already EOL; in some companies, using JDK8 is not supported
> by the infra team, and an exception is required to use an unsupported JDK. Of
> course, those companies can use a vendor's Spark distribution such as CDH
> Spark 2.4, which supports JDK11, or they can maintain their own Spark
> release, which is possible but not trivial.
>
> As a result, having a 2.5 release with DSv2, JDK11, and Scala 2.11 support
> can definitely narrow the gap, and users can still move forward using new
> features. After all, the reason we work on OSS is that we want people
> to use our code, isn't it?
>
> Sincerely,
>
> DB Tsai
> ----------------------------------------------------------
> Web: https://www.dbtsai.com
> PGP Key ID: 42E5B25A8F7A82C1
>

Re: Revisiting the idea of a Spark 2.5 transitional release

Posted by DB Tsai <db...@dbtsai.com>.
+1 for a 2.x release with DSv2, JDK11, and Scala 2.11 support

We had an internal preview version of Spark 3.0 for our customers to try
out for a while, and then we realized that it's very challenging for
enterprise applications in production to move to Spark 3.0. For example,
many of our customers' Spark applications depend on some internal projects
that may not be owned by ETL teams; it requires much coordination with
other teams to cross-build the dependencies that Spark applications depend
on with Scala 2.12 in order to use Spark 3.0. Now that we have removed the
support of Scala 2.11 in Spark 3.0, there is a really big gap to migrate
from the 2.x line to 3.0, based on my observation working with our customers.
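
The cross-building mentioned above is usually a build change in each internal library rather than in the Spark applications themselves. A minimal build.sbt sketch, where the organization, names, and versions are all placeholders:

    // Cross-publish an internal library for both the Scala 2.11 used by
    // Spark 2.4 applications and the Scala 2.12 required by Spark 3.0.
    organization := "com.example"  // placeholder
    name := "internal-etl-lib"     // placeholder
    scalaVersion := "2.12.10"
    crossScalaVersions := Seq("2.11.12", "2.12.10")

    // Depend on the Spark line matching the Scala binary version being built.
    libraryDependencies += "org.apache.spark" %% "spark-sql" % {
      if (scalaBinaryVersion.value == "2.11") "2.4.6" else "3.0.0"
    } % "provided"

Running `sbt +publishLocal` against such a build publishes artifacts for both Scala versions, so a Spark 2.4 job and a Spark 3.0 job can each resolve a matching build of the library; the coordination cost is getting every owning team to do this.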

Also, JDK8 is already EOL; in some companies, using JDK8 is not supported
by the infra team, and an exception is required to use an unsupported JDK. Of
course, those companies can use a vendor's Spark distribution such as CDH
Spark 2.4, which supports JDK11, or they can maintain their own Spark
release, which is possible but not trivial.

As a result, having a 2.5 release with DSv2, JDK11, and Scala 2.11 support
can definitely narrow the gap, and users can still move forward using new
features. After all, the reason we work on OSS is that we want people
to use our code, isn't it?

Sincerely,

DB Tsai
----------------------------------------------------------
Web: https://www.dbtsai.com
PGP Key ID: 42E5B25A8F7A82C1


On Fri, Jun 12, 2020 at 8:51 PM Jungtaek Lim <ka...@gmail.com>
wrote:

> I guess we already went through the same discussion, right? If anyone
> missed it, please go through the discussion thread. [1] The consensus looked
> to be against migrating the new DSv2 into the Spark 2.x version line,
> because the change is pretty huge, and also backward incompatible.
>
> What I can think of as the benefit of having Spark 2.5 is avoiding a forced
> upgrade to the major release to get fixes for critical bugs. Not all critical
> fixes were landed in 2.x, because some fixes bring backward
> incompatibility. We didn't land these fixes in the 2.x version line because
> we didn't consider having a Spark 2.5 before - we don't want to make end
> users tolerate the inconvenience when upgrading to a bugfix version. End
> users may be OK tolerating it when upgrading to a minor version, since they
> can still live with 2.4.x and decline these fixes.
>
> In addition, given there's a huge time gap between Spark 2.4 and 3.0, we
> might want to consider porting some features which don't bring backward
> incompatibility. New major features of Spark 3.0 would probably be better
> introduced in Spark 3.0, but some features could be ported, especially if
> the feature resolves a long-standing issue or has been provided for a long
> time in competing products.
>
> Thanks,
> Jungtaek Lim (HeartSaVioR)
>
> 1.
> http://apache-spark-developers-list.1001551.n3.nabble.com/DISCUSS-Spark-2-5-release-td27963.html#a27979

Re: Revisiting the idea of a Spark 2.5 transitional release

Posted by Jungtaek Lim <ka...@gmail.com>.
I guess we already went through the same discussion, right? If anyone
missed it, please go through the discussion thread. [1] The consensus looked
to be against migrating the new DSv2 into the Spark 2.x version line,
because the change is pretty huge, and also backward incompatible.

What I can think of as the benefit of having Spark 2.5 is avoiding a forced
upgrade to the major release to get fixes for critical bugs. Not all critical
fixes were landed in 2.x, because some fixes bring backward
incompatibility. We didn't land these fixes in the 2.x version line because
we didn't consider having a Spark 2.5 before - we don't want to make end users
tolerate the inconvenience when upgrading to a bugfix version. End users may
be OK tolerating it when upgrading to a minor version, since they can still
live with 2.4.x and decline these fixes.

In addition, given there's a huge time gap between Spark 2.4 and 3.0, we
might want to consider porting some features which don't bring backward
incompatibility. New major features of Spark 3.0 would probably be better
introduced in Spark 3.0, but some features could be ported, especially if
the feature resolves a long-standing issue or has been provided for a long
time in competing products.

Thanks,
Jungtaek Lim (HeartSaVioR)

1.
http://apache-spark-developers-list.1001551.n3.nabble.com/DISCUSS-Spark-2-5-release-td27963.html#a27979

On Sat, Jun 13, 2020 at 10:13 AM Ryan Blue <rb...@netflix.com.invalid>
wrote:

> +1 for a 2.x release with a DSv2 API that matches 3.0.
>
> There are a lot of big differences between the API in 2.4 and 3.0, and I
> think a release to help migrate would be beneficial to organizations like
> ours that will be supporting 2.x and 3.0 in parallel for quite a while.
> Migration to Spark 3 is going to take time as people build confidence in
> it. I don't think that can be avoided by leaving a larger feature gap
> between 2.x and 3.0.

Re: Revisiting the idea of a Spark 2.5 transitional release

Posted by Ryan Blue <rb...@netflix.com.INVALID>.
+1 for a 2.x release with a DSv2 API that matches 3.0.

There are a lot of big differences between the API in 2.4 and 3.0, and I
think a release to help migrate would be beneficial to organizations like
ours that will be supporting 2.x and 3.0 in parallel for quite a while.
Migration to Spark 3 is going to take time as people build confidence in
it. I don't think that can be avoided by leaving a larger feature gap
between 2.x and 3.0.
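
To give a sense of the scale of those differences, here is a rough, non-exhaustive sketch of the read-side entry points in the two API generations. The class names are placeholders, the interface signatures are from memory, and each half compiles only against its own Spark version:

    import org.apache.spark.sql.types.StructType

    // Spark 2.4: a source implements the DataSourceV2 marker plus mix-in
    // traits such as ReadSupport (org.apache.spark.sql.sources.v2).
    import org.apache.spark.sql.sources.v2.{DataSourceOptions, DataSourceV2, ReadSupport}
    import org.apache.spark.sql.sources.v2.reader.DataSourceReader

    class MySource24 extends DataSourceV2 with ReadSupport {
      override def createReader(options: DataSourceOptions): DataSourceReader = ???
    }

    // Spark 3.0: the entry point is a TableProvider that returns a Table,
    // and reads go through a ScanBuilder (org.apache.spark.sql.connector).
    import org.apache.spark.sql.connector.catalog.{Table, TableProvider}
    import org.apache.spark.sql.connector.expressions.Transform
    import org.apache.spark.sql.util.CaseInsensitiveStringMap

    class MySource30 extends TableProvider {
      override def inferSchema(options: CaseInsensitiveStringMap): StructType = ???
      override def getTable(
          schema: StructType,
          partitioning: Array[Transform],
          properties: java.util.Map[String, String]): Table = ???
    }

Even at this skeleton level the shapes differ, which is why a 2.x release exposing the 3.0-shaped API would make it easier to maintain one connector codebase across both lines.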

On Fri, Jun 12, 2020 at 5:53 PM Xiao Li <li...@databricks.com> wrote:

> Based on my understanding, DSV2 is not stable yet. It is still missing
> various features. Even our built-in file sources are still unable to fully
> migrate to DSV2. We plan to enhance it in the next few releases to close the gap.
>
> Also, the changes on DSV2 in Spark 3.0 did not break any existing
> application. We should encourage more users to try Spark 3 and increase the
> adoption of Spark 3.x.
>
> Xiao


-- 
Ryan Blue
Software Engineer
Netflix

Re: Revisiting the idea of a Spark 2.5 transitional release

Posted by Xiao Li <li...@databricks.com>.
Based on my understanding, DSv2 is not stable yet. It is still missing
various features; even our built-in file sources are unable to fully
migrate to DSv2. We plan to enhance it over the next few releases to
close the gap.

Also, the DSv2 changes in Spark 3.0 did not break any existing
applications. We should encourage more users to try Spark 3 and increase
adoption of Spark 3.x.
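
To illustrate that last point with a trivial, hypothetical example (the
path and column below are made up), end-user DataFrame code looks the
same on 2.4 and 3.0; the DSv2 rework is about the interfaces connector
authors implement, not the call sites users write:

    // Runs unchanged in a 2.4 or 3.0 spark-shell, where `spark` is
    // predefined. Path and column name are hypothetical.
    val df = spark.read.format("parquet").load("/tmp/events")
    df.select("date").show()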

Xiao

On Fri, Jun 12, 2020 at 5:36 PM Holden Karau <ho...@pigscanfly.ca> wrote:

> So, one of the things we're planning on backporting internally is DSv2,
> which I think would be more broadly useful if it were available in a
> community release on a 2.x branch. Anything else on top of that would be
> considered case by case, based on whether it makes for an easier upgrade
> path to 3.
>
> If we're worried about people using 2.5 as a long-term home, we could
> always mark it with “-transitional” or something similar?
>
> [remainder of the quoted thread trimmed; the earlier messages appear in
> full below]


-- 
<https://databricks.com/sparkaisummit/north-america>

Re: Revisiting the idea of a Spark 2.5 transitional release

Posted by Holden Karau <ho...@pigscanfly.ca>.
So, one of the things we're planning on backporting internally is DSv2,
which I think would be more broadly useful if it were available in a
community release on a 2.x branch. Anything else on top of that would be
considered case by case, based on whether it makes for an easier upgrade
path to 3.

If we're worried about people using 2.5 as a long-term home, we could
always mark it with “-transitional” or something similar?
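
Purely as an illustration (no such artifact exists; the coordinate below
is hypothetical), a downstream sbt build would just see the marker in the
version string:

    // Hypothetical coordinate; the "-transitional" suffix is the proposal
    // above, not a published artifact.
    libraryDependencies += "org.apache.spark" %% "spark-sql" % "2.5.0-transitional"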

On Fri, Jun 12, 2020 at 4:33 PM Sean Owen <sr...@gmail.com> wrote:

> What functionality would go into a 2.5.0 release that can't go into a
> 2.4.7 release? I think that's the key question. 2.4.x is the 2.x
> maintenance branch, and I personally could imagine being open to more
> freely backporting a few new features for 2.x users, whereas usually a
> maintenance branch gets only bug fixes. Making a 2.5.0 release implies
> that 2.5.x becomes the 2.x maintenance branch, but that it contains
> something too big for a 'normal' maintenance release, and I think the
> whole question turns on what that is.
>
> If it's things like JDK 11 support, I think that is unfortunately fairly
> 'breaking' because of the dependency updates it would require. But maybe
> that's not it.
>
> [remainder of the quoted thread trimmed]

--
Twitter: https://twitter.com/holdenkarau
Books (Learning Spark, High Performance Spark, etc.):
https://amzn.to/2MaRAG9  <https://amzn.to/2MaRAG9>
YouTube Live Streams: https://www.youtube.com/user/holdenkarau

Re: Revisiting the idea of a Spark 2.5 transitional release

Posted by Sean Owen <sr...@gmail.com>.
What functionality would go into a 2.5.0 release that can't go into a
2.4.7 release? I think that's the key question. 2.4.x is the 2.x
maintenance branch, and I personally could imagine being open to more
freely backporting a few new features for 2.x users, whereas usually a
maintenance branch gets only bug fixes. Making a 2.5.0 release implies
that 2.5.x becomes the 2.x maintenance branch, but that it contains
something too big for a 'normal' maintenance release, and I think the
whole question turns on what that is.

If it's things like JDK 11 support, I think that is unfortunately fairly
'breaking' because of the dependency updates it would require. But maybe
that's not it.
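
For context, and as a sketch rather than Spark's actual build settings:
targeting JDK 8 bytecode while compiling on JDK 11 is the easy half and
is just compiler flags; the 'breaking' half is the dependency upgrades
that those flags do nothing about.

    // sbt sketch: emit JDK 8-compatible bytecode even when building on
    // JDK 11, keeping one artifact runnable on both JVMs. This does not
    // address the dependency upgrades JDK 11 support actually requires.
    javacOptions ++= Seq("-source", "1.8", "-target", "1.8")
    scalacOptions += "-target:jvm-1.8"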


On Fri, Jun 12, 2020 at 4:38 PM Holden Karau <ho...@pigscanfly.ca> wrote:

> [Holden's original message trimmed; it appears in full at the top of
> this thread]

Re: Revisiting the idea of a Spark 2.5 transitional release

Posted by Xiao Li <li...@databricks.com>.
Which new functionality are you referring to? In Spark SQL, most of the
major features in Spark 3.0, such as adaptive query execution, would be
difficult and time-consuming to backport. Releasing a new version is not
hard, but backporting, reviewing, and maintaining these features is very
time-consuming.
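
To be concrete about the user-facing side (a sketch with a hypothetical
local session): enabling adaptive query execution in 3.0 is a single
config, but the planner machinery behind that flag is what a backport
would have to carry over.

    // Spark 3.0: turning on adaptive query execution. One config on the
    // surface, substantial planner changes underneath.
    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("aqe-sketch")
      .config("spark.sql.adaptive.enabled", "true")
      .getOrCreate()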

Which old APIs are broken? If the impact is big, we should add them back,
per our earlier discussion:
http://apache-spark-developers-list.1001551.n3.nabble.com/Proposal-Modification-to-Spark-s-Semantic-Versioning-Policy-td28938.html

Thanks,

Xiao


On Fri, Jun 12, 2020 at 2:38 PM Holden Karau <ho...@pigscanfly.ca> wrote:

> [Holden's original message trimmed; it appears in full at the top of
> this thread]


-- 
<https://databricks.com/sparkaisummit/north-america>