Posted to dev@spark.apache.org by Michael Armbrust <mi...@databricks.com> on 2020/02/24 23:02:56 UTC

[Proposal] Modification to Spark's Semantic Versioning Policy

Hello Everyone,

As more users have started upgrading to Spark 3.0 preview (including
myself), there have been many discussions around APIs that have been broken
compared with Spark 2.x. In many of these discussions, one of the
rationales for breaking an API seems to be "Spark follows semantic
versioning <https://spark.apache.org/versioning-policy.html>, so this major
release is our chance to get it right [by breaking APIs]". Similarly, in
many cases the response to questions about why an API was completely
removed has been, "this API has been deprecated since x.x, so we have to
remove it".

As a long-time contributor to and user of Spark, I find this interpretation
of the policy concerning. This reasoning misses the intention of the
original policy, and I am worried that it will hurt the long-term success
of the project.

I definitely understand that these are hard decisions, and I'm not
proposing that we never remove anything from Spark. However, I would like
to give some additional context and also propose a different rubric for
thinking about API breakage moving forward.

Spark adopted semantic versioning back in 2014 during the preparations for
the 1.0 release. As this was the first major release -- and as, up until
fairly recently, Spark had only been an academic project -- no real
promises about API stability had ever been made.

During the discussion, some committers suggested that this was an
opportunity to clean up cruft and give the Spark APIs a once-over, making
cosmetic changes to improve consistency. However, in the end, it was
decided that in many cases it was not in the best interests of the Spark
community to break things just because we could. Matei actually said it
pretty forcefully
<http://apache-spark-developers-list.1001551.n3.nabble.com/Proposal-for-Spark-Release-Strategy-td464i20.html#a503>:

I know that some names are suboptimal, but I absolutely detest breaking
APIs, config names, etc. I’ve seen it happen way too often in other
projects (even things we depend on that are officially post-1.0, like Akka
or Protobuf or Hadoop), and it’s very painful. I think that we as fairly
cutting-edge users are okay with libraries occasionally changing, but many
others will consider it a show-stopper. Given this, I think that any
cosmetic change now, even though it might improve clarity slightly, is not
worth the tradeoff in terms of creating an update barrier for existing
users.

In the end, while some changes were made, most APIs remained the same and
users of Spark <= 0.9 were pretty easily able to upgrade to 1.0. I think
this served the project very well, as compatibility means users are able to
upgrade and we keep as many people on the latest versions of Spark (though
maybe not the latest APIs of Spark) as possible.

As Spark grows, I think compatibility actually becomes more important and
we should be more conservative rather than less. Today, there are very
likely more Spark programs running than there were at any other time in the
past. Spark is no longer a tool used only by advanced hackers; it is now
also running "traditional enterprise workloads." In many cases these jobs
are powering important processes long after the original author has left.

Broken APIs can also affect libraries that extend Spark. This can be even
harder for users: if a library they need has not been upgraded to use the
new APIs, they are stuck.

Given all of this, I'd like to propose the following rubric as an addition
to our semantic versioning policy. After discussion and if people agree
this is a good idea, I'll call a vote of the PMC to ratify its inclusion in
the official policy.

Considerations When Breaking APIs

The Spark project strives to avoid breaking APIs or silently changing
behavior, even at major versions. While this is not always possible, the
balance of the following factors should be considered before choosing to
break an API.

Cost of Breaking an API

Breaking an API almost always has a non-trivial cost to the users of Spark.
A broken API means that Spark programs need to be rewritten before they can
be upgraded. However, there are a few considerations when thinking about
what the cost will be:

   - Usage - an API that is actively used in many different places is always
     very costly to break. While it is hard to know usage for sure, there are
     a bunch of ways that we can estimate:
      - How long has the API been in Spark?
      - Is the API common even for basic programs?
      - How often do we see recent questions in JIRA or mailing lists?
      - How often does it appear in StackOverflow or blogs?
   - Behavior after the break - How will a program that works today work
     after the break? The following are listed roughly in order of increasing
     severity:
      - Will there be a compiler or linker error?
      - Will there be a runtime exception?
      - Will that exception happen after significant processing has been done?
      - Will we silently return different answers? (very hard to debug, might
        not even notice!)


Cost of Maintaining an API

Of course, the above does not mean that we will never break any APIs. We
must also consider the cost both to the project and to our users of keeping
the API in question.

   - Project Costs - Every API we have needs to be tested and needs to keep
     working as other parts of the project change. These costs are
     significantly exacerbated when external dependencies change (the JVM,
     Scala, etc). In some cases, while not completely technically infeasible,
     the cost of maintaining a particular API can become too high.
   - User Costs - APIs also have a cognitive cost to users learning Spark or
     trying to understand Spark programs. This cost becomes even higher when
     the API in question has confusing or undefined semantics.


Alternatives to Breaking an API

In cases where there is a "Bad API", but where the cost of removal is also
high, there are alternatives that should be considered that do not hurt
existing users but do address some of the maintenance costs.


   - Avoid Bad APIs - While this is a bit obvious, it is an important point.
     Anytime we are adding a new interface to Spark we should consider that
     we might be stuck with this API forever. Think deeply about how new APIs
     relate to existing ones, as well as how you expect them to evolve over
     time.
   - Deprecation Warnings - All deprecation warnings should point to a clear
     alternative and should never just say that an API is deprecated (see the
     sketch after this list).
   - Updated Docs - Documentation should point to the "best" recommended way
     of performing a given task. In the cases where we maintain legacy
     documentation, we should clearly point to newer APIs and suggest to
     users the "right" way.
   - Community Work - Many people learn Spark by reading blogs and other
     sites such as StackOverflow. However, many of these resources are out of
     date. Update them to reduce the cost of eventually removing deprecated
     APIs.
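
To make the deprecation-warning point concrete, here is a rough Scala
sketch (illustrative only; TableApi, the wording, and the version string
are made up for this example and are not Spark source):

    object TableApi {
      def createOrReplaceTempView(name: String): Unit =
        println(s"registered temp view '$name'")

      // Bad: only says the API is deprecated; gives the user nothing actionable.
      //   @deprecated("registerTempTable is deprecated.", "2.0.0")

      // Better: names a clear replacement and keeps working as a thin alias.
      @deprecated("Use createOrReplaceTempView(viewName) instead.", "2.0.0")
      def registerTempTable(name: String): Unit = createOrReplaceTempView(name)
    }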


Examples

Here are some examples of how I think the policy above could be applied to
different issues that have been discussed recently. These are only to
illustrate how to apply the above rubric, but are not intended to be part
of the official policy.

[SPARK-26362] Remove 'spark.driver.allowMultipleContexts' to disallow
multiple creation of SparkContexts #23311
<https://github.com/apache/spark/pull/23311>


   - Cost to Break - Multiple Contexts in a single JVM never worked properly.
     When users tried it they would nearly always report that Spark was
     broken (SPARK-2243 <https://issues.apache.org/jira/browse/SPARK-2243>),
     due to the confusing set of log messages. Given this, I think it is very
     unlikely that there are many real world use cases active today. Even
     those cases likely suffer from undiagnosed issues, as there are many
     areas of Spark that assume a single context per JVM.
   - Cost to Maintain - We have recently had users ask on the mailing list
     whether this was supported, as the conf led them to believe it was, and
     the existence of this configuration as "supported" makes it harder to
     reason about certain global state in SparkContext. (A sketch of the old
     behavior follows the decision below.)


Decision: Remove this configuration and related code.
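
For context, roughly what this conf looked like from a user's point of view
on a 2.x classpath (a spark-shell-style sketch from memory, not quoted from
the actual source; the exception text and log behavior may differ by
version):

    import org.apache.spark.{SparkConf, SparkContext}

    // With the default setting, a second SparkContext in the same JVM fails fast.
    val first = new SparkContext(
      new SparkConf().setMaster("local[2]").setAppName("first"))
    // new SparkContext(new SparkConf().setMaster("local[2]").setAppName("second"))
    //   ^ throws, telling the user only one SparkContext may run per JVM

    // Setting the conf only silences that check; it does not make a second
    // context reliable, since many parts of Spark assume one context per JVM.
    val permissive = new SparkConf()
      .setMaster("local[2]")
      .setAppName("second")
      .set("spark.driver.allowMultipleContexts", "true")
    // val second = new SparkContext(permissive)  // "works", with confusing logs
    first.stop()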

[SPARK-25908] Remove registerTempTable #22921
<https://github.com/apache/spark/pull/22921/> (only looking at one API of
this PR)


   - Cost to Break - This is a wildly popular API of Spark SQL that has been
     there since the first release. There are tons of blog posts and examples
     that use this syntax if you google "dataframe registerTempTable
     <https://www.google.com/search?q=dataframe+registertemptable&rlz=1C5CHFA_enUS746US746&oq=dataframe+registertemptable&aqs=chrome.0.0l8.3040j1j7&sourceid=chrome&ie=UTF-8>"
     (even more than the "correct" API "dataframe createOrReplaceTempView
     <https://www.google.com/search?rlz=1C5CHFA_enUS746US746&ei=TkZMXrj1ObzA0PEPpLKR2A4&q=dataframe+createorreplacetempview&oq=dataframe+createor&gs_l=psy-ab.3.0.0j0i22i30l7.663.1303..2750...0.3..1.212.782.7j0j1......0....1..gws-wiz.......0i71j0i131.zP34wH1novM>").
     All of these will be invalid for users of Spark 3.0.
   - Cost to Maintain - This is just an alias (see the sketch below), so
     there is not a lot of extra machinery required to keep the API. Users
     have two ways to do the same thing, but we can note in the docs that
     this is just an alias.


Decision: Do not remove this API; I would even consider un-deprecating it.
I anecdotally asked several users, and this is the API they prefer over the
"correct" one.

[SPARK-25496] Deprecate from_utc_timestamp and to_utc_timestamp #24195
<https://github.com/apache/spark/pull/24195>

   - Cost to Break - I think that this case actually exemplifies several
     anti-patterns in breaking APIs. In some languages, the deprecation
     warning gives you no help other than what version the function was
     removed in. In R, it points users to a really deep conversation on the
     semantics of time in Spark SQL. None of the messages tell you how you
     should correctly be parsing a timestamp that is given to you in a format
     other than UTC. My guess is all users will blindly flip the flag to true
     (to keep using this function), so you've only succeeded in annoying
     them.
   - Cost to Maintain - These are two relatively isolated expressions, so
     there should be little cost to keeping them. Users can be confused by
     their semantics, so we probably should update the docs to point them to
     a best practice (I learned only by complaining on the PR that a good
     practice is to parse timestamps including the timezone in the format
     expression, which naturally shifts them to UTC; see the sketch below).


Decision: Do not deprecate these two functions. We should update the docs
to talk about best practices for parsing timestamps, including how to
correctly shift them to UTC for storage.
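
A spark-shell-style sketch of that best practice (illustrative; the exact
pattern letters accepted depend on the Spark version's datetime parser):

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.functions.{col, to_timestamp}

    val spark = SparkSession.builder().master("local[2]").appName("ts-demo").getOrCreate()
    import spark.implicits._

    val df = Seq("2020-02-24 15:30:00-08:00").toDF("raw")

    // Keeping the zone offset in the format expression means parsing itself
    // normalizes the value to an instant (stored as UTC), so no separate
    // from_utc_timestamp/to_utc_timestamp shifting step is needed.
    df.select(to_timestamp(col("raw"), "yyyy-MM-dd HH:mm:ssXXX").as("ts")).show(false)
    spark.stop()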

[SPARK-28093] Fix TRIM/LTRIM/RTRIM function parameter order issue #24902
<https://github.com/apache/spark/pull/24902>


   - Cost to Break - The TRIM function takes two string parameters. If we
     switch the parameter order, queries that use the TRIM function would
     silently get different results on different versions of Spark. Users may
     not notice it for a long time, and wrong query results may cause serious
     problems for users.
   - Cost to Maintain - We will have some inconsistency inside Spark, as the
     TRIM function in the Scala API and in SQL will have different parameter
     orders.


Decision: Do not switch the parameter order. Promote the TRIM(trimStr FROM
srcStr) syntax in our SQL docs (see the sketch below), as it's the SQL
standard. Deprecate (with a warning, not by removing) the SQL TRIM function
and move users to the SQL standard TRIM syntax.

Thanks for taking the time to read this! Happy to discuss the specifics and
amend this policy as the community sees fit.

Michael

Re: [Proposal] Modification to Spark's Semantic Versioning Policy

Posted by Sean Owen <sr...@apache.org>.
Those are all quite reasonable guidelines and I'd put them into the
contributing or developer guide, sure.
Although not argued here, I think we should go further than codifying
and enforcing common-sense guidelines like these. I think bias should
shift in favor of retaining APIs going forward, and even retroactively
shift for 3.0 somewhat. (Hence some reverts currently in progress.)
It's a natural evolution from 1.x to 2.x to 3.x: the API surface area
stops expanding, changing, and getting fixed as much, and years more of
experience prove out which APIs make sense.


Re: [Proposal] Modification to Spark's Semantic Versioning Policy

Posted by Jungtaek Lim <ka...@gmail.com>.
Xiao, thanks for the proposal and willingness to lead the effort!

I feel that it's still a bit different from what I've proposed. What I'm
proposing is closer to enforcing discussion whenever a change adds a new
public API or brings a breaking change. It's good that we added the section
"Does this PR introduce any user-facing change?" to the PR template (I'm
not 100% sure it's being used as intended), but it doesn't enforce
anything; PRs containing breaking changes are reviewed and merged the same
as other PRs, no difference. Technically such a PR can be merged in a
couple of hours, reviewed by only one committer, which doesn't seem to be
enough to decide it's good to go, IMHO.

I believe a regular digest would be one step forward, as someone could
notice a change and jump in with a post-hoc review. One thing I'm a bit
afraid of with post-hoc review is that it's not easy to raise concerns
about already-merged things, especially if we have to revert. It makes both
sides defensive: people hesitate to do post-review, and authors try to
defend the change they already made. I'm a big +1 on taking that step
further, but given we are revisiting the policy, it would be nice if we
revisit the policy for changes to public APIs as well.

On Mon, Mar 9, 2020 at 2:39 PM Dongjoon Hyun <do...@gmail.com>
wrote:

> Thank you all. Especially, the Audit efforts.
>
> Until now, the whole community has been working together in the same
> direction with the existing policy. It is always good.
>
> Since it seems that we are considering to have a new direction, I created
> an umbrella JIRA to track all activities.
>
>       https://issues.apache.org/jira/browse/SPARK-31085
>       Amend Spark's Semantic Versioning Policy
>
> As we know, the community-wide directional change always has a huge impact
> on daily PR reviews and regular releases. So, we had better consider all
> the reverting PRs as a normal independent PR instead of the follow-ups.
> Specifically, I believe we need the following.
>
>     1. Have new JIRA IDs instead of considering a simple revert or
> follow-up.
>         It's because we are not adding everything back blindly. For
> example,
>             https://issues.apache.org/jira/browse/SPARK-31089
>             "Add back ImageSchema.readImages in Spark 3.0"
>         is created and closed as 'Won't Do' with consideration between the
> trade-off.
>         We need to have a JIRA-issue-level history for this kind of
> request and the decision.
>
>     2. Sometime, as described by Michael, reverting is insufficient.
>         We need to provide a more fine-grained deprecation for users'
> safety case by case.
>
>     3. Given the timeline, newly added API should have a test coverage in
> the same PR from the beginning.
>         This is required because the whole reverting efforts aim to give a
> working API back.
>
> I believe that we have a good discussion in this thread.
> We are making a big change in Apache Spark history.
> Please be part of the history by taking actions like replying, voting, and
> reviewing.
>
> Thanks,
> Dongjoon.
>
>
> On Sat, Mar 7, 2020 at 11:20 PM Takeshi Yamamuro <li...@gmail.com>
> wrote:
>
>> Yea, +1 on Jungtaek's suggestion; having the same strict policy for
>> adding new APIs looks nice.
>>
>> > When we making the API changes (e.g., adding the new APIs or changing
>> the existing APIs), we should regularly publish them in the dev list. I am
>> willing to lead this effort, work with my colleagues to summarize all the
>> merged commits [especially the API changes], and then send the *bi-weekly
>> digest *to the dev list
>>
>> This digest looks very helpful for the community, thanks, Xiao!
>>
>> Bests,
>> Takeshi
>>
>> On Sun, Mar 8, 2020 at 12:05 PM Xiao Li <ga...@gmail.com> wrote:
>>
>>> I want to thank you *Ruifeng Zheng* publicly for his work that lists
>>> all the signature differences of Core, SQL and Hive we made in this
>>> upcoming release. For details, please read the files attached in
>>> SPARK-30982 <https://issues.apache.org/jira/browse/SPARK-30982>. I went
>>> over these files and submitted the following PRs to add back the SparkSQL
>>> APIs whose maintenance costs are low based on my own experiences in
>>> SparkSQL development:
>>>
>>>    - https://github.com/apache/spark/pull/27821
>>>    - functions.toDegrees/toRadians
>>>       - functions.approxCountDistinct
>>>       - functions.monotonicallyIncreasingId
>>>       - Column.!==
>>>       - Dataset.explode
>>>       - Dataset.registerTempTable
>>>       - SQLContext.getOrCreate, setActive, clearActive, constructors
>>>    - https://github.com/apache/spark/pull/27815
>>>       - HiveContext
>>>       - createExternalTable APIs
>>>    -
>>>    - https://github.com/apache/spark/pull/27839
>>>       - SQLContext.applySchema
>>>       - SQLContext.parquetFile
>>>       - SQLContext.jsonFile
>>>       - SQLContext.jsonRDD
>>>       - SQLContext.load
>>>       - SQLContext.jdbc
>>>
>>> If you think these APIs should not be added back, let me know and we can
>>> discuss the items further. In general, I think we should provide more
>>> evidences and discuss them publicly when we dropping these APIs at the
>>> beginning.
>>>
>>> +1 on Jungtaek's comments. When we making the API changes (e.g., adding
>>> the new APIs or changing the existing APIs), we should regularly publish
>>> them in the dev list. I am willing to lead this effort, work with my
>>> colleagues to summarize all the merged commits [especially the API
>>> changes], and then send the *bi-weekly digest *to the dev list. If you
>>> are willing to join this working group and help build these digests, feel
>>> free to send me a note [lixiao@databricks.com].
>>>
>>> Cheers,
>>>
>>> Xiao
>>>
>>>
>>>
>>>
>>> Jungtaek Lim <ka...@gmail.com> 于2020年3月7日周六 下午4:50写道:
>>>
>>>> +1 for Sean as well.
>>>>
>>>> Moreover, as I added a voice on previous thread, if we want to be
>>>> strict with retaining public API, what we really need to do along with this
>>>> is having similar level or stricter of policy for adding public API. If we
>>>> don't apply the policy symmetrically, problems would go worse as it's still
>>>> not that hard to add public API (only require normal review) but once the
>>>> API is added and released it's going to be really hard to remove it.
>>>>
>>>> If we consider adding public API and deprecating/removing public API as
>>>> "critical" one for the project, IMHO, it would give better visibility and
>>>> open discussion if we make it going through dev@ mailing list instead
>>>> of directly filing a PR. As there're so many PRs being submitted it's
>>>> nearly impossible to look into all of PRs - it may require us to "watch"
>>>> the repo and have tons of mails. Compared to the popularity on Github PRs,
>>>> dev@ mailing list is not that crowded so less chance of missing the
>>>> critical changes, and not quickly decided by only a couple of committers.
>>>>
>>>> These suggestions would slow down the developments - that would make us
>>>> realize we may want to "classify/mark" user facing public APIs and others
>>>> (just exposed as public) and only apply all the policies to former. For
>>>> latter we don't need to guarantee anything.
>>>>
>>>>
>>>> On Sun, Mar 8, 2020 at 4:31 AM Dongjoon Hyun <do...@gmail.com>
>>>> wrote:
>>>>
>>>>> +1 for Sean's concerns and questions.
>>>>>
>>>>> Bests,
>>>>> Dongjoon.
>>>>>
>>>>> On Fri, Mar 6, 2020 at 3:14 PM Sean Owen <sr...@gmail.com> wrote:
>>>>>
>>>>>> This thread established some good general principles, illustrated by
>>>>>> a few good examples. It didn't draw specific conclusions about what to add
>>>>>> back, which is why it wasn't at all controversial. What it means in
>>>>>> specific cases is where there may be disagreement, and that harder question
>>>>>> hasn't been addressed.
>>>>>>
>>>>>> The reverts I have seen so far seemed like the obvious one, but yes,
>>>>>> there are several more going on now, some pretty broad. I am not even sure
>>>>>> what all of them are. In addition to below,
>>>>>> https://github.com/apache/spark/pull/27839. Would it be too much
>>>>>> overhead to post to this thread any changes that one believes are endorsed
>>>>>> by these principles and perhaps a more strict interpretation of them now?
>>>>>> It's important enough we should get any data points or input, and now.
>>>>>> (We're obviously not going to debate each one.) A draft PR, or several,
>>>>>> actually sounds like a good vehicle for that -- as long as people know
>>>>>> about them!
>>>>>>
>>>>>> Also, is there any usage data available to share? many arguments turn
>>>>>> around 'commonly used' but can we know that more concretely?
>>>>>>
>>>>>> Otherwise I think we'll back into implementing personal
>>>>>> interpretations of general principles, which is arguably the issue in the
>>>>>> first place, even when everyone believes in good faith in the same
>>>>>> principles.
>>>>>>
>>>>>>
>>>>>>
>>>>>> On Fri, Mar 6, 2020 at 1:08 PM Dongjoon Hyun <do...@gmail.com>
>>>>>> wrote:
>>>>>>
>>>>>>> Hi, All.
>>>>>>>
>>>>>>> Recently, reverting PRs seems to start to spread like the
>>>>>>> *well-known* virus.
>>>>>>> Can we finalize this first before doing unofficial personal
>>>>>>> decisions?
>>>>>>> Technically, this thread was not a vote and our website doesn't have
>>>>>>> a clear policy yet.
>>>>>>>
>>>>>>> https://github.com/apache/spark/pull/27821
>>>>>>> [SPARK-25908][SQL][FOLLOW-UP] Add Back Multiple Removed APIs
>>>>>>>     ==> This technically revert most of the SPARK-25908.
>>>>>>>
>>>>>>> https://github.com/apache/spark/pull/27835
>>>>>>> Revert "[SPARK-25457][SQL] IntegralDivide returns data type of the
>>>>>>> operands"
>>>>>>>
>>>>>>> https://github.com/apache/spark/pull/27834
>>>>>>> Revert [SPARK-24640][SQL] Return `NULL` from `size(NULL)` by default
>>>>>>>
>>>>>>> Bests,
>>>>>>> Dongjoon.
>>>>>>>
>>>>>>> On Thu, Mar 5, 2020 at 9:08 PM Dongjoon Hyun <
>>>>>>> dongjoon.hyun@gmail.com> wrote:
>>>>>>>
>>>>>>>> Hi, All.
>>>>>>>>
>>>>>>>> There is a on-going Xiao's PR referencing this email.
>>>>>>>>
>>>>>>>> https://github.com/apache/spark/pull/27821
>>>>>>>>
>>>>>>>> Bests,
>>>>>>>> Dongjoon.
>>>>>>>>
>>>>>>>> On Fri, Feb 28, 2020 at 11:20 AM Sean Owen <sr...@gmail.com>
>>>>>>>> wrote:
>>>>>>>>
>>>>>>>>> On Fri, Feb 28, 2020 at 12:03 PM Holden Karau <
>>>>>>>>> holden@pigscanfly.ca> wrote:
>>>>>>>>> >>     1. Could you estimate how many revert commits are required
>>>>>>>>> in `branch-3.0` for new rubric?
>>>>>>>>>
>>>>>>>>> Fair question about what actual change this implies for 3.0? so
>>>>>>>>> far it
>>>>>>>>> seems like some targeted, quite reasonable reverts. I don't think
>>>>>>>>> anyone's suggesting reverting loads of changes.
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> >>     2. Are you going to revert all removed test cases for the
>>>>>>>>> deprecated ones?
>>>>>>>>> > This is a good point, making sure we keep the tests as well is
>>>>>>>>> important (worse than removing a deprecated API is shipping it broken),.
>>>>>>>>>
>>>>>>>>> (I'd say, yes of course! which seems consistent with what is
>>>>>>>>> happening now)
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> >>     3. Does it make any delay for Apache Spark 3.0.0 release?
>>>>>>>>> >>         (I believe it was previously scheduled on June before
>>>>>>>>> Spark Summit 2020)
>>>>>>>>> >
>>>>>>>>> > I think if we need to delay to make a better release this is ok,
>>>>>>>>> especially given our current preview releases being available to gather
>>>>>>>>> community feedback.
>>>>>>>>>
>>>>>>>>> Of course these things block 3.0 -- all the more reason to keep it
>>>>>>>>> specific and targeted -- but nothing so far seems inconsistent with
>>>>>>>>> finishing in a month or two.
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> >> Although there was a discussion already, I want to make the
>>>>>>>>> following tough parts sure.
>>>>>>>>> >>     4. We are not going to add Scala 2.11 API, right?
>>>>>>>>> > I hope not.
>>>>>>>>> >>
>>>>>>>>> >>     5. We are not going to support Python 2.x in Apache Spark
>>>>>>>>> 3.1+, right?
>>>>>>>>> > I think doing that would be bad, it's already end of lifed
>>>>>>>>> elsewhere.
>>>>>>>>>
>>>>>>>>> Yeah this is an important subtext -- the valuable principles here
>>>>>>>>> could be interpreted in many different ways depending on how much
>>>>>>>>> you
>>>>>>>>> weight value of keeping APIs for compatibility vs value in
>>>>>>>>> simplifying
>>>>>>>>> Spark and pushing users to newer APIs more forcibly. They're all
>>>>>>>>> judgment calls, based on necessarily limited data about the
>>>>>>>>> universe
>>>>>>>>> of users. We can only go on rare direct user feedback, on feedback
>>>>>>>>> perhaps from vendors as proxies for a subset of users, and the
>>>>>>>>> general
>>>>>>>>> good faith judgment of committers who have lived Spark for years.
>>>>>>>>>
>>>>>>>>> My specific interpretation is that the standard is (correctly)
>>>>>>>>> tightening going forward, and retroactively a bit for 3.0. But, I
>>>>>>>>> do
>>>>>>>>> not think anyone is advocating for the logical extreme of, for
>>>>>>>>> example, maintaining Scala 2.11 compatibility indefinitely. I think
>>>>>>>>> that falls out readily from the rubric here: maintaining 2.11
>>>>>>>>> compatibility is really quite painful if you ever support 2.13 too,
>>>>>>>>> for example.
>>>>>>>>>
>>>>>>>>
>>
>> --
>> ---
>> Takeshi Yamamuro
>>
>

Re: [Proposal] Modification to Spark's Semantic Versioning Policy

Posted by Dongjoon Hyun <do...@gmail.com>.
Thank you all, especially for the audit efforts.

Until now, the whole community has been working together in the same
direction with the existing policy. That is always good.

Since it seems that we are considering a new direction, I created an
umbrella JIRA to track all activities.

      https://issues.apache.org/jira/browse/SPARK-31085
      Amend Spark's Semantic Versioning Policy

As we know, a community-wide directional change always has a huge impact
on daily PR reviews and regular releases. So, we had better treat each
reverting PR as a normal independent PR instead of a follow-up.
Specifically, I believe we need the following.

    1. Have new JIRA IDs instead of treating these as simple reverts or
follow-ups. This is because we are not adding everything back blindly.
For example,
            https://issues.apache.org/jira/browse/SPARK-31089
            "Add back ImageSchema.readImages in Spark 3.0"
        was created and closed as 'Won't Do' after weighing the trade-off.
        We need a JIRA-issue-level history for this kind of request and
the decision.

    2. Sometimes, as described by Michael, reverting is insufficient.
        We need to provide a more fine-grained deprecation for users'
safety, case by case.

    3. Given the timeline, newly added APIs should have test coverage in
the same PR from the beginning.
        This is required because the whole reverting effort aims to give
a working API back.

I believe that we are having a good discussion in this thread.
We are making a big change in Apache Spark's history.
Please be part of that history by taking actions like replying, voting,
and reviewing.

Thanks,
Dongjoon.



Re: [Proposal] Modification to Spark's Semantic Versioning Policy

Posted by Takeshi Yamamuro <li...@gmail.com>.
Yea, +1 on Jungtaek's suggestion; having the same strict policy for adding
new APIs looks nice.

> When we making the API changes (e.g., adding the new APIs or changing the
existing APIs), we should regularly publish them in the dev list. I am
willing to lead this effort, work with my colleagues to summarize all the
merged commits [especially the API changes], and then send the *bi-weekly
digest *to the dev list

This digest looks very helpful for the community, thanks, Xiao!

Bests,
Takeshi

-- 
---
Takeshi Yamamuro

Re: [Proposal] Modification to Spark's Semantic Versioning Policy

Posted by Xiao Li <ga...@gmail.com>.
I want to publicly thank *Ruifeng Zheng* for his work listing all the
signature differences in Core, SQL and Hive that we made in this upcoming
release. For details, please read the files attached to SPARK-30982
<https://issues.apache.org/jira/browse/SPARK-30982>. I went over these
files and submitted the following PRs to add back the Spark SQL APIs whose
maintenance costs are low, based on my own experience in Spark SQL
development (a small usage sketch follows the list below):

   - https://github.com/apache/spark/pull/27821
      - functions.toDegrees/toRadians
      - functions.approxCountDistinct
      - functions.monotonicallyIncreasingId
      - Column.!==
      - Dataset.explode
      - Dataset.registerTempTable
      - SQLContext.getOrCreate, setActive, clearActive, constructors
   - https://github.com/apache/spark/pull/27815
      - HiveContext
      - createExternalTable APIs
   - https://github.com/apache/spark/pull/27839
      - SQLContext.applySchema
      - SQLContext.parquetFile
      - SQLContext.jsonFile
      - SQLContext.jsonRDD
      - SQLContext.load
      - SQLContext.jdbc
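
For anyone skimming the list above, here is a minimal, illustrative Scala
sketch (not taken from the PRs themselves) of how a few of the restored
names relate to the currently documented calls; the replacement names used
below are what I believe the current API offers, so treat them as an
assumption rather than part of the PRs:

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.functions._

    object AddedBackAliasesSketch {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder().master("local[*]").getOrCreate()
        import spark.implicits._
        val df = Seq((1, 180.0), (2, 90.0)).toDF("id", "deg")

        // Restored (deprecated) alias vs. the currently recommended call;
        // both register the same temporary view:
        df.registerTempTable("angles")
        df.createOrReplaceTempView("angles")

        // functions.toDegrees/toRadians map to degrees/radians, and
        // monotonicallyIncreasingId to monotonically_increasing_id:
        df.select(degrees(radians($"deg")), monotonically_increasing_id()).show()

        // Column.!== was deprecated in favor of =!=; both mean "not equal":
        df.filter($"id" =!= 1).show()

        spark.stop()
      }
    }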

If you think these APIs should not be added back, let me know and we can
discuss the items further. In general, I think we should provide more
evidence and have a public discussion at the point when we decide to drop
such APIs in the first place.

+1 on Jungtaek's comments. When we make API changes (e.g., adding new
APIs or changing existing ones), we should regularly publish them
on the dev list. I am willing to lead this effort, work with my colleagues
to summarize all the merged commits [especially the API changes], and then
send a *bi-weekly digest* to the dev list. If you are willing to join
this working group and help build these digests, feel free to send me a
note [lixiao@databricks.com].

Cheers,

Xiao





Re: [Proposal] Modification to Spark's Semantic Versioning Policy

Posted by Jungtaek Lim <ka...@gmail.com>.
+1 for Sean as well.

Moreover, as I voiced on the previous thread, if we want to be strict
about retaining public APIs, what we really need to do along with this is
adopt a similar or stricter policy for adding public APIs. If we don't
apply the policy symmetrically, the problem will only get worse: it's still
not that hard to add a public API (it only requires a normal review), but once
an API is added and released it's going to be really hard to remove it.

If we consider adding public APIs and deprecating/removing public APIs as
"critical" changes for the project, IMHO it would give better visibility and
more open discussion if we route them through the dev@ mailing list instead of
directly filing a PR. With so many PRs being submitted, it's nearly
impossible to look into all of them - it would require us to "watch" the repo
and receive tons of mail. Compared with the volume of GitHub PRs, the dev@
mailing list is not that crowded, so there is less chance of missing critical
changes, and they would not be decided quickly by only a couple of committers.

These suggestions would slow down development - which may make us
realize we want to "classify/mark" user-facing public APIs separately from
APIs that are merely exposed as public, and only apply all the policies to
the former. For the latter we don't need to guarantee anything (see the
sketch below).
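
To make the classification idea concrete, a small sketch - assuming the
existing markers in org.apache.spark.annotation (such as @Since and
@DeveloperApi), which I believe are usable by code inside the
org.apache.spark namespace; the split into "user facing" vs. "exposed as
public only" is the hypothetical part, not something the project has adopted:

    // Hypothetical package, placed inside Spark so the annotations resolve.
    package org.apache.spark.examplepolicy

    import org.apache.spark.annotation.{DeveloperApi, Since}

    /** User facing: would get the full add/deprecate/remove policy. */
    @Since("3.0.0")
    class PublicEntryPoint {
      def run(): Unit = ()
    }

    /** Public only for internal wiring: no compatibility guarantee. */
    @DeveloperApi
    class InternalHelper {
      def plumb(): Unit = ()
    }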



Re: [Proposal] Modification to Spark's Semantic Versioning Policy

Posted by Dongjoon Hyun <do...@gmail.com>.
+1 for Sean's concerns and questions.

Bests,
Dongjoon.


Re: [Proposal] Modification to Spark's Semantic Versioning Policy

Posted by Sean Owen <sr...@gmail.com>.
This thread established some good general principles, illustrated by a few
good examples. It didn't draw specific conclusions about what to add back,
which is why it wasn't at all controversial. What it means in specific
cases is where there may be disagreement, and that harder question hasn't
been addressed.

The reverts I have seen so far seemed like the obvious ones, but yes, there
are several more going on now, some pretty broad. I am not even sure what
all of them are. In addition to the ones below, there is
https://github.com/apache/spark/pull/27839. Would it be too much overhead
to post to this thread any changes that one believes are endorsed by these
principles and perhaps a more strict interpretation of them now? It's
important enough we should get any data points or input, and now. (We're
obviously not going to debate each one.) A draft PR, or several, actually
sounds like a good vehicle for that -- as long as people know about them!

Also, is there any usage data available to share? Many arguments turn
on 'commonly used', but can we know that more concretely?

Otherwise I think we'll back into implementing personal interpretations of
general principles, which is arguably the issue in the first place, even
when everyone believes in good faith in the same principles.



>>

Re: [Proposal] Modification to Spark's Semantic Versioning Policy

Posted by Dongjoon Hyun <do...@gmail.com>.
Hi, All.

Recently, reverting PRs seems to be spreading like the *well-known*
virus.
Can we finalize this first, before making unofficial personal decisions?
Technically, this thread was not a vote, and our website doesn't have a
clear policy yet.

https://github.com/apache/spark/pull/27821
[SPARK-25908][SQL][FOLLOW-UP] Add Back Multiple Removed APIs
    ==> This technically reverts most of SPARK-25908.

https://github.com/apache/spark/pull/27835
Revert "[SPARK-25457][SQL] IntegralDivide returns data type of the operands"

https://github.com/apache/spark/pull/27834
Revert [SPARK-24640][SQL] Return `NULL` from `size(NULL)` by default

Bests,
Dongjoon.

>

Re: [Proposal] Modification to Spark's Semantic Versioning Policy

Posted by Dongjoon Hyun <do...@gmail.com>.
Hi, All.

There is an on-going PR from Xiao referencing this email.

https://github.com/apache/spark/pull/27821

Bests,
Dongjoon.


Re: [Proposal] Modification to Spark's Semantic Versioning Policy

Posted by Sean Owen <sr...@gmail.com>.
On Fri, Feb 28, 2020 at 12:03 PM Holden Karau <ho...@pigscanfly.ca> wrote:
>>     1. Could you estimate how many revert commits are required in `branch-3.0` for the new rubric?

Fair question about what actual change this implies for 3.0. So far it
seems like some targeted, quite reasonable reverts. I don't think
anyone's suggesting reverting loads of changes.


>>     2. Are you going to revert all removed test cases for the deprecated ones?
> This is a good point; making sure we keep the tests as well is important (worse than removing a deprecated API is shipping it broken).

(I'd say, yes of course! which seems consistent with what is happening now)


>>     3. Will it cause any delay for the Apache Spark 3.0.0 release?
>>         (I believe it was previously scheduled for June, before Spark Summit 2020)
>
> I think it is OK if we need to delay to make a better release, especially given that our current preview releases are available to gather community feedback.

Of course these things block 3.0 -- all the more reason to keep it
specific and targeted -- but nothing so far seems inconsistent with
finishing in a month or two.


>> Although there was a discussion already, I want to make sure about the following tough parts.
>>     4. We are not going to add Scala 2.11 API, right?
> I hope not.
>>
>>     5. We are not going to support Python 2.x in Apache Spark 3.1+, right?
> I think doing that would be bad; it's already end-of-lifed elsewhere.

Yeah this is an important subtext -- the valuable principles here
could be interpreted in many different ways depending on how much you
weight the value of keeping APIs for compatibility against the value of
simplifying Spark and pushing users to newer APIs more forcibly. They're all
judgment calls, based on necessarily limited data about the universe
of users. We can only go on rare direct user feedback, on feedback
perhaps from vendors as proxies for a subset of users, and the general
good faith judgment of committers who have lived Spark for years.

My specific interpretation is that the standard is (correctly)
tightening going forward, and retroactively a bit for 3.0. But, I do
not think anyone is advocating for the logical extreme of, for
example, maintaining Scala 2.11 compatibility indefinitely. I think
that falls out readily from the rubric here: maintaining 2.11
compatibility is really quite painful if you ever support 2.13 too,
for example.

---------------------------------------------------------------------
To unsubscribe e-mail: dev-unsubscribe@spark.apache.org


Re: [Proposal] Modification to Spark's Semantic Versioning Policy

Posted by Holden Karau <ho...@pigscanfly.ca>.
On Fri, Feb 28, 2020 at 9:48 AM Dongjoon Hyun <do...@gmail.com>
wrote:

> Hi, Matei and Michael.
>
> I'm also a big supporter for policy-based project management.
>
> Before going further,
>
>     1. Could you estimate how many revert commits are required in
> `branch-3.0` for the new rubric?
>     2. Are you going to revert all removed test cases for the deprecated
> ones?
>
This is a good point; making sure we keep the tests as well is important
(worse than removing a deprecated API is shipping it broken).

>     3. Will it cause any delay for the Apache Spark 3.0.0 release?
>         (I believe it was previously scheduled for June, before Spark Summit
> 2020)
>
I think it is OK if we need to delay to make a better release, especially
given that our current preview releases are available to gather community
feedback.

>
> Although there was a discussion already, I want to make sure about the
> following tough parts.
>
>     4. We are not going to add Scala 2.11 API, right?
>
I hope not.

>     5. We are not going to support Python 2.x in Apache Spark 3.1+, right?
>
I think doing that would be bad; it's already end-of-lifed elsewhere.

>     6. Do we have enough resources for testing the deprecated ones?
>         (Currently, we have 8 heavy Jenkins jobs for `branch-3.0` already.)
>
> Especially for (2) and (6), we know that keeping deprecated ones without
> tests doesn't give us any support for the new rubric.
>
> Bests,
> Dongjoon.
>
> On Thu, Feb 27, 2020 at 5:31 PM Matei Zaharia <ma...@gmail.com>
> wrote:
>
>> +1 on this new rubric. It definitely captures the issues I’ve seen in
>> Spark and in other projects. If we write down this rubric (or something
>> like it), it will also be easier to refer to it during code reviews or in
>> proposals of new APIs (we could ask “do you expect to have to change this
>> API in the future, and if so, how”).
>>
>> Matei
>>
>> On Feb 24, 2020, at 3:02 PM, Michael Armbrust <mi...@databricks.com>
>> wrote:
>>
>> Hello Everyone,
>>
>> As more users have started upgrading to Spark 3.0 preview (including
>> myself), there have been many discussions around APIs that have been broken
>> compared with Spark 2.x. In many of these discussions, one of the
>> rationales for breaking an API seems to be "Spark follows semantic
>> versioning <https://spark.apache.org/versioning-policy.html>, so this
>> major release is our chance to get it right [by breaking APIs]". Similarly,
>> in many cases the response to questions about why an API was completely
>> removed has been, "this API has been deprecated since x.x, so we have to
>> remove it".
>>
>> As a long time contributor to and user of Spark this interpretation of
>> the policy is concerning to me. This reasoning misses the intention of the
>> original policy, and I am worried that it will hurt the long-term success
>> of the project.
>>
>> I definitely understand that these are hard decisions, and I'm not
>> proposing that we never remove anything from Spark. However, I would like
>> to give some additional context and also propose a different rubric for
>> thinking about API breakage moving forward.
>>
>> Spark adopted semantic versioning back in 2014 during the preparations
>> for the 1.0 release. As this was the first major release -- and as, up
>> until fairly recently, Spark had only been an academic project -- no real
>> promises had been made about API stability ever.
>>
>> During the discussion, some committers suggested that this was an
>> opportunity to clean up cruft and give the Spark APIs a once-over, making
>> cosmetic changes to improve consistency. However, in the end, it was
>> decided that in many cases it was not in the best interests of the Spark
>> community to break things just because we could. Matei actually said it
>> pretty forcefully
>> <http://apache-spark-developers-list.1001551.n3.nabble.com/Proposal-for-Spark-Release-Strategy-td464i20.html#a503>
>> :
>>
>> I know that some names are suboptimal, but I absolutely detest breaking
>> APIs, config names, etc. I’ve seen it happen way too often in other
>> projects (even things we depend on that are officially post-1.0, like Akka
>> or Protobuf or Hadoop), and it’s very painful. I think that we as fairly
>> cutting-edge users are okay with libraries occasionally changing, but many
>> others will consider it a show-stopper. Given this, I think that any
>> cosmetic change now, even though it might improve clarity slightly, is not
>> worth the tradeoff in terms of creating an update barrier for existing
>> users.
>>
>> In the end, while some changes were made, most APIs remained the same and
>> users of Spark <= 0.9 were pretty easily able to upgrade to 1.0. I think
>> this served the project very well, as compatibility means users are able to
>> upgrade and we keep as many people on the latest versions of Spark (though
>> maybe not the latest APIs of Spark) as possible.
>>
>> As Spark grows, I think compatibility actually becomes more important and
>> we should be more conservative rather than less. Today, there are very
>> likely more Spark programs running than there were at any other time in the
>> past. Spark is no longer a tool only used by advanced hackers, it is now
>> also running "traditional enterprise workloads.'' In many cases these jobs
>> are powering important processes long after the original author leaves.
>>
>> Broken APIs can also affect libraries that extend Spark. This dependency
>> can be even harder for users, as if the library has not been upgraded to
>> use new APIs and they need that library, they are stuck.
>>
>> Given all of this, I'd like to propose the following rubric as an
>> addition to our semantic versioning policy. After discussion and if
>> people agree this is a good idea, I'll call a vote of the PMC to ratify its
>> inclusion in the official policy.
>>
>> Considerations When Breaking APIs
>> The Spark project strives to avoid breaking APIs or silently changing
>> behavior, even at major versions. While this is not always possible, the
>> balance of the following factors should be considered before choosing to
>> break an API.
>>
>> Cost of Breaking an API
>> Breaking an API almost always has a non-trivial cost to the users of
>> Spark. A broken API means that Spark programs need to be rewritten before
>> they can be upgraded. However, there are a few considerations when thinking
>> about what the cost will be:
>>
>>    - Usage - an API that is actively used in many different places, is
>>    always very costly to break. While it is hard to know usage for sure, there
>>    are a bunch of ways that we can estimate:
>>       - How long has the API been in Spark?
>>       - Is the API common even for basic programs?
>>       - How often do we see recent questions in JIRA or mailing lists?
>>       - How often does it appear in StackOverflow or blogs?
>>    - Behavior after the break - How will a program that works today,
>>    work after the break? The following are listed roughly in order of
>>    increasing severity:
>>       - Will there be a compiler or linker error?
>>       - Will there be a runtime exception?
>>       - Will that exception happen after significant processing has been
>>       done?
>>       - Will we silently return different answers? (very hard to debug,
>>       might not even notice!)
>>
>>
>> Cost of Maintaining an API
>> Of course, the above does not mean that we will never break any APIs. We
>> must also consider the cost both to the project and to our users of keeping
>> the API in question.
>>
>>    - Project Costs - Every API we have needs to be tested and needs to
>>    keep working as other parts of the project change. These costs are
>>    significantly exacerbated when external dependencies change (the JVM,
>>    Scala, etc). In some cases, while not completely technically infeasible,
>>    the cost of maintaining a particular API can become too high.
>>    - User Costs - APIs also have a cognitive cost to users learning
>>    Spark or trying to understand Spark programs. This cost becomes even higher
>>    when the API in question has confusing or undefined semantics.
>>
>>
>> Alternatives to Breaking an API
>> In cases where there is a "Bad API", but where the cost of removal is
>> also high, there are alternatives that should be considered that do not
>> hurt existing users but do address some of the maintenance costs.
>>
>>
>>    - Avoid Bad APIs - While this is a bit obvious, it is an important
>>    point. Anytime we are adding a new interface to Spark we should consider
>>    that we might be stuck with this API forever. Think deeply about how
>>    new APIs relate to existing ones, as well as how you expect them to evolve
>>    over time.
>>    - Deprecation Warnings - All deprecation warnings should point to a
>>    clear alternative and should never just say that an API is deprecated.
>>    - Updated Docs - Documentation should point to the "best" recommended
>>    way of performing a given task. In the cases where we maintain legacy
>>    documentation, we should clearly point to newer APIs and suggest to users
>>    the "right" way.
>>    - Community Work - Many people learn Spark by reading blogs and other
>>    sites such as StackOverflow. However, many of these resources are out of
>>    date. Update them, to reduce the cost of eventually removing deprecated
>>    APIs.
>>
>>
>> Examples
>>
>> Here are some examples of how I think the policy above could be applied
>> to different issues that have been discussed recently. These are only to
>> illustrate how to apply the above rubric, but are not intended to be part
>> of the official policy.
>>
>> [SPARK-26362] Remove 'spark.driver.allowMultipleContexts' to disallow
>> multiple creation of SparkContexts #23311
>> <https://github.com/apache/spark/pull/23311>
>>
>>
>>    - Cost to Break - Multiple Contexts in a single JVM never worked
>>    properly. When users tried it they would nearly always report that Spark
>>    was broken (SPARK-2243
>>    <https://issues.apache.org/jira/browse/SPARK-2243>), due to the
>>    confusing set of logs messages. Given this, I think it is very unlikely
>>    that there are many real world use cases active today. Even those cases
>>    likely suffer from undiagnosed issues as there are many areas of Spark that
>>    assume a single context per JVM.
>>    - Cost to Maintain - We have recently had users ask on the mailing
>>    list if this was supported, as the conf led them to believe it was, and the
>>    existence of this configuration as "supported" makes it harder to reason
>>    about certain global state in SparkContext.
>>
>>
>> Decision: Remove this configuration and related code.
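>>
>> As a minimal sketch of the supported pattern after the removal (assuming
>> current SparkSession APIs; this snippet is illustrative only, not part of
>> the decision itself), one context per JVM is what effectively worked all
>> along, with repeated builder calls returning the same session:
>>
>>     import org.apache.spark.sql.SparkSession
>>
>>     val a = SparkSession.builder().master("local[*]").getOrCreate()
>>     val b = SparkSession.builder().getOrCreate()
>>     // Both handles share the single underlying SparkContext, so there is
>>     // no need for spark.driver.allowMultipleContexts.
>>     assert(a.sparkContext eq b.sparkContext)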
>>
>> [SPARK-25908] Remove registerTempTable #22921
>> <https://github.com/apache/spark/pull/22921/> (only looking at one API
>> of this PR)
>>
>>
>>    - Cost to Break - This is a wildly popular API of Spark SQL that has
>>    been there since the first release. There are tons of blog posts and
>>    examples that use this syntax if you google "dataframe
>>    registerTempTable
>>    <https://www.google.com/search?q=dataframe+registertemptable&rlz=1C5CHFA_enUS746US746&oq=dataframe+registertemptable&aqs=chrome.0.0l8.3040j1j7&sourceid=chrome&ie=UTF-8>"
>>    (even more than the "correct" API "dataframe createOrReplaceTempView
>>    <https://www.google.com/search?rlz=1C5CHFA_enUS746US746&ei=TkZMXrj1ObzA0PEPpLKR2A4&q=dataframe+createorreplacetempview&oq=dataframe+createor&gs_l=psy-ab.3.0.0j0i22i30l7.663.1303..2750...0.3..1.212.782.7j0j1......0....1..gws-wiz.......0i71j0i131.zP34wH1novM>").
>>    All of these will be invalid for users of Spark 3.0
>>    - Cost to Maintain - This is just an alias, so there is not a lot of
>>    extra machinery required to keep the API. Users have two ways to do the
>>    same thing, but we can note that this is just an alias in the docs.
>>
>>
>> Decision: Do not remove this API; I would even consider un-deprecating
>> it. I anecdotally asked several users and this is the API they prefer over
>> the "correct" one.
>>
>> [SPARK-25496] Deprecate from_utc_timestamp and to_utc_timestamp #24195
>> <https://github.com/apache/spark/pull/24195>
>>
>>    - Cost to Break - I think that this case actually exemplifies several
>>    anti-patterns in breaking APIs. In some languages, the deprecation warning
>>    gives you no help, other than what version the function was removed in. In
>>    R, it points users to a really deep conversation on the semantics of time
>>    in Spark SQL. None of the messages tell you how you should correctly be
>>    parsing a timestamp that is given to you in a format other than UTC. My
>>    guess is all users will blindly flip the flag to true (to keep using this
>>    function), so you've only succeeded in annoying them.
>>    - Cost to Maintain - These are two relatively isolated expressions,
>>    there should be little cost to keeping them. Users can be confused by their
>>    semantics, so we probably should update the docs to point them to a best
>>    practice (I learned only by complaining on the PR, that a good practice is
>>    to parse timestamps including the timezone in the format expression, which
>>    naturally shifts them to UTC).
>>
>>
>> Decision: Do not deprecate these two functions. We should update the
>> docs to talk about best practices for parsing timestamps, including how to
>> correctly shift them to UTC for storage.
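>>
>> A minimal sketch of the "parse with the zone in the format" practice the
>> docs could point to; the pattern letters and the session time zone config
>> used below are assumptions about the current parser, offered for
>> illustration only:
>>
>>     import org.apache.spark.sql.SparkSession
>>     import org.apache.spark.sql.functions._
>>
>>     val spark = SparkSession.builder()
>>       .master("local[*]")
>>       .config("spark.sql.session.timeZone", "UTC")  // display instants in UTC
>>       .getOrCreate()
>>     import spark.implicits._
>>
>>     // The offset is part of the input, so parsing it shifts the instant to
>>     // UTC without any call to from_utc_timestamp/to_utc_timestamp.
>>     Seq("2020-02-24 15:02:56+09:00").toDF("raw")
>>       .select(to_timestamp($"raw", "yyyy-MM-dd HH:mm:ssXXX").as("ts"))
>>       .show(false)  // expected: 2020-02-24 06:02:56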
>>
>> [SPARK-28093] Fix TRIM/LTRIM/RTRIM function parameter order issue #24902
>> <https://github.com/apache/spark/pull/24902>
>>
>>
>>    - Cost to Break - The TRIM function takes two string parameters. If
>>    we switch the parameter order, queries that use the TRIM function would
>>    silently get different results on different versions of Spark. Users may
>>    not notice it for a long time and wrong query results may cause serious
>>    problems to users.
>>    - Cost to Maintain - We will have some inconsistency inside Spark, as
>>    the TRIM function in Scala API and in SQL have different parameter order.
>>
>>
>> Decision: Do not switch the parameter order. Promote the TRIM(trimStr
>> FROM srcStr) syntax in our SQL docs, as it's the SQL standard. Deprecate
>> (with a warning, not by removing) the SQL TRIM function and move users to
>> the SQL standard TRIM syntax.
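>>
>> A minimal sketch of the SQL standard syntax the decision promotes; the
>> results in the comments are what I would expect, and the two-argument form
>> is deliberately avoided because its argument order is the contested part:
>>
>>     import org.apache.spark.sql.SparkSession
>>
>>     val spark = SparkSession.builder().master("local[*]").getOrCreate()
>>     // TRIM([BOTH|LEADING|TRAILING] trimStr FROM srcStr) leaves no room for
>>     // parameter-order confusion because the grammar names each role.
>>     spark.sql("SELECT trim(BOTH 'x' FROM 'xxSparkxx')").show()     // Spark
>>     spark.sql("SELECT trim(LEADING 'x' FROM 'xxSparkxx')").show()  // Sparkxx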
>>
>> Thanks for taking the time to read this! Happy to discuss the specifics
>> and amend this policy as the community sees fit.
>>
>> Michael
>>
>>
>>

-- 
Twitter: https://twitter.com/holdenkarau
Books (Learning Spark, High Performance Spark, etc.):
https://amzn.to/2MaRAG9
YouTube Live Streams: https://www.youtube.com/user/holdenkarau

Re: [Proposal] Modification to Spark's Semantic Versioning Policy

Posted by Dongjoon Hyun <do...@gmail.com>.
Hi, Matei and Michael.

I'm also a big supporter for policy-based project management.

Before going further,

    1. Could you estimate how many revert commits are required in
`branch-3.0` for the new rubric?
    2. Are you going to revert all removed test cases for the deprecated
ones?
    3. Will it cause any delay for the Apache Spark 3.0.0 release?
        (I believe it was previously scheduled for June, before Spark Summit
2020)

Although there was a discussion already, I want to make sure about the
following tough parts.

    4. We are not going to add Scala 2.11 API, right?
    5. We are not going to support Python 2.x in Apache Spark 3.1+, right?
    6. Do we have enough resources for testing the deprecated ones?
        (Currently, we have 8 heavy Jenkins jobs for `branch-3.0` already.)

Especially for (2) and (6), we know that keeping deprecated ones without
tests doesn't give us any support for the new rubric.

Bests,
Dongjoon.

On Thu, Feb 27, 2020 at 5:31 PM Matei Zaharia <ma...@gmail.com>
wrote:

> +1 on this new rubric. It definitely captures the issues I’ve seen in
> Spark and in other projects. If we write down this rubric (or something
> like it), it will also be easier to refer to it during code reviews or in
> proposals of new APIs (we could ask “do you expect to have to change this
> API in the future, and if so, how”).
>
> Matei
>
> On Feb 24, 2020, at 3:02 PM, Michael Armbrust <mi...@databricks.com>
> wrote:
>
> Hello Everyone,
>
> As more users have started upgrading to Spark 3.0 preview (including
> myself), there have been many discussions around APIs that have been broken
> compared with Spark 2.x. In many of these discussions, one of the
> rationales for breaking an API seems to be "Spark follows semantic
> versioning <https://spark.apache.org/versioning-policy.html>, so this
> major release is our chance to get it right [by breaking APIs]". Similarly,
> in many cases the response to questions about why an API was completely
> removed has been, "this API has been deprecated since x.x, so we have to
> remove it".
>
> As a long time contributor to and user of Spark this interpretation of the
> policy is concerning to me. This reasoning misses the intention of the
> original policy, and I am worried that it will hurt the long-term success
> of the project.
>
> I definitely understand that these are hard decisions, and I'm not
> proposing that we never remove anything from Spark. However, I would like
> to give some additional context and also propose a different rubric for
> thinking about API breakage moving forward.
>
> Spark adopted semantic versioning back in 2014 during the preparations for
> the 1.0 release. As this was the first major release -- and as, up until
> fairly recently, Spark had only been an academic project -- no real
> promises had been made about API stability ever.
>
> During the discussion, some committers suggested that this was an
> opportunity to clean up cruft and give the Spark APIs a once-over, making
> cosmetic changes to improve consistency. However, in the end, it was
> decided that in many cases it was not in the best interests of the Spark
> community to break things just because we could. Matei actually said it
> pretty forcefully
> <http://apache-spark-developers-list.1001551.n3.nabble.com/Proposal-for-Spark-Release-Strategy-td464i20.html#a503>
> :
>
> I know that some names are suboptimal, but I absolutely detest breaking
> APIs, config names, etc. I’ve seen it happen way too often in other
> projects (even things we depend on that are officially post-1.0, like Akka
> or Protobuf or Hadoop), and it’s very painful. I think that we as fairly
> cutting-edge users are okay with libraries occasionally changing, but many
> others will consider it a show-stopper. Given this, I think that any
> cosmetic change now, even though it might improve clarity slightly, is not
> worth the tradeoff in terms of creating an update barrier for existing
> users.
>
> In the end, while some changes were made, most APIs remained the same and
> users of Spark <= 0.9 were pretty easily able to upgrade to 1.0. I think
> this served the project very well, as compatibility means users are able to
> upgrade and we keep as many people on the latest versions of Spark (though
> maybe not the latest APIs of Spark) as possible.
>
> As Spark grows, I think compatibility actually becomes more important and
> we should be more conservative rather than less. Today, there are very
> likely more Spark programs running than there were at any other time in the
> past. Spark is no longer a tool only used by advanced hackers, it is now
> also running "traditional enterprise workloads.'' In many cases these jobs
> are powering important processes long after the original author leaves.
>
> Broken APIs can also affect libraries that extend Spark. This dependency
> can be even harder for users, as if the library has not been upgraded to
> use new APIs and they need that library, they are stuck.
>
> Given all of this, I'd like to propose the following rubric as an addition
> to our semantic versioning policy. After discussion and if people agree
> this is a good idea, I'll call a vote of the PMC to ratify its inclusion in
> the official policy.
>
> Considerations When Breaking APIs
> The Spark project strives to avoid breaking APIs or silently changing
> behavior, even at major versions. While this is not always possible, the
> balance of the following factors should be considered before choosing to
> break an API.
>
> Cost of Breaking an API
> Breaking an API almost always has a non-trivial cost to the users of
> Spark. A broken API means that Spark programs need to be rewritten before
> they can be upgraded. However, there are a few considerations when thinking
> about what the cost will be:
>
>    - Usage - an API that is actively used in many different places is
>    always very costly to break. While it is hard to know usage for sure, there
>    are a bunch of ways that we can estimate:
>       - How long has the API been in Spark?
>       - Is the API common even for basic programs?
>       - How often do we see recent questions in JIRA or mailing lists?
>       - How often does it appear in StackOverflow or blogs?
>    - Behavior after the break - How will a program that works today
>    work after the break? The following are listed roughly in order of
>    increasing severity:
>       - Will there be a compiler or linker error?
>       - Will there be a runtime exception?
>       - Will that exception happen after significant processing has been
>       done?
>       - Will we silently return different answers? (very hard to debug,
>       might not even notice!)
>
>
> Cost of Maintaining an API
> Of course, the above does not mean that we will never break any APIs. We
> must also consider the cost both to the project and to our users of keeping
> the API in question.
>
>    - Project Costs - Every API we have needs to be tested and needs to
>    keep working as other parts of the project change. These costs are
>    significantly exacerbated when external dependencies change (the JVM,
>    Scala, etc). In some cases, while not completely technically infeasible,
>    the cost of maintaining a particular API can become too high.
>    - User Costs - APIs also have a cognitive cost to users learning Spark
>    or trying to understand Spark programs. This cost becomes even higher when
>    the API in question has confusing or undefined semantics.
>
>
> Alternatives to Breaking an API
> In cases where there is a "Bad API", but where the cost of removal is also
> high, there are alternatives that should be considered that do not hurt
> existing users but do address some of the maintenance costs.
>
>
>    - Avoid Bad APIs - While this is a bit obvious, it is an important
>    point. Anytime we are adding a new interface to Spark we should consider
>    that we might be stuck with this API forever. Think deeply about how
>    new APIs relate to existing ones, as well as how you expect them to evolve
>    over time.
>    - Deprecation Warnings - All deprecation warnings should point to a
>    clear alternative and should never just say that an API is deprecated
>    (see the sketch after this list).
>    - Updated Docs - Documentation should point to the "best" recommended
>    way of performing a given task. In the cases where we maintain legacy
>    documentation, we should clearly point to newer APIs and suggest to users
>    the "right" way.
>    - Community Work - Many people learn Spark by reading blogs and other
>    sites such as StackOverflow. However, many of these resources are out of
>    date. Update them, to reduce the cost of eventually removing deprecated
>    APIs.
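> 
> To make the deprecation-warning point above concrete, a message that names
> the replacement gives users an immediate path forward. A simplified Scala
> sketch (just the shape of it, not Spark's actual source):
> 
>     class LegacyApi {
>       // The warning tells users exactly what to call instead, and since when.
>       @deprecated("Use createOrReplaceTempView(viewName) instead.", "2.0.0")
>       def registerTempTable(viewName: String): Unit =
>         createOrReplaceTempView(viewName)
> 
>       def createOrReplaceTempView(viewName: String): Unit = { /* ... */ }
>     }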
>
>
> Examples
>
> Here are some examples of how I think the policy above could be applied to
> different issues that have been discussed recently. These are only to
> illustrate how to apply the above rubric, but are not intended to be part
> of the official policy.
>
> [SPARK-26362] Remove 'spark.driver.allowMultipleContexts' to disallow
> multiple creation of SparkContexts #23311
> <https://github.com/apache/spark/pull/23311>
>
>
>    - Cost to Break - Multiple Contexts in a single JVM never worked
>    properly. When users tried it they would nearly always report that Spark
>    was broken (SPARK-2243
>    <https://issues.apache.org/jira/browse/SPARK-2243>), due to the
>    confusing set of log messages. Given this, I think it is very unlikely
>    that there are many real world use cases active today. Even those cases
>    likely suffer from undiagnosed issues as there are many areas of Spark that
>    assume a single context per JVM.
>    - Cost to Maintain - We have recently had users ask on the mailing
>    list if this was supported, as the conf led them to believe it was, and the
>    existence of this configuration as "supported" makes it harder to reason
>    about certain global state in SparkContext.
>
>
> Decision: Remove this configuration and related code.
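> 
> For reference, the pattern this conf appeared to enable looked roughly like
> the following in 2.x (a sketch, e.g. in spark-shell where a first context
> already exists; the conf merely stops construction from failing fast, it
> does not make a second context safe):
> 
>     import org.apache.spark.{SparkConf, SparkContext}
> 
>     val conf = new SparkConf()
>       .setMaster("local[*]")
>       .setAppName("second-context")
>       .set("spark.driver.allowMultipleContexts", "true")
>     // Much of Spark still assumes one context per JVM.
>     val sc2 = new SparkContext(conf)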
>
> [SPARK-25908] Remove registerTempTable #22921
> <https://github.com/apache/spark/pull/22921/> (only looking at one API of
> this PR)
>
>
>    - Cost to Break - This is a wildly popular API of Spark SQL that has
>    been there since the first release. There are tons of blog posts and
>    examples that use this syntax if you google "dataframe
>    registerTempTable
>    <https://www.google.com/search?q=dataframe+registertemptable&rlz=1C5CHFA_enUS746US746&oq=dataframe+registertemptable&aqs=chrome.0.0l8.3040j1j7&sourceid=chrome&ie=UTF-8>"
>    (even more than the "correct" API "dataframe createOrReplaceTempView
>    <https://www.google.com/search?rlz=1C5CHFA_enUS746US746&ei=TkZMXrj1ObzA0PEPpLKR2A4&q=dataframe+createorreplacetempview&oq=dataframe+createor&gs_l=psy-ab.3.0.0j0i22i30l7.663.1303..2750...0.3..1.212.782.7j0j1......0....1..gws-wiz.......0i71j0i131.zP34wH1novM>").
>    All of these will be invalid for users of Spark 3.0.
>    - Cost to Maintain - This is just an alias, so there is not a lot of
>    extra machinery required to keep the API. Users have two ways to do the
>    same thing, but we can note that this is just an alias in the docs.
>
>
> Decision: Do not remove this API; I would even consider un-deprecating
> it. I anecdotally asked several users, and this is the API they prefer over
> the "correct" one.
>
> [SPARK-25496] Deprecate from_utc_timestamp and to_utc_timestamp #24195
> <https://github.com/apache/spark/pull/24195>
>
>    - Cost to Break - I think that this case actually exemplifies several
>    anti-patterns in breaking APIs. In some languages, the deprecation warning
>    gives you no help, other than what version the function was removed in. In
>    R, it points users to a really deep conversation on the semantics of time
>    in Spark SQL. None of the messages tell you how you should correctly be
>    parsing a timestamp that is given to you in a format other than UTC. My
>    guess is all users will blindly flip the flag to true (to keep using this
>    function), so you've only succeeded in annoying them.
>    - Cost to Maintain - These are two relatively isolated expressions;
>    there should be little cost to keeping them. Users can be confused by their
>    semantics, so we probably should update the docs to point them to a best
>    practice (I learned only by complaining on the PR that a good practice is
>    to parse timestamps including the timezone in the format expression, which
>    naturally shifts them to UTC).
>
>
> Decision: Do not deprecate these two functions. We should update the docs
> to talk about best practices for parsing timestamps, including how to
> correctly shift them to UTC for storage.
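> 
> A sketch of that best practice (the exact pattern letters and column names
> here are illustrative; the point is that the zone offset is part of the
> format, so the parsed value is already normalized to UTC internally):
> 
>     import spark.implicits._
>     import org.apache.spark.sql.functions._
> 
>     val raw = Seq("2020-02-24 15:02:56 -08:00").toDF("ts_string")
> 
>     // Parsing with the offset in the pattern yields an instant;
>     // no separate to_utc_timestamp call is needed.
>     val parsed = raw.withColumn(
>       "ts", to_timestamp($"ts_string", "yyyy-MM-dd HH:mm:ss XXX"))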
>
> [SPARK-28093] Fix TRIM/LTRIM/RTRIM function parameter order issue #24902
> <https://github.com/apache/spark/pull/24902>
>
>
>    - Cost to Break - The TRIM function takes two string parameters. If we
>    switch the parameter order, queries that use the TRIM function would
>    silently get different results on different versions of Spark. Users may
>    not notice it for a long time and wrong query results may cause serious
>    problems to users.
>    - Cost to Maintain - We will have some inconsistency inside Spark, as
>    the TRIM function in Scala API and in SQL have different parameter order.
>
>
> Decision: Do not switch the parameter order. Promote the TRIM(trimStr
> FROM srcStr) syntax in our SQL docs, as it's the SQL standard. Deprecate
> (with a warning, not by removing) the two-argument SQL TRIM function and
> move users to the SQL-standard TRIM syntax.
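> 
> The SQL-standard forms we would promote look like this (a sketch; expected
> results in the comments):
> 
>     spark.sql("SELECT trim(BOTH 'x' FROM 'xxSparkxx')").show()      // Spark
>     spark.sql("SELECT trim(LEADING 'x' FROM 'xxSparkxx')").show()   // Sparkxx
>     spark.sql("SELECT trim(TRAILING 'x' FROM 'xxSparkxx')").show()  // xxSpark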
>
> Thanks for taking the time to read this! Happy to discuss the specifics
> and amend this policy as the community sees fit.
>
> Michael
>
>
>

Re: [Proposal] Modification to Spark's Semantic Versioning Policy

Posted by Matei Zaharia <ma...@gmail.com>.
+1 on this new rubric. It definitely captures the issues I’ve seen in Spark and in other projects. If we write down this rubric (or something like it), it will also be easier to refer to it during code reviews or in proposals of new APIs (we could ask “do you expect to have to change this API in the future, and if so, how”).

Matei

> On Feb 24, 2020, at 3:02 PM, Michael Armbrust <mi...@databricks.com> wrote:
> 


Re: [Proposal] Modification to Spark's Semantic Versioning Policy

Posted by Michael Armbrust <mi...@databricks.com>.
Thanks for the discussion! A few responses:

> The decision needs to happen at API/config change time; otherwise the
> deprecation warning has no purpose if we are never going to remove them.
>

Even if we never remove an API, I think deprecation warnings (when done
right) can still serve a purpose. For new users, a deprecation can serve as
a pointer to newer, faster APIs or ones with less sharp edges. I would be
supportive of efforts that use them to clean up the docs. For example, we
could hide deprecated APIs after some time so they don't clutter the Scala/Java
docs. We can and should audit things like the user guide and our own
examples to make sure they don't use deprecated APIs.
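
For example, one mechanical way to do that audit is to build the examples with
deprecation warnings escalated to errors (a sketch of the sbt setting, not a
concrete proposal for our build):

    // build.sbt
    scalacOptions ++= Seq("-deprecation", "-Xfatal-warnings")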


> That said, we still need to be able to remove deprecated things and change
> APIs in major releases, otherwise why do a major release in the first
> place? Is it purely to support newer Scala/Python/Java versions?
>

I don't think major versions are purely for
Scala/Java/Python/Hive/Metastore upgrades, but they are a good chance to move
the project forward. Spark 3.0 has a lot of upgrades here, and I think we made
the right trade-offs, even though there are some API breaks.

Major versions are also a good time to introduce major changes (e.g., in 2.0 we
released whole-stage code generation).


> I think the hardest part listed here is what the impact is. Whose call is
> that? It's hard to know how everyone is using things, and I think it's been
> harder to get feedback on SPIPs and API changes in general as people are
> busy with other things.
>

This is the hardest part, and we won't always get it right. I think that
having the rubric though will help guide the conversation and help
reviewers ask the right questions.

One other thing I'll add: sometimes the users come to us, and we should
listen! I was very surprised by the response to Karen's email on this list
last week. An actual user was giving us feedback on the impact of the
changes in Spark 3.0, and rather than listening, there was a lot of pushback.
Users are never wrong when they are telling you what matters to them!


> Like you mention, I think StackOverflow is unreliable; the posts could be
> many years old and no longer relevant.
>

While this is unfortunate, I think anything we can do to keep these
answers relevant (either by updating them or by not breaking them) is good
for the health of the Spark community.

Re: [Proposal] Modification to Spark's Semantic Versioning Policy

Posted by Tom Graves <tg...@yahoo.com.INVALID>.
In general +1. I think these are good guidelines, and making it easier to upgrade is beneficial to everyone. The decision needs to happen at API/config change time; otherwise the deprecation warning has no purpose if we are never going to remove them. That said, we still need to be able to remove deprecated things and change APIs in major releases, otherwise why do a major release in the first place? Is it purely to support newer Scala/Python/Java versions?
I think the hardest part listed here is what the impact is. Whose call is that? It's hard to know how everyone is using things, and I think it's been harder to get feedback on SPIPs and API changes in general as people are busy with other things. Like you mention, I think StackOverflow is unreliable; the posts could be many years old and no longer relevant.
Tom

On Monday, February 24, 2020, 05:03:44 PM CST, Michael Armbrust <mi...@databricks.com> wrote:
 
 

Re: [Proposal] Modification to Spark's Semantic Versioning Policy

Posted by Jules Damji <dm...@comcast.net>.
+1 

Well said! 

Sent from my iPhone
Pardon the dumb thumb typos :)

> On Feb 24, 2020, at 3:03 PM, Michael Armbrust <mi...@databricks.com> wrote:
> 
> 

Re: [Proposal] Modification to Spark's Semantic Versioning Policy

Posted by Michel Miotto Barbosa <mm...@gmail.com>.
+1


*_________________________________________________________*

*Michel Miotto Barbosa*, *Data Science/Software Engineer*

Learn MBA Global Financial Broker at IBMEC SP,
Learn Economic Science at PUC SP

MBA in Project Management, Graduate in Software Engineering

phone: +55 11 984 342 347,  @michelmb <https://twitter.com/MichelMB>

https://br.linkedin.com/in/michelmiottobarbosa



On Mon, Feb 24, 2020 at 20:03, Michael Armbrust <
michael@databricks.com> wrote:

> Hello Everyone,
>
> As more users have started upgrading to Spark 3.0 preview (including
> myself), there have been many discussions around APIs that have been broken
> compared with Spark 2.x. In many of these discussions, one of the
> rationales for breaking an API seems to be "Spark follows semantic
> versioning <https://spark.apache.org/versioning-policy.html>, so this
> major release is our chance to get it right [by breaking APIs]". Similarly,
> in many cases the response to questions about why an API was completely
> removed has been, "this API has been deprecated since x.x, so we have to
> remove it".
>
> As a long time contributor to and user of Spark this interpretation of the
> policy is concerning to me. This reasoning misses the intention of the
> original policy, and I am worried that it will hurt the long-term success
> of the project.
>
> I definitely understand that these are hard decisions, and I'm not
> proposing that we never remove anything from Spark. However, I would like
> to give some additional context and also propose a different rubric for
> thinking about API breakage moving forward.
>
> Spark adopted semantic versioning back in 2014 during the preparations for
> the 1.0 release. As this was the first major release -- and as, up until
> fairly recently, Spark had only been an academic project -- no real
> promises had been made about API stability ever.
>
> During the discussion, some committers suggested that this was an
> opportunity to clean up cruft and give the Spark APIs a once-over, making
> cosmetic changes to improve consistency. However, in the end, it was
> decided that in many cases it was not in the best interests of the Spark
> community to break things just because we could. Matei actually said it
> pretty forcefully
> <http://apache-spark-developers-list.1001551.n3.nabble.com/Proposal-for-Spark-Release-Strategy-td464i20.html#a503>
> :
>
> I know that some names are suboptimal, but I absolutely detest breaking
> APIs, config names, etc. I’ve seen it happen way too often in other
> projects (even things we depend on that are officially post-1.0, like Akka
> or Protobuf or Hadoop), and it’s very painful. I think that we as fairly
> cutting-edge users are okay with libraries occasionally changing, but many
> others will consider it a show-stopper. Given this, I think that any
> cosmetic change now, even though it might improve clarity slightly, is not
> worth the tradeoff in terms of creating an update barrier for existing
> users.
>
> In the end, while some changes were made, most APIs remained the same and
> users of Spark <= 0.9 were pretty easily able to upgrade to 1.0. I think
> this served the project very well, as compatibility means users are able to
> upgrade and we keep as many people on the latest versions of Spark (though
> maybe not the latest APIs of Spark) as possible.
>
> As Spark grows, I think compatibility actually becomes more important and
> we should be more conservative rather than less. Today, there are very
> likely more Spark programs running than there were at any other time in the
> past. Spark is no longer a tool only used by advanced hackers, it is now
> also running "traditional enterprise workloads.'' In many cases these jobs
> are powering important processes long after the original author leaves.
>
> Broken APIs can also affect libraries that extend Spark. This dependency
> can be even harder for users, as if the library has not been upgraded to
> use new APIs and they need that library, they are stuck.
>
> Given all of this, I'd like to propose the following rubric as an addition
> to our semantic versioning policy. After discussion and if people agree
> this is a good idea, I'll call a vote of the PMC to ratify its inclusion in
> the official policy.
>
> Considerations When Breaking APIs
>
> The Spark project strives to avoid breaking APIs or silently changing
> behavior, even at major versions. While this is not always possible, the
> balance of the following factors should be considered before choosing to
> break an API.
>
> Cost of Breaking an API
>
> Breaking an API almost always has a non-trivial cost to the users of
> Spark. A broken API means that Spark programs need to be rewritten before
> they can be upgraded. However, there are a few considerations when thinking
> about what the cost will be:
>
>    -
>
>    Usage - an API that is actively used in many different places, is
>    always very costly to break. While it is hard to know usage for sure, there
>    are a bunch of ways that we can estimate:
>    -
>
>       How long has the API been in Spark?
>       -
>
>       Is the API common even for basic programs?
>       -
>
>       How often do we see recent questions in JIRA or mailing lists?
>       -
>
>       How often does it appear in StackOverflow or blogs?
>       -
>
>    Behavior after the break - How will a program that works today, work
>    after the break? The following are listed roughly in order of increasing
>    severity:
>    -
>
>       Will there be a compiler or linker error?
>       -
>
>       Will there be a runtime exception?
>       -
>
>       Will that exception happen after significant processing has been
>       done?
>       -
>
>       Will we silently return different answers? (very hard to debug,
>       might not even notice!)
>
>
> Cost of Maintaining an API
>
> Of course, the above does not mean that we will never break any APIs. We
> must also consider the cost both to the project and to our users of keeping
> the API in question.
>
>    -
>
>    Project Costs - Every API we have needs to be tested and needs to keep
>    working as other parts of the project changes. These costs are
>    significantly exacerbated when external dependencies change (the JVM,
>    Scala, etc). In some cases, while not completely technically infeasible,
>    the cost of maintaining a particular API can become too high.
>    -
>
>    User Costs - APIs also have a cognitive cost to users learning Spark
>    or trying to understand Spark programs. This cost becomes even higher when
>    the API in question has confusing or undefined semantics.
>
>
> Alternatives to Breaking an API
>
> In cases where there is a "Bad API", but where the cost of removal is also
> high, there are alternatives that should be considered that do not hurt
> existing users but do address some of the maintenance costs.
>
>
>    -
>
>    Avoid Bad APIs - While this is a bit obvious, it is an important
>    point. Anytime we are adding a new interface to Spark we should consider
>    that we might be stuck with this API forever. Think deeply about how
>    new APIs relate to existing ones, as well as how you expect them to evolve
>    over time.
>    -
>
>    Deprecation Warnings - All deprecation warnings should point to a
>    clear alternative and should never just say that an API is deprecated.
>    -
>
>    Updated Docs - Documentation should point to the "best" recommended
>    way of performing a given task. In the cases where we maintain legacy
>    documentation, we should clearly point to newer APIs and suggest to users
>    the "right" way.
>    -
>
>    Community Work - Many people learn Spark by reading blogs and other
>    sites such as StackOverflow. However, many of these resources are out of
>    date. Update them, to reduce the cost of eventually removing deprecated
>    APIs.
>
>
> Examples
>
> Here are some examples of how I think the policy above could be applied to
> different issues that have been discussed recently. These are only to
> illustrate how to apply the above rubric, but are not intended to be part
> of the official policy.
>
> [SPARK-26362] Remove 'spark.driver.allowMultipleContexts' to disallow
> multiple creation of SparkContexts #23311
> <https://github.com/apache/spark/pull/23311>
>
>
>    -
>
>    Cost to Break - Multiple Contexts in a single JVM never worked
>    properly. When users tried it they would nearly always report that Spark
>    was broken (SPARK-2243
>    <https://issues.apache.org/jira/browse/SPARK-2243>), due to the
>    confusing set of logs messages. Given this, I think it is very unlikely
>    that there are many real world use cases active today. Even those cases
>    likely suffer from undiagnosed issues as there are many areas of Spark that
>    assume a single context per JVM.
>    -
>
>    Cost to Maintain - We have recently had users ask on the mailing list
>    if this was supported, as the conf led them to believe it was, and the
>    existence of this configuration as "supported" makes it harder to reason
>    about certain global state in SparkContext.
>
>
> Decision: Remove this configuration and related code.
>
> [SPARK-25908] Remove registerTempTable #22921
> <https://github.com/apache/spark/pull/22921/> (only looking at one API of
> this PR)
>
>
>    -
>
>    Cost to Break - This is a wildly popular API of Spark SQL that has
>    been there since the first release. There are tons of blog posts and
>    examples that use this syntax if you google "dataframe
>    registerTempTable
>    <https://www.google.com/search?q=dataframe+registertemptable&rlz=1C5CHFA_enUS746US746&oq=dataframe+registertemptable&aqs=chrome.0.0l8.3040j1j7&sourceid=chrome&ie=UTF-8>"
>    (even more than the "correct" API "dataframe createOrReplaceView
>    <https://www.google.com/search?rlz=1C5CHFA_enUS746US746&ei=TkZMXrj1ObzA0PEPpLKR2A4&q=dataframe+createorreplacetempview&oq=dataframe+createor&gs_l=psy-ab.3.0.0j0i22i30l7.663.1303..2750...0.3..1.212.782.7j0j1......0....1..gws-wiz.......0i71j0i131.zP34wH1novM>").
>    All of these would become invalid for users of Spark 3.0.
>    -
>
>    Cost to Maintain - This is just an alias, so there is not a lot of
>    extra machinery required to keep the API. Users have two ways to do the
>    same thing, but we can note in the docs that this is just an alias.
>
>
> Decision: Do not remove this API; I would even consider un-deprecating
> it. I anecdotally asked several users, and this is the API they prefer over
> the "correct" one.
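>
> To illustrate, a minimal sketch on Spark 2.x, where both APIs are present
> (the session setup, view name, and data are placeholders), showing that the
> two calls are interchangeable:
>
>    import org.apache.spark.sql.SparkSession
>
>    val spark = SparkSession.builder()
>      .appName("temp-view-demo").master("local[*]").getOrCreate()
>    val df = spark.range(3).toDF("id")
>
>    df.registerTempTable("numbers")        // deprecated alias (warns)
>    df.createOrReplaceTempView("numbers")  // the "correct" API; same effect
>    spark.sql("SELECT count(*) FROM numbers").show()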
>
> [SPARK-25496] Deprecate from_utc_timestamp and to_utc_timestamp #24195
> <https://github.com/apache/spark/pull/24195>
>
>    -
>
>    Cost to Break - I think that this case actually exemplifies several
>    anti-patterns in breaking APIs. In some languages, the deprecation warning
>    gives you no help, other than which version the function will be removed
>    in. In R, it points users to a really deep conversation on the semantics
>    of time in Spark SQL. None of the messages tell you how you should
>    correctly parse a timestamp that is given to you in a format other than
>    UTC. My guess is that all users will blindly flip the flag to true (to
>    keep using this function), so you've only succeeded in annoying them.
>    -
>
>    Cost to Maintain - These are two relatively isolated expressions, so
>    there should be little cost to keeping them. Users can be confused by their
>    semantics, so we probably should update the docs to point them to a best
>    practice (I learned, only by complaining on the PR, that a good practice is
>    to parse timestamps with the timezone included in the format expression,
>    which naturally shifts them to UTC).
>
>
> Decision: Do not deprecate these two functions. We should update the docs
> to talk about best practices for parsing timestamps, including how to
> correctly shift them to UTC for storage.
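>
> As a starting point for those docs, a minimal sketch of the practice above
> (the session setup, column name, and sample value are placeholders): include
> the offset in the format string so that parsing lands the value in UTC:
>
>    import org.apache.spark.sql.SparkSession
>    import org.apache.spark.sql.functions.to_timestamp
>
>    val spark = SparkSession.builder()
>      .appName("tz-demo").master("local[*]").getOrCreate()
>    import spark.implicits._
>
>    val df = Seq("2020-02-24 15:03:00-08:00").toDF("raw")
>
>    // The XXX pattern consumes the explicit offset, so the parsed timestamp
>    // is 2020-02-24 23:03:00 UTC internally (shown in the session zone).
>    df.select(to_timestamp($"raw", "yyyy-MM-dd HH:mm:ssXXX").as("ts")).show(false)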
>
> [SPARK-28093] Fix TRIM/LTRIM/RTRIM function parameter order issue #24902
> <https://github.com/apache/spark/pull/24902>
>
>
>    -
>
>    Cost to Break - The TRIM function takes two string parameters. If we
>    switch the parameter order, queries that use the TRIM function would
>    silently get different results on different versions of Spark. Users may
>    not notice it for a long time, and wrong query results can cause serious
>    problems.
>    -
>
>    Cost to Maintain - We will have some inconsistency inside Spark, as
>    the TRIM function in the Scala API and in SQL will have different
>    parameter orders.
>
>
> Decision: Do not switch the parameter order. Promote the TRIM(trimStr
> FROM srcStr) syntax in our SQL docs, as it is the SQL standard. Deprecate
> (with a warning, not by removing) the two-parameter SQL TRIM function and
> move users to the SQL-standard TRIM syntax.
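>
> For completeness, a minimal sketch of the SQL-standard form we would promote,
> run through spark.sql (the session setup and literals are placeholders, and
> this assumes a Spark version that accepts the standard syntax, e.g. 3.0):
>
>    import org.apache.spark.sql.SparkSession
>
>    val spark = SparkSession.builder()
>      .appName("trim-demo").master("local[*]").getOrCreate()
>
>    // Standard syntax: the trim characters come before FROM and the source
>    // string after it, so there is no parameter-order ambiguity.
>    spark.sql("SELECT trim('x' FROM 'xxSparkxx') AS trimmed").show()  // expect: Spark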
>
> Thanks for taking the time to read this! Happy to discuss the specifics
> and amend this policy as the community sees fit.
>
> Michael
>
>

Re: [Proposal] Modification to Spark's Semantic Versioning Policy

Posted by John Zhuge <jz...@apache.org>.
Well written, Michael!

Believe it or not, I read through the entire email, which is very rare for
emails of such length. Happy to see healthy discussions on this tough subject.
We definitely need perspectives from both the users and the contributors.


On Tue, Feb 25, 2020 at 9:09 PM Xiao Li <ga...@gmail.com> wrote:

> +1
>
> Xiao

-- 
John Zhuge

Re: [Proposal] Modification to Spark's Semantic Versioning Policy

Posted by Xiao Li <ga...@gmail.com>.
+1

Xiao
