Posted to dev@spark.apache.org by Michael Armbrust <mi...@databricks.com> on 2020/03/06 21:01:58 UTC

[VOTE] Amend Spark's Semantic Versioning Policy

I propose to add the following text to Spark's Semantic Versioning policy
<https://spark.apache.org/versioning-policy.html> and adopt it as the
rubric that should be used when deciding to break APIs (even at major
versions such as 3.0).


I'll leave the vote open until Tuesday, March 10th at 2pm. As this is a
procedural vote <https://www.apache.org/foundation/voting.html>, the measure
will pass if there are more favourable votes than unfavourable ones. PMC votes
are binding, but the community is encouraged to add their voice to the
discussion.


[ ] +1 - Spark should adopt this policy.

[ ] -1  - Spark should not adopt this policy.


<new policy>


Considerations When Breaking APIs

The Spark project strives to avoid breaking APIs or silently changing
behavior, even at major versions. While this is not always possible, the
balance of the following factors should be considered before choosing to
break an API.

Cost of Breaking an API

Breaking an API almost always has a non-trivial cost to the users of Spark.
A broken API means that Spark programs need to be rewritten before they can
be upgraded. However, there are a few considerations when thinking about
what the cost will be:

   - Usage - an API that is actively used in many different places is always
     very costly to break. While it is hard to know usage for sure, there are a
     few ways that we can estimate it:
      - How long has the API been in Spark?
      - Is the API common even for basic programs?
      - How often do we see recent questions in JIRA or on the mailing lists?
      - How often does it appear on StackOverflow or in blogs?

   - Behavior after the break - How will a program that works today behave
     after the break? The following are listed roughly in order of increasing
     severity (see the sketch after this list for an illustration):
      - Will there be a compiler or linker error?
      - Will there be a runtime exception?
      - Will that exception happen after significant processing has been done?
      - Will we silently return different answers? (very hard to debug; users
        might not even notice!)
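
As an illustration of this severity ordering, here is a minimal, purely
hypothetical Scala sketch (these are not real Spark APIs); it shows how a
change can surface as a loud compile error, a late runtime exception, or a
silent change in results:

    // Hypothetical APIs, in rough order of increasing breakage severity.
    object LegacyStats {
      // 1. Removing this method entirely would give callers a compile/link
      //    error: loud, and caught before anything runs.
      def mean(xs: Seq[Double]): Double = xs.sum / xs.size

      // 2. Keeping the signature but rejecting empty input turns the break
      //    into a runtime exception, possibly thrown only after a long job
      //    has already done significant work.
      def meanStrict(xs: Seq[Double]): Double = {
        require(xs.nonEmpty, "mean of an empty sequence")
        xs.sum / xs.size
      }

      // 3. Silently changing the result (0.0 instead of NaN for empty input)
      //    keeps programs running but returns different answers: the hardest
      //    case to notice and debug.
      def meanLenient(xs: Seq[Double]): Double =
        if (xs.isEmpty) 0.0 else xs.sum / xs.size
    }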


Cost of Maintaining an API

Of course, the above does not mean that we will never break any APIs. We
must also consider the cost both to the project and to our users of keeping
the API in question.

   - Project Costs - Every API we have needs to be tested and needs to keep
     working as other parts of the project change. These costs are
     significantly exacerbated when external dependencies change (the JVM,
     Scala, etc.). In some cases, while not technically infeasible, the cost of
     maintaining a particular API can become too high (a sketch of one way such
     compatibility checks can be automated follows this list).

   - User Costs - APIs also have a cognitive cost to users learning Spark or
     trying to understand Spark programs. This cost becomes even higher when
     the API in question has confusing or undefined semantics.
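
As a sketch of how the "keep working" part of this cost can be checked
automatically (assuming an sbt build with the MiMa binary-compatibility plugin
available; the artifact coordinates and the excluded method name below are
illustrative, not Spark's actual build configuration):

    // build.sbt (sketch): check the current API against a previously released artifact.
    import com.typesafe.tools.mima.core._

    mimaPreviousArtifacts := Set("org.apache.spark" %% "spark-core" % "2.4.5")

    // Every deliberate break has to be listed and justified explicitly,
    // which is part of the ongoing maintenance cost of each public API.
    mimaBinaryIssueFilters ++= Seq(
      ProblemFilters.exclude[DirectMissingMethodProblem](
        "org.apache.spark.SomeClass.removedMethod")
    )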


Alternatives to Breaking an API

In cases where there is a "Bad API", but where the cost of removal is also
high, there are alternatives that should be considered that do not hurt
existing users but do address some of the maintenance costs.


   - Avoid Bad APIs - While this is a bit obvious, it is an important point.
     Anytime we are adding a new interface to Spark we should consider that we
     might be stuck with this API forever. Think deeply about how new APIs
     relate to existing ones, as well as how you expect them to evolve over
     time.

   - Deprecation Warnings - All deprecation warnings should point to a clear
     alternative and should never just say that an API is deprecated (a minimal
     example follows this list).

   - Updated Docs - Documentation should point to the "best" recommended way
     of performing a given task. In the cases where we maintain legacy
     documentation, we should clearly point to newer APIs and suggest to users
     the "right" way.

   - Community Work - Many people learn Spark by reading blogs and other
     sites such as StackOverflow. However, many of these resources are out of
     date. Update them to reduce the cost of eventually removing deprecated
     APIs.
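
To illustrate the point about deprecation warnings, here is a minimal Scala
sketch (the method names are hypothetical, not real Spark APIs):

    object Example {
      // Unhelpful: says only that the API is deprecated, giving callers
      // nothing to act on.
      @deprecated("This method is deprecated", "3.0.0")
      def oldLoad(path: String): Unit = ()

      // Better: names a clear alternative, so callers know how to migrate.
      @deprecated("Use `newLoad(path, format)` instead", "3.0.0")
      def oldLoadCsv(path: String): Unit = newLoad(path, "csv")

      def newLoad(path: String, format: String): Unit = ()
    }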


</new policy>

Re: [VOTE] Amend Spark's Semantic Versioning Policy

Posted by Jules Damji <dm...@comcast.net>.
+1 (non-binding) 

Sent from my iPhone
Pardon the dumb thumb typos :)

> On Mar 6, 2020, at 7:09 PM, Sean Owen <sr...@gmail.com> wrote:
> 
> +1
> 
>> On Fri, Mar 6, 2020 at 8:59 PM Michael Armbrust <mi...@databricks.com> wrote:
>> 
>> I propose to add the following text to Spark's Semantic Versioning policy and adopt it as the rubric that should be used when deciding to break APIs (even at major versions such as 3.0).
>> 
>> 
>> I'll leave the vote open until Tuesday, March 10th at 2pm. As this is a procedural vote, the measure will pass if there are more favourable votes than unfavourable ones. PMC votes are binding, but the community is encouraged to add their voice to the discussion.
>> 
>> 
>> [ ] +1 - Spark should adopt this policy.
>> 
>> [ ] -1  - Spark should not adopt this policy.
>> 
>> 
>> <new policy>
>> 
>> 
>> Considerations When Breaking APIs
>> 
>> The Spark project strives to avoid breaking APIs or silently changing behavior, even at major versions. While this is not always possible, the balance of the following factors should be considered before choosing to break an API.
>> 
>> 
>> Cost of Breaking an API
>> 
>> Breaking an API almost always has a non-trivial cost to the users of Spark. A broken API means that Spark programs need to be rewritten before they can be upgraded. However, there are a few considerations when thinking about what the cost will be:
>> 
>> Usage - an API that is actively used in many different places, is always very costly to break. While it is hard to know usage for sure, there are a bunch of ways that we can estimate:
>> 
>> How long has the API been in Spark?
>> 
>> Is the API common even for basic programs?
>> 
>> How often do we see recent questions in JIRA or mailing lists?
>> 
>> How often does it appear in StackOverflow or blogs?
>> 
>> Behavior after the break - How will a program that works today, work after the break? The following are listed roughly in order of increasing severity:
>> 
>> Will there be a compiler or linker error?
>> 
>> Will there be a runtime exception?
>> 
>> Will that exception happen after significant processing has been done?
>> 
>> Will we silently return different answers? (very hard to debug, might not even notice!)
>> 
>> 
>> Cost of Maintaining an API
>> 
>> Of course, the above does not mean that we will never break any APIs. We must also consider the cost both to the project and to our users of keeping the API in question.
>> 
>> Project Costs - Every API we have needs to be tested and needs to keep working as other parts of the project changes. These costs are significantly exacerbated when external dependencies change (the JVM, Scala, etc). In some cases, while not completely technically infeasible, the cost of maintaining a particular API can become too high.
>> 
>> User Costs - APIs also have a cognitive cost to users learning Spark or trying to understand Spark programs. This cost becomes even higher when the API in question has confusing or undefined semantics.
>> 
>> 
>> Alternatives to Breaking an API
>> 
>> In cases where there is a "Bad API", but where the cost of removal is also high, there are alternatives that should be considered that do not hurt existing users but do address some of the maintenance costs.
>> 
>> 
>> Avoid Bad APIs - While this is a bit obvious, it is an important point. Anytime we are adding a new interface to Spark we should consider that we might be stuck with this API forever. Think deeply about how new APIs relate to existing ones, as well as how you expect them to evolve over time.
>> 
>> Deprecation Warnings - All deprecation warnings should point to a clear alternative and should never just say that an API is deprecated.
>> 
>> Updated Docs - Documentation should point to the "best" recommended way of performing a given task. In the cases where we maintain legacy documentation, we should clearly point to newer APIs and suggest to users the "right" way.
>> 
>> Community Work - Many people learn Spark by reading blogs and other sites such as StackOverflow. However, many of these resources are out of date. Update them, to reduce the cost of eventually removing deprecated APIs.
>> 
>> 
>> </new policy>
> 
> ---------------------------------------------------------------------
> To unsubscribe e-mail: dev-unsubscribe@spark.apache.org
> 


---------------------------------------------------------------------
To unsubscribe e-mail: dev-unsubscribe@spark.apache.org


Re: [VOTE] Amend Spark's Semantic Versioning Policy

Posted by Sean Owen <sr...@gmail.com>.
+1

On Fri, Mar 6, 2020 at 8:59 PM Michael Armbrust <mi...@databricks.com> wrote:
>
> I propose to add the following text to Spark's Semantic Versioning policy and adopt it as the rubric that should be used when deciding to break APIs (even at major versions such as 3.0).
>
>
> I'll leave the vote open until Tuesday, March 10th at 2pm. As this is a procedural vote, the measure will pass if there are more favourable votes than unfavourable ones. PMC votes are binding, but the community is encouraged to add their voice to the discussion.
>
>
> [ ] +1 - Spark should adopt this policy.
>
> [ ] -1  - Spark should not adopt this policy.
>
>
> <new policy>
>
>
> Considerations When Breaking APIs
>
> The Spark project strives to avoid breaking APIs or silently changing behavior, even at major versions. While this is not always possible, the balance of the following factors should be considered before choosing to break an API.
>
>
> Cost of Breaking an API
>
> Breaking an API almost always has a non-trivial cost to the users of Spark. A broken API means that Spark programs need to be rewritten before they can be upgraded. However, there are a few considerations when thinking about what the cost will be:
>
> Usage - an API that is actively used in many different places, is always very costly to break. While it is hard to know usage for sure, there are a bunch of ways that we can estimate:
>
> How long has the API been in Spark?
>
> Is the API common even for basic programs?
>
> How often do we see recent questions in JIRA or mailing lists?
>
> How often does it appear in StackOverflow or blogs?
>
> Behavior after the break - How will a program that works today, work after the break? The following are listed roughly in order of increasing severity:
>
> Will there be a compiler or linker error?
>
> Will there be a runtime exception?
>
> Will that exception happen after significant processing has been done?
>
> Will we silently return different answers? (very hard to debug, might not even notice!)
>
>
> Cost of Maintaining an API
>
> Of course, the above does not mean that we will never break any APIs. We must also consider the cost both to the project and to our users of keeping the API in question.
>
> Project Costs - Every API we have needs to be tested and needs to keep working as other parts of the project changes. These costs are significantly exacerbated when external dependencies change (the JVM, Scala, etc). In some cases, while not completely technically infeasible, the cost of maintaining a particular API can become too high.
>
> User Costs - APIs also have a cognitive cost to users learning Spark or trying to understand Spark programs. This cost becomes even higher when the API in question has confusing or undefined semantics.
>
>
> Alternatives to Breaking an API
>
> In cases where there is a "Bad API", but where the cost of removal is also high, there are alternatives that should be considered that do not hurt existing users but do address some of the maintenance costs.
>
>
> Avoid Bad APIs - While this is a bit obvious, it is an important point. Anytime we are adding a new interface to Spark we should consider that we might be stuck with this API forever. Think deeply about how new APIs relate to existing ones, as well as how you expect them to evolve over time.
>
> Deprecation Warnings - All deprecation warnings should point to a clear alternative and should never just say that an API is deprecated.
>
> Updated Docs - Documentation should point to the "best" recommended way of performing a given task. In the cases where we maintain legacy documentation, we should clearly point to newer APIs and suggest to users the "right" way.
>
> Community Work - Many people learn Spark by reading blogs and other sites such as StackOverflow. However, many of these resources are out of date. Update them, to reduce the cost of eventually removing deprecated APIs.
>
>
> </new policy>

---------------------------------------------------------------------
To unsubscribe e-mail: dev-unsubscribe@spark.apache.org


Re: [VOTE] Amend Spark's Semantic Versioning Policy

Posted by Dongjoon Hyun <do...@gmail.com>.
+1 (binding)

I also assume that the implementation of the proposal will be executed
carefully, case by case, with enough open discussion.

Thanks,
Dongjoon.

On Mon, Mar 9, 2020 at 5:20 PM Holden Karau <ho...@pigscanfly.ca> wrote:

> +1 (binding) on the original proposal.
>
> On Mon, Mar 9, 2020 at 1:32 PM Michael Heuer <he...@gmail.com> wrote:
>
>> +1 (non-binding)
>>
>> I am disappointed however that this only mentions API and not
>> dependencies and transitive dependencies.
>>
> I think upgrading dependencies continues to be reasonable.
>
>>
>> As Spark does not provide separation between its runtime classpath and
>> the classpath used by applications, I believe Spark's dependencies and
>> transitive dependencies should be considered part of the API for this
>> policy.  Breaking dependency upgrades and incompatible dependency versions
>> are the source of much frustration.
>>
> I myself have also faced this frustration. I believe we've increased some
> shading to help here. Are there specific pain points you've experienced?
> Maybe we can factor this discussion into another thread.
>
>>
>>
>
>>    michael
>>
>>
>> On Mar 9, 2020, at 2:16 PM, Takuya UESHIN <ue...@happy-camper.st> wrote:
>>
>> +1 (binding)
>>
>>
>> On Mon, Mar 9, 2020 at 11:49 AM Xingbo Jiang <ji...@gmail.com>
>> wrote:
>>
>>> +1 (non-binding)
>>>
>>> Cheers,
>>>
>>> Xingbo
>>>
>>> On Mon, Mar 9, 2020 at 9:35 AM Xiao Li <li...@databricks.com> wrote:
>>>
>>>> +1 (binding)
>>>>
>>>> Xiao
>>>>
>>>> On Mon, Mar 9, 2020 at 8:33 AM Denny Lee <de...@gmail.com> wrote:
>>>>
>>>>> +1 (non-binding)
>>>>>
>>>>> On Mon, Mar 9, 2020 at 1:59 AM Hyukjin Kwon <gu...@gmail.com>
>>>>> wrote:
>>>>>
>>>>>> The proposal itself seems good as the factors to consider, Thanks
>>>>>> Michael.
>>>>>>
>>>>>> Several concerns mentioned look good points, in particular:
>>>>>>
>>>>>> > ... assuming that this is for public stable APIs, not APIs that are
>>>>>> marked as unstable, evolving, etc. ...
>>>>>> I would like to confirm this. We already have API annotations such as
>>>>>> Experimental, Unstable, etc. and the implication of each is still
>>>>>> effective. If it's for stable APIs, it makes sense to me as well.
>>>>>>
>>>>>> > ... can we expand on 'when' an API change can occur ?  Since we are
>>>>>> proposing to diverge from semver. ...
>>>>>> I think this is a good point. If we're proposing to divert
>>>>>> from semver, the delta compared to semver will have to be clarified to
>>>>>> avoid different personal interpretations of the somewhat general principles.
>>>>>>
>>>>>> > ... can we narrow down on the migration from Apache Spark 2.4.5 to
>>>>>> Apache Spark 3.0+? ...
>>>>>>
>>>>>> Assuming these concerns will be addressed, +1 (binding).
>>>>>>
>>>>>>
>>>>>> On Mon, Mar 9, 2020 at 4:53 PM, Takeshi Yamamuro <li...@gmail.com> wrote:
>>>>>>
>>>>>>> +1 (non-binding)
>>>>>>>
>>>>>>> Bests,
>>>>>>> Takeshi
>>>>>>>
>>>>>>> On Mon, Mar 9, 2020 at 4:52 PM Gengliang Wang <
>>>>>>> gengliang.wang@databricks.com> wrote:
>>>>>>>
>>>>>>>> +1 (non-binding)
>>>>>>>>
>>>>>>>> Gengliang
>>>>>>>>
>>>>>>>> On Mon, Mar 9, 2020 at 12:22 AM Matei Zaharia <
>>>>>>>> matei.zaharia@gmail.com> wrote:
>>>>>>>>
>>>>>>>>> +1 as well.
>>>>>>>>>
>>>>>>>>> Matei
>>>>>>>>>
>>>>>>>>> On Mar 9, 2020, at 12:05 AM, Wenchen Fan <cl...@gmail.com>
>>>>>>>>> wrote:
>>>>>>>>>
>>>>>>>>> +1 (binding), assuming that this is for public stable APIs, not
>>>>>>>>> APIs that are marked as unstable, evolving, etc.
>>>>>>>>>
>>>>>>>>> On Mon, Mar 9, 2020 at 1:10 AM Ismaël Mejía <ie...@gmail.com>
>>>>>>>>> wrote:
>>>>>>>>>
>>>>>>>>>> +1 (non-binding)
>>>>>>>>>>
>>>>>>>>>> Michael's section on the trade-offs of maintaining / removing an
>>>>>>>>>> API are one of
>>>>>>>>>> the best reads I have seeing in this mailing list. Enthusiast +1
>>>>>>>>>>
>>>>>>>>>> On Sat, Mar 7, 2020 at 8:28 PM Dongjoon Hyun <
>>>>>>>>>> dongjoon.hyun@gmail.com> wrote:
>>>>>>>>>> >
>>>>>>>>>> > This new policy has a good indention, but can we narrow down on
>>>>>>>>>> the migration from Apache Spark 2.4.5 to Apache Spark 3.0+?
>>>>>>>>>> >
>>>>>>>>>> > I saw that there already exists a reverting PR to bring back
>>>>>>>>>> Spark 1.4 and 1.5 APIs based on this AS-IS suggestion.
>>>>>>>>>> >
>>>>>>>>>> > The AS-IS policy is clearly mentioning that JVM/Scala-level
>>>>>>>>>> difficulty, and it's nice.
>>>>>>>>>> >
>>>>>>>>>> > However, for the other cases, it sounds like `recommending
>>>>>>>>>> older APIs as much as possible` due to the following.
>>>>>>>>>> >
>>>>>>>>>> >      > How long has the API been in Spark?
>>>>>>>>>> >
>>>>>>>>>> > We had better be more careful when we add a new policy and
>>>>>>>>>> should aim not to mislead the users and 3rd party library developers to say
>>>>>>>>>> "older is better".
>>>>>>>>>> >
>>>>>>>>>> > Technically, I'm wondering who will use new APIs in their
>>>>>>>>>> examples (of books and StackOverflow) if they need to write an additional
>>>>>>>>>> warning like `this only works at 2.4.0+` always .
>>>>>>>>>> >
>>>>>>>>>> > Bests,
>>>>>>>>>> > Dongjoon.
>>>>>>>>>> >
>>>>>>>>>> > On Fri, Mar 6, 2020 at 7:10 PM Mridul Muralidharan <
>>>>>>>>>> mridul@gmail.com> wrote:
>>>>>>>>>> >>
>>>>>>>>>> >> I am in broad agreement with the prposal, as any developer, I
>>>>>>>>>> prefer
>>>>>>>>>> >> stable well designed API's :-)
>>>>>>>>>> >>
>>>>>>>>>> >> Can we tie the proposal to stability guarantees given by spark
>>>>>>>>>> and
>>>>>>>>>> >> reasonable expectation from users ?
>>>>>>>>>> >> In my opinion, an unstable or evolving could change - while an
>>>>>>>>>> >> experimental api which has been around for ages should be more
>>>>>>>>>> >> conservatively handled.
>>>>>>>>>> >> Which brings in question what are the stability guarantees as
>>>>>>>>>> >> specified by annotations interacting with the proposal.
>>>>>>>>>> >>
>>>>>>>>>> >> Also, can we expand on 'when' an API change can occur ?  Since
>>>>>>>>>> we are
>>>>>>>>>> >> proposing to diverge from semver.
>>>>>>>>>> >> Patch release ? Minor release ? Only major release ? Based on
>>>>>>>>>> 'impact'
>>>>>>>>>> >> of API ? Stability guarantees ?
>>>>>>>>>> >>
>>>>>>>>>> >> Regards,
>>>>>>>>>> >> Mridul
>>>>>>>>>> >>
>>>>>>>>>> >>
>>>>>>>>>> >>
>>>>>>>>>> >> On Fri, Mar 6, 2020 at 7:01 PM Michael Armbrust <
>>>>>>>>>> michael@databricks.com> wrote:
>>>>>>>>>> >> >
>>>>>>>>>> >> > I'll start off the vote with a strong +1 (binding).
>>>>>>>>>> >> >
>>>>>>>>>> >> > On Fri, Mar 6, 2020 at 1:01 PM Michael Armbrust <
>>>>>>>>>> michael@databricks.com> wrote:
>>>>>>>>>> >> >>
>>>>>>>>>> >> >> I propose to add the following text to Spark's Semantic
>>>>>>>>>> Versioning policy and adopt it as the rubric that should be used when
>>>>>>>>>> deciding to break APIs (even at major versions such as 3.0).
>>>>>>>>>> >> >>
>>>>>>>>>> >> >>
>>>>>>>>>> >> >> I'll leave the vote open until Tuesday, March 10th at 2pm.
>>>>>>>>>> As this is a procedural vote, the measure will pass if there are more
>>>>>>>>>> favourable votes than unfavourable ones. PMC votes are binding, but the
>>>>>>>>>> community is encouraged to add their voice to the discussion.
>>>>>>>>>> >> >>
>>>>>>>>>> >> >>
>>>>>>>>>> >> >> [ ] +1 - Spark should adopt this policy.
>>>>>>>>>> >> >>
>>>>>>>>>> >> >> [ ] -1  - Spark should not adopt this policy.
>>>>>>>>>> >> >>
>>>>>>>>>> >> >>
>>>>>>>>>> >> >> <new policy>
>>>>>>>>>> >> >>
>>>>>>>>>> >> >>
>>>>>>>>>> >> >> Considerations When Breaking APIs
>>>>>>>>>> >> >>
>>>>>>>>>> >> >> The Spark project strives to avoid breaking APIs or
>>>>>>>>>> silently changing behavior, even at major versions. While this is not
>>>>>>>>>> always possible, the balance of the following factors should be considered
>>>>>>>>>> before choosing to break an API.
>>>>>>>>>> >> >>
>>>>>>>>>> >> >>
>>>>>>>>>> >> >> Cost of Breaking an API
>>>>>>>>>> >> >>
>>>>>>>>>> >> >> Breaking an API almost always has a non-trivial cost to the
>>>>>>>>>> users of Spark. A broken API means that Spark programs need to be rewritten
>>>>>>>>>> before they can be upgraded. However, there are a few considerations when
>>>>>>>>>> thinking about what the cost will be:
>>>>>>>>>> >> >>
>>>>>>>>>> >> >> Usage - an API that is actively used in many different
>>>>>>>>>> places, is always very costly to break. While it is hard to know usage for
>>>>>>>>>> sure, there are a bunch of ways that we can estimate:
>>>>>>>>>> >> >>
>>>>>>>>>> >> >> How long has the API been in Spark?
>>>>>>>>>> >> >>
>>>>>>>>>> >> >> Is the API common even for basic programs?
>>>>>>>>>> >> >>
>>>>>>>>>> >> >> How often do we see recent questions in JIRA or mailing
>>>>>>>>>> lists?
>>>>>>>>>> >> >>
>>>>>>>>>> >> >> How often does it appear in StackOverflow or blogs?
>>>>>>>>>> >> >>
>>>>>>>>>> >> >> Behavior after the break - How will a program that works
>>>>>>>>>> today, work after the break? The following are listed roughly in order of
>>>>>>>>>> increasing severity:
>>>>>>>>>> >> >>
>>>>>>>>>> >> >> Will there be a compiler or linker error?
>>>>>>>>>> >> >>
>>>>>>>>>> >> >> Will there be a runtime exception?
>>>>>>>>>> >> >>
>>>>>>>>>> >> >> Will that exception happen after significant processing has
>>>>>>>>>> been done?
>>>>>>>>>> >> >>
>>>>>>>>>> >> >> Will we silently return different answers? (very hard to
>>>>>>>>>> debug, might not even notice!)
>>>>>>>>>> >> >>
>>>>>>>>>> >> >>
>>>>>>>>>> >> >> Cost of Maintaining an API
>>>>>>>>>> >> >>
>>>>>>>>>> >> >> Of course, the above does not mean that we will never break
>>>>>>>>>> any APIs. We must also consider the cost both to the project and to our
>>>>>>>>>> users of keeping the API in question.
>>>>>>>>>> >> >>
>>>>>>>>>> >> >> Project Costs - Every API we have needs to be tested and
>>>>>>>>>> needs to keep working as other parts of the project changes. These costs
>>>>>>>>>> are significantly exacerbated when external dependencies change (the JVM,
>>>>>>>>>> Scala, etc). In some cases, while not completely technically infeasible,
>>>>>>>>>> the cost of maintaining a particular API can become too high.
>>>>>>>>>> >> >>
>>>>>>>>>> >> >> User Costs - APIs also have a cognitive cost to users
>>>>>>>>>> learning Spark or trying to understand Spark programs. This cost becomes
>>>>>>>>>> even higher when the API in question has confusing or undefined semantics.
>>>>>>>>>> >> >>
>>>>>>>>>> >> >>
>>>>>>>>>> >> >> Alternatives to Breaking an API
>>>>>>>>>> >> >>
>>>>>>>>>> >> >> In cases where there is a "Bad API", but where the cost of
>>>>>>>>>> removal is also high, there are alternatives that should be considered that
>>>>>>>>>> do not hurt existing users but do address some of the maintenance costs.
>>>>>>>>>> >> >>
>>>>>>>>>> >> >>
>>>>>>>>>> >> >> Avoid Bad APIs - While this is a bit obvious, it is an
>>>>>>>>>> important point. Anytime we are adding a new interface to Spark we should
>>>>>>>>>> consider that we might be stuck with this API forever. Think deeply about
>>>>>>>>>> how new APIs relate to existing ones, as well as how you expect them to
>>>>>>>>>> evolve over time.
>>>>>>>>>> >> >>
>>>>>>>>>> >> >> Deprecation Warnings - All deprecation warnings should
>>>>>>>>>> point to a clear alternative and should never just say that an API is
>>>>>>>>>> deprecated.
>>>>>>>>>> >> >>
>>>>>>>>>> >> >> Updated Docs - Documentation should point to the "best"
>>>>>>>>>> recommended way of performing a given task. In the cases where we maintain
>>>>>>>>>> legacy documentation, we should clearly point to newer APIs and suggest to
>>>>>>>>>> users the "right" way.
>>>>>>>>>> >> >>
>>>>>>>>>> >> >> Community Work - Many people learn Spark by reading blogs
>>>>>>>>>> and other sites such as StackOverflow. However, many of these resources are
>>>>>>>>>> out of date. Update them, to reduce the cost of eventually removing
>>>>>>>>>> deprecated APIs.
>>>>>>>>>> >> >>
>>>>>>>>>> >> >>
>>>>>>>>>> >> >> </new policy>
>>>>>>>>>> >>
>>>>>>>>>> >>
>>>>>>>>>> ---------------------------------------------------------------------
>>>>>>>>>> >> To unsubscribe e-mail: dev-unsubscribe@spark.apache.org
>>>>>>>>>> >>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> ---------------------------------------------------------------------
>>>>>>>>>> To unsubscribe e-mail: dev-unsubscribe@spark.apache.org
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>
>>>>>>>
>>>>>>> --
>>>>>>> ---
>>>>>>> Takeshi Yamamuro
>>>>>>>
>>>>>>
>>>>
>>>> --
>>>> <https://databricks.com/sparkaisummit/north-america>
>>>>
>>>
>>
>> --
>> Takuya UESHIN
>>
>> http://twitter.com/ueshin
>>
>>
>>
>
> --
> Twitter: https://twitter.com/holdenkarau
> Books (Learning Spark, High Performance Spark, etc.):
> https://amzn.to/2MaRAG9  <https://amzn.to/2MaRAG9>
> YouTube Live Streams: https://www.youtube.com/user/holdenkarau
>

Re: [VOTE] Amend Spark's Semantic Versioning Policy

Posted by Holden Karau <ho...@pigscanfly.ca>.
+1 (binding) on the original proposal.

On Mon, Mar 9, 2020 at 1:32 PM Michael Heuer <he...@gmail.com> wrote:

> +1 (non-binding)
>
> I am disappointed however that this only mentions API and not dependencies
> and transitive dependencies.
>
I think upgrading dependencies continues to be reasonable.

>
> As Spark does not provide separation between its runtime classpath and the
> classpath used by applications, I believe Spark's dependencies and
> transitive dependencies should be considered part of the API for this
> policy.  Breaking dependency upgrades and incompatible dependency versions
> are the source of much frustration.
>
I myself have also faced this frustration. I believe we've increased some
shading to help here. Are there specific pain points you've experienced?
Maybe we can factor this discussion into another thread.

>
>

>    michael
>
>
> On Mar 9, 2020, at 2:16 PM, Takuya UESHIN <ue...@happy-camper.st> wrote:
>
> +1 (binding)
>
>
> On Mon, Mar 9, 2020 at 11:49 AM Xingbo Jiang <ji...@gmail.com>
> wrote:
>
>> +1 (non-binding)
>>
>> Cheers,
>>
>> Xingbo
>>
>> On Mon, Mar 9, 2020 at 9:35 AM Xiao Li <li...@databricks.com> wrote:
>>
>>> +1 (binding)
>>>
>>> Xiao
>>>
>>> On Mon, Mar 9, 2020 at 8:33 AM Denny Lee <de...@gmail.com> wrote:
>>>
>>>> +1 (non-binding)
>>>>
>>>> On Mon, Mar 9, 2020 at 1:59 AM Hyukjin Kwon <gu...@gmail.com>
>>>> wrote:
>>>>
>>>>> The proposal itself seems good as the factors to consider, Thanks
>>>>> Michael.
>>>>>
>>>>> Several concerns mentioned look good points, in particular:
>>>>>
>>>>> > ... assuming that this is for public stable APIs, not APIs that are
>>>>> marked as unstable, evolving, etc. ...
>>>>> I would like to confirm this. We already have API annotations such as
>>>>> Experimental, Unstable, etc. and the implication of each is still
>>>>> effective. If it's for stable APIs, it makes sense to me as well.
>>>>>
>>>>> > ... can we expand on 'when' an API change can occur ?  Since we are
>>>>> proposing to diverge from semver. ...
>>>>> I think this is a good point. If we're proposing to divert
>>>>> from semver, the delta compared to semver will have to be clarified to
>>>>> avoid different personal interpretations of the somewhat general principles.
>>>>>
>>>>> > ... can we narrow down on the migration from Apache Spark 2.4.5 to
>>>>> Apache Spark 3.0+? ...
>>>>>
>>>>> Assuming these concerns will be addressed, +1 (binding).
>>>>>
>>>>>
>>>>> On Mon, Mar 9, 2020 at 4:53 PM, Takeshi Yamamuro <li...@gmail.com> wrote:
>>>>>
>>>>>> +1 (non-binding)
>>>>>>
>>>>>> Bests,
>>>>>> Takeshi
>>>>>>
>>>>>> On Mon, Mar 9, 2020 at 4:52 PM Gengliang Wang <
>>>>>> gengliang.wang@databricks.com> wrote:
>>>>>>
>>>>>>> +1 (non-binding)
>>>>>>>
>>>>>>> Gengliang
>>>>>>>
>>>>>>> On Mon, Mar 9, 2020 at 12:22 AM Matei Zaharia <
>>>>>>> matei.zaharia@gmail.com> wrote:
>>>>>>>
>>>>>>>> +1 as well.
>>>>>>>>
>>>>>>>> Matei
>>>>>>>>
>>>>>>>> On Mar 9, 2020, at 12:05 AM, Wenchen Fan <cl...@gmail.com>
>>>>>>>> wrote:
>>>>>>>>
>>>>>>>> +1 (binding), assuming that this is for public stable APIs, not
>>>>>>>> APIs that are marked as unstable, evolving, etc.
>>>>>>>>
>>>>>>>> On Mon, Mar 9, 2020 at 1:10 AM Ismaël Mejía <ie...@gmail.com>
>>>>>>>> wrote:
>>>>>>>>
>>>>>>>>> +1 (non-binding)
>>>>>>>>>
>>>>>>>>> Michael's section on the trade-offs of maintaining / removing an
>>>>>>>>> API are one of
>>>>>>>>> the best reads I have seeing in this mailing list. Enthusiast +1
>>>>>>>>>
>>>>>>>>> On Sat, Mar 7, 2020 at 8:28 PM Dongjoon Hyun <
>>>>>>>>> dongjoon.hyun@gmail.com> wrote:
>>>>>>>>> >
>>>>>>>>> > This new policy has a good indention, but can we narrow down on
>>>>>>>>> the migration from Apache Spark 2.4.5 to Apache Spark 3.0+?
>>>>>>>>> >
>>>>>>>>> > I saw that there already exists a reverting PR to bring back
>>>>>>>>> Spark 1.4 and 1.5 APIs based on this AS-IS suggestion.
>>>>>>>>> >
>>>>>>>>> > The AS-IS policy is clearly mentioning that JVM/Scala-level
>>>>>>>>> difficulty, and it's nice.
>>>>>>>>> >
>>>>>>>>> > However, for the other cases, it sounds like `recommending older
>>>>>>>>> APIs as much as possible` due to the following.
>>>>>>>>> >
>>>>>>>>> >      > How long has the API been in Spark?
>>>>>>>>> >
>>>>>>>>> > We had better be more careful when we add a new policy and
>>>>>>>>> should aim not to mislead the users and 3rd party library developers to say
>>>>>>>>> "older is better".
>>>>>>>>> >
>>>>>>>>> > Technically, I'm wondering who will use new APIs in their
>>>>>>>>> examples (of books and StackOverflow) if they need to write an additional
>>>>>>>>> warning like `this only works at 2.4.0+` always .
>>>>>>>>> >
>>>>>>>>> > Bests,
>>>>>>>>> > Dongjoon.
>>>>>>>>> >
>>>>>>>>> > On Fri, Mar 6, 2020 at 7:10 PM Mridul Muralidharan <
>>>>>>>>> mridul@gmail.com> wrote:
>>>>>>>>> >>
>>>>>>>>> >> I am in broad agreement with the prposal, as any developer, I
>>>>>>>>> prefer
>>>>>>>>> >> stable well designed API's :-)
>>>>>>>>> >>
>>>>>>>>> >> Can we tie the proposal to stability guarantees given by spark
>>>>>>>>> and
>>>>>>>>> >> reasonable expectation from users ?
>>>>>>>>> >> In my opinion, an unstable or evolving could change - while an
>>>>>>>>> >> experimental api which has been around for ages should be more
>>>>>>>>> >> conservatively handled.
>>>>>>>>> >> Which brings in question what are the stability guarantees as
>>>>>>>>> >> specified by annotations interacting with the proposal.
>>>>>>>>> >>
>>>>>>>>> >> Also, can we expand on 'when' an API change can occur ?  Since
>>>>>>>>> we are
>>>>>>>>> >> proposing to diverge from semver.
>>>>>>>>> >> Patch release ? Minor release ? Only major release ? Based on
>>>>>>>>> 'impact'
>>>>>>>>> >> of API ? Stability guarantees ?
>>>>>>>>> >>
>>>>>>>>> >> Regards,
>>>>>>>>> >> Mridul
>>>>>>>>> >>
>>>>>>>>> >>
>>>>>>>>> >>
>>>>>>>>> >> On Fri, Mar 6, 2020 at 7:01 PM Michael Armbrust <
>>>>>>>>> michael@databricks.com> wrote:
>>>>>>>>> >> >
>>>>>>>>> >> > I'll start off the vote with a strong +1 (binding).
>>>>>>>>> >> >
>>>>>>>>> >> > On Fri, Mar 6, 2020 at 1:01 PM Michael Armbrust <
>>>>>>>>> michael@databricks.com> wrote:
>>>>>>>>> >> >>
>>>>>>>>> >> >> I propose to add the following text to Spark's Semantic
>>>>>>>>> Versioning policy and adopt it as the rubric that should be used when
>>>>>>>>> deciding to break APIs (even at major versions such as 3.0).
>>>>>>>>> >> >>
>>>>>>>>> >> >>
>>>>>>>>> >> >> I'll leave the vote open until Tuesday, March 10th at 2pm.
>>>>>>>>> As this is a procedural vote, the measure will pass if there are more
>>>>>>>>> favourable votes than unfavourable ones. PMC votes are binding, but the
>>>>>>>>> community is encouraged to add their voice to the discussion.
>>>>>>>>> >> >>
>>>>>>>>> >> >>
>>>>>>>>> >> >> [ ] +1 - Spark should adopt this policy.
>>>>>>>>> >> >>
>>>>>>>>> >> >> [ ] -1  - Spark should not adopt this policy.
>>>>>>>>> >> >>
>>>>>>>>> >> >>
>>>>>>>>> >> >> <new policy>
>>>>>>>>> >> >>
>>>>>>>>> >> >>
>>>>>>>>> >> >> Considerations When Breaking APIs
>>>>>>>>> >> >>
>>>>>>>>> >> >> The Spark project strives to avoid breaking APIs or silently
>>>>>>>>> changing behavior, even at major versions. While this is not always
>>>>>>>>> possible, the balance of the following factors should be considered before
>>>>>>>>> choosing to break an API.
>>>>>>>>> >> >>
>>>>>>>>> >> >>
>>>>>>>>> >> >> Cost of Breaking an API
>>>>>>>>> >> >>
>>>>>>>>> >> >> Breaking an API almost always has a non-trivial cost to the
>>>>>>>>> users of Spark. A broken API means that Spark programs need to be rewritten
>>>>>>>>> before they can be upgraded. However, there are a few considerations when
>>>>>>>>> thinking about what the cost will be:
>>>>>>>>> >> >>
>>>>>>>>> >> >> Usage - an API that is actively used in many different
>>>>>>>>> places, is always very costly to break. While it is hard to know usage for
>>>>>>>>> sure, there are a bunch of ways that we can estimate:
>>>>>>>>> >> >>
>>>>>>>>> >> >> How long has the API been in Spark?
>>>>>>>>> >> >>
>>>>>>>>> >> >> Is the API common even for basic programs?
>>>>>>>>> >> >>
>>>>>>>>> >> >> How often do we see recent questions in JIRA or mailing
>>>>>>>>> lists?
>>>>>>>>> >> >>
>>>>>>>>> >> >> How often does it appear in StackOverflow or blogs?
>>>>>>>>> >> >>
>>>>>>>>> >> >> Behavior after the break - How will a program that works
>>>>>>>>> today, work after the break? The following are listed roughly in order of
>>>>>>>>> increasing severity:
>>>>>>>>> >> >>
>>>>>>>>> >> >> Will there be a compiler or linker error?
>>>>>>>>> >> >>
>>>>>>>>> >> >> Will there be a runtime exception?
>>>>>>>>> >> >>
>>>>>>>>> >> >> Will that exception happen after significant processing has
>>>>>>>>> been done?
>>>>>>>>> >> >>
>>>>>>>>> >> >> Will we silently return different answers? (very hard to
>>>>>>>>> debug, might not even notice!)
>>>>>>>>> >> >>
>>>>>>>>> >> >>
>>>>>>>>> >> >> Cost of Maintaining an API
>>>>>>>>> >> >>
>>>>>>>>> >> >> Of course, the above does not mean that we will never break
>>>>>>>>> any APIs. We must also consider the cost both to the project and to our
>>>>>>>>> users of keeping the API in question.
>>>>>>>>> >> >>
>>>>>>>>> >> >> Project Costs - Every API we have needs to be tested and
>>>>>>>>> needs to keep working as other parts of the project changes. These costs
>>>>>>>>> are significantly exacerbated when external dependencies change (the JVM,
>>>>>>>>> Scala, etc). In some cases, while not completely technically infeasible,
>>>>>>>>> the cost of maintaining a particular API can become too high.
>>>>>>>>> >> >>
>>>>>>>>> >> >> User Costs - APIs also have a cognitive cost to users
>>>>>>>>> learning Spark or trying to understand Spark programs. This cost becomes
>>>>>>>>> even higher when the API in question has confusing or undefined semantics.
>>>>>>>>> >> >>
>>>>>>>>> >> >>
>>>>>>>>> >> >> Alternatives to Breaking an API
>>>>>>>>> >> >>
>>>>>>>>> >> >> In cases where there is a "Bad API", but where the cost of
>>>>>>>>> removal is also high, there are alternatives that should be considered that
>>>>>>>>> do not hurt existing users but do address some of the maintenance costs.
>>>>>>>>> >> >>
>>>>>>>>> >> >>
>>>>>>>>> >> >> Avoid Bad APIs - While this is a bit obvious, it is an
>>>>>>>>> important point. Anytime we are adding a new interface to Spark we should
>>>>>>>>> consider that we might be stuck with this API forever. Think deeply about
>>>>>>>>> how new APIs relate to existing ones, as well as how you expect them to
>>>>>>>>> evolve over time.
>>>>>>>>> >> >>
>>>>>>>>> >> >> Deprecation Warnings - All deprecation warnings should point
>>>>>>>>> to a clear alternative and should never just say that an API is deprecated.
>>>>>>>>> >> >>
>>>>>>>>> >> >> Updated Docs - Documentation should point to the "best"
>>>>>>>>> recommended way of performing a given task. In the cases where we maintain
>>>>>>>>> legacy documentation, we should clearly point to newer APIs and suggest to
>>>>>>>>> users the "right" way.
>>>>>>>>> >> >>
>>>>>>>>> >> >> Community Work - Many people learn Spark by reading blogs
>>>>>>>>> and other sites such as StackOverflow. However, many of these resources are
>>>>>>>>> out of date. Update them, to reduce the cost of eventually removing
>>>>>>>>> deprecated APIs.
>>>>>>>>> >> >>
>>>>>>>>> >> >>
>>>>>>>>> >> >> </new policy>
>>>>>>>>> >>
>>>>>>>>> >>
>>>>>>>>> ---------------------------------------------------------------------
>>>>>>>>> >> To unsubscribe e-mail: dev-unsubscribe@spark.apache.org
>>>>>>>>> >>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> ---------------------------------------------------------------------
>>>>>>>>> To unsubscribe e-mail: dev-unsubscribe@spark.apache.org
>>>>>>>>>
>>>>>>>>>
>>>>>>>>
>>>>>>
>>>>>> --
>>>>>> ---
>>>>>> Takeshi Yamamuro
>>>>>>
>>>>>
>>>
>>> --
>>> <https://databricks.com/sparkaisummit/north-america>
>>>
>>
>
> --
> Takuya UESHIN
>
> http://twitter.com/ueshin
>
>
>

-- 
Twitter: https://twitter.com/holdenkarau
Books (Learning Spark, High Performance Spark, etc.):
https://amzn.to/2MaRAG9  <https://amzn.to/2MaRAG9>
YouTube Live Streams: https://www.youtube.com/user/holdenkarau

Re: [VOTE] Amend Spark's Semantic Versioning Policy

Posted by Burak Yavuz <br...@gmail.com>.
+1

On Mon, Mar 9, 2020 at 4:55 PM Reynold Xin <rx...@databricks.com> wrote:

> +1
>
>
>
> On Mon, Mar 09, 2020 at 3:53 PM, John Zhuge <jz...@apache.org> wrote:
>
>> +1 (non-binding)
>>
>> On Mon, Mar 9, 2020 at 1:32 PM Michael Heuer <he...@gmail.com> wrote:
>>
>>> +1 (non-binding)
>>>
>>> I am disappointed however that this only mentions API and not
>>> dependencies and transitive dependencies.
>>>
>>> As Spark does not provide separation between its runtime classpath and
>>> the classpath used by applications, I believe Spark's dependencies and
>>> transitive dependencies should be considered part of the API for this
>>> policy.  Breaking dependency upgrades and incompatible dependency versions
>>> are the source of much frustration.
>>>
>>>    michael
>>>
>>>
>>> On Mar 9, 2020, at 2:16 PM, Takuya UESHIN <ue...@happy-camper.st>
>>> wrote:
>>>
>>> +1 (binding)
>>>
>>>
>>> On Mon, Mar 9, 2020 at 11:49 AM Xingbo Jiang <ji...@gmail.com>
>>> wrote:
>>>
>>>> +1 (non-binding)
>>>>
>>>> Cheers,
>>>>
>>>> Xingbo
>>>>
>>>> On Mon, Mar 9, 2020 at 9:35 AM Xiao Li <li...@databricks.com> wrote:
>>>>
>>>>> +1 (binding)
>>>>>
>>>>> Xiao
>>>>>
>>>>> On Mon, Mar 9, 2020 at 8:33 AM Denny Lee <de...@gmail.com>
>>>>> wrote:
>>>>>
>>>>>> +1 (non-binding)
>>>>>>
>>>>>> On Mon, Mar 9, 2020 at 1:59 AM Hyukjin Kwon <gu...@gmail.com>
>>>>>> wrote:
>>>>>>
>>>>>>> The proposal itself seems good as the factors to consider, Thanks
>>>>>>> Michael.
>>>>>>>
>>>>>>> Several concerns mentioned look good points, in particular:
>>>>>>>
>>>>>>> > ... assuming that this is for public stable APIs, not APIs that
>>>>>>> are marked as unstable, evolving, etc. ...
>>>>>>> I would like to confirm this. We already have API annotations such
>>>>>>> as Experimental, Unstable, etc. and the implication of each is still
>>>>>>> effective. If it's for stable APIs, it makes sense to me as well.
>>>>>>>
>>>>>>> > ... can we expand on 'when' an API change can occur ?  Since we
>>>>>>> are proposing to diverge from semver. ...
>>>>>>> I think this is a good point. If we're proposing to divert
>>>>>>> from semver, the delta compared to semver will have to be clarified to
>>>>>>> avoid different personal interpretations of the somewhat general principles.
>>>>>>>
>>>>>>> > ... can we narrow down on the migration from Apache Spark 2.4.5 to
>>>>>>> Apache Spark 3.0+? ...
>>>>>>>
>>>>>>> Assuming these concerns will be addressed, +1 (binding).
>>>>>>>
>>>>>>>
>>>>>>> On Mon, Mar 9, 2020 at 4:53 PM, Takeshi Yamamuro <li...@gmail.com> wrote:
>>>>>>>
>>>>>>>> +1 (non-binding)
>>>>>>>>
>>>>>>>> Bests,
>>>>>>>> Takeshi
>>>>>>>>
>>>>>>>> On Mon, Mar 9, 2020 at 4:52 PM Gengliang Wang <
>>>>>>>> gengliang.wang@databricks.com> wrote:
>>>>>>>>
>>>>>>>>> +1 (non-binding)
>>>>>>>>>
>>>>>>>>> Gengliang
>>>>>>>>>
>>>>>>>>> On Mon, Mar 9, 2020 at 12:22 AM Matei Zaharia <
>>>>>>>>> matei.zaharia@gmail.com> wrote:
>>>>>>>>>
>>>>>>>>>> +1 as well.
>>>>>>>>>>
>>>>>>>>>> Matei
>>>>>>>>>>
>>>>>>>>>> On Mar 9, 2020, at 12:05 AM, Wenchen Fan <cl...@gmail.com>
>>>>>>>>>> wrote:
>>>>>>>>>>
>>>>>>>>>> +1 (binding), assuming that this is for public stable APIs, not
>>>>>>>>>> APIs that are marked as unstable, evolving, etc.
>>>>>>>>>>
>>>>>>>>>> On Mon, Mar 9, 2020 at 1:10 AM Ismaël Mejía <ie...@gmail.com>
>>>>>>>>>> wrote:
>>>>>>>>>>
>>>>>>>>>>> +1 (non-binding)
>>>>>>>>>>>
>>>>>>>>>>> Michael's section on the trade-offs of maintaining / removing an
>>>>>>>>>>> API are one of
>>>>>>>>>>> the best reads I have seeing in this mailing list. Enthusiast +1
>>>>>>>>>>>
>>>>>>>>>>> On Sat, Mar 7, 2020 at 8:28 PM Dongjoon Hyun <
>>>>>>>>>>> dongjoon.hyun@gmail.com> wrote:
>>>>>>>>>>> >
>>>>>>>>>>> > This new policy has a good indention, but can we narrow down
>>>>>>>>>>> on the migration from Apache Spark 2.4.5 to Apache Spark 3.0+?
>>>>>>>>>>> >
>>>>>>>>>>> > I saw that there already exists a reverting PR to bring back
>>>>>>>>>>> Spark 1.4 and 1.5 APIs based on this AS-IS suggestion.
>>>>>>>>>>> >
>>>>>>>>>>> > The AS-IS policy is clearly mentioning that JVM/Scala-level
>>>>>>>>>>> difficulty, and it's nice.
>>>>>>>>>>> >
>>>>>>>>>>> > However, for the other cases, it sounds like `recommending
>>>>>>>>>>> older APIs as much as possible` due to the following.
>>>>>>>>>>> >
>>>>>>>>>>> >      > How long has the API been in Spark?
>>>>>>>>>>> >
>>>>>>>>>>> > We had better be more careful when we add a new policy and
>>>>>>>>>>> should aim not to mislead the users and 3rd party library developers to say
>>>>>>>>>>> "older is better".
>>>>>>>>>>> >
>>>>>>>>>>> > Technically, I'm wondering who will use new APIs in their
>>>>>>>>>>> examples (of books and StackOverflow) if they need to write an additional
>>>>>>>>>>> warning like `this only works at 2.4.0+` always .
>>>>>>>>>>> >
>>>>>>>>>>> > Bests,
>>>>>>>>>>> > Dongjoon.
>>>>>>>>>>> >
>>>>>>>>>>> > On Fri, Mar 6, 2020 at 7:10 PM Mridul Muralidharan <
>>>>>>>>>>> mridul@gmail.com> wrote:
>>>>>>>>>>> >>
>>>>>>>>>>> >> I am in broad agreement with the prposal, as any developer, I
>>>>>>>>>>> prefer
>>>>>>>>>>> >> stable well designed API's :-)
>>>>>>>>>>> >>
>>>>>>>>>>> >> Can we tie the proposal to stability guarantees given by
>>>>>>>>>>> spark and
>>>>>>>>>>> >> reasonable expectation from users ?
>>>>>>>>>>> >> In my opinion, an unstable or evolving could change - while an
>>>>>>>>>>> >> experimental api which has been around for ages should be more
>>>>>>>>>>> >> conservatively handled.
>>>>>>>>>>> >> Which brings in question what are the stability guarantees as
>>>>>>>>>>> >> specified by annotations interacting with the proposal.
>>>>>>>>>>> >>
>>>>>>>>>>> >> Also, can we expand on 'when' an API change can occur ?
>>>>>>>>>>> Since we are
>>>>>>>>>>> >> proposing to diverge from semver.
>>>>>>>>>>> >> Patch release ? Minor release ? Only major release ? Based on
>>>>>>>>>>> 'impact'
>>>>>>>>>>> >> of API ? Stability guarantees ?
>>>>>>>>>>> >>
>>>>>>>>>>> >> Regards,
>>>>>>>>>>> >> Mridul
>>>>>>>>>>> >>
>>>>>>>>>>> >>
>>>>>>>>>>> >>
>>>>>>>>>>> >> On Fri, Mar 6, 2020 at 7:01 PM Michael Armbrust <
>>>>>>>>>>> michael@databricks.com> wrote:
>>>>>>>>>>> >> >
>>>>>>>>>>> >> > I'll start off the vote with a strong +1 (binding).
>>>>>>>>>>> >> >
>>>>>>>>>>> >> > On Fri, Mar 6, 2020 at 1:01 PM Michael Armbrust <
>>>>>>>>>>> michael@databricks.com> wrote:
>>>>>>>>>>> >> >>
>>>>>>>>>>> >> >> I propose to add the following text to Spark's Semantic
>>>>>>>>>>> Versioning policy and adopt it as the rubric that should be used when
>>>>>>>>>>> deciding to break APIs (even at major versions such as 3.0).
>>>>>>>>>>> >> >>
>>>>>>>>>>> >> >>
>>>>>>>>>>> >> >> I'll leave the vote open until Tuesday, March 10th at 2pm.
>>>>>>>>>>> As this is a procedural vote, the measure will pass if there are more
>>>>>>>>>>> favourable votes than unfavourable ones. PMC votes are binding, but the
>>>>>>>>>>> community is encouraged to add their voice to the discussion.
>>>>>>>>>>> >> >>
>>>>>>>>>>> >> >>
>>>>>>>>>>> >> >> [ ] +1 - Spark should adopt this policy.
>>>>>>>>>>> >> >>
>>>>>>>>>>> >> >> [ ] -1  - Spark should not adopt this policy.
>>>>>>>>>>> >> >>
>>>>>>>>>>> >> >>
>>>>>>>>>>> >> >> <new policy>
>>>>>>>>>>> >> >>
>>>>>>>>>>> >> >>
>>>>>>>>>>> >> >> Considerations When Breaking APIs
>>>>>>>>>>> >> >>
>>>>>>>>>>> >> >> The Spark project strives to avoid breaking APIs or
>>>>>>>>>>> silently changing behavior, even at major versions. While this is not
>>>>>>>>>>> always possible, the balance of the following factors should be considered
>>>>>>>>>>> before choosing to break an API.
>>>>>>>>>>> >> >>
>>>>>>>>>>> >> >>
>>>>>>>>>>> >> >> Cost of Breaking an API
>>>>>>>>>>> >> >>
>>>>>>>>>>> >> >> Breaking an API almost always has a non-trivial cost to
>>>>>>>>>>> the users of Spark. A broken API means that Spark programs need to be
>>>>>>>>>>> rewritten before they can be upgraded. However, there are a few
>>>>>>>>>>> considerations when thinking about what the cost will be:
>>>>>>>>>>> >> >>
>>>>>>>>>>> >> >> Usage - an API that is actively used in many different
>>>>>>>>>>> places, is always very costly to break. While it is hard to know usage for
>>>>>>>>>>> sure, there are a bunch of ways that we can estimate:
>>>>>>>>>>> >> >>
>>>>>>>>>>> >> >> How long has the API been in Spark?
>>>>>>>>>>> >> >>
>>>>>>>>>>> >> >> Is the API common even for basic programs?
>>>>>>>>>>> >> >>
>>>>>>>>>>> >> >> How often do we see recent questions in JIRA or mailing
>>>>>>>>>>> lists?
>>>>>>>>>>> >> >>
>>>>>>>>>>> >> >> How often does it appear in StackOverflow or blogs?
>>>>>>>>>>> >> >>
>>>>>>>>>>> >> >> Behavior after the break - How will a program that works
>>>>>>>>>>> today, work after the break? The following are listed roughly in order of
>>>>>>>>>>> increasing severity:
>>>>>>>>>>> >> >>
>>>>>>>>>>> >> >> Will there be a compiler or linker error?
>>>>>>>>>>> >> >>
>>>>>>>>>>> >> >> Will there be a runtime exception?
>>>>>>>>>>> >> >>
>>>>>>>>>>> >> >> Will that exception happen after significant processing
>>>>>>>>>>> has been done?
>>>>>>>>>>> >> >>
>>>>>>>>>>> >> >> Will we silently return different answers? (very hard to
>>>>>>>>>>> debug, might not even notice!)
>>>>>>>>>>> >> >>
>>>>>>>>>>> >> >>
>>>>>>>>>>> >> >> Cost of Maintaining an API
>>>>>>>>>>> >> >>
>>>>>>>>>>> >> >> Of course, the above does not mean that we will never
>>>>>>>>>>> break any APIs. We must also consider the cost both to the project and to
>>>>>>>>>>> our users of keeping the API in question.
>>>>>>>>>>> >> >>
>>>>>>>>>>> >> >> Project Costs - Every API we have needs to be tested and
>>>>>>>>>>> needs to keep working as other parts of the project changes. These costs
>>>>>>>>>>> are significantly exacerbated when external dependencies change (the JVM,
>>>>>>>>>>> Scala, etc). In some cases, while not completely technically infeasible,
>>>>>>>>>>> the cost of maintaining a particular API can become too high.
>>>>>>>>>>> >> >>
>>>>>>>>>>> >> >> User Costs - APIs also have a cognitive cost to users
>>>>>>>>>>> learning Spark or trying to understand Spark programs. This cost becomes
>>>>>>>>>>> even higher when the API in question has confusing or undefined semantics.
>>>>>>>>>>> >> >>
>>>>>>>>>>> >> >>
>>>>>>>>>>> >> >> Alternatives to Breaking an API
>>>>>>>>>>> >> >>
>>>>>>>>>>> >> >> In cases where there is a "Bad API", but where the cost of
>>>>>>>>>>> removal is also high, there are alternatives that should be considered that
>>>>>>>>>>> do not hurt existing users but do address some of the maintenance costs.
>>>>>>>>>>> >> >>
>>>>>>>>>>> >> >>
>>>>>>>>>>> >> >> Avoid Bad APIs - While this is a bit obvious, it is an
>>>>>>>>>>> important point. Anytime we are adding a new interface to Spark we should
>>>>>>>>>>> consider that we might be stuck with this API forever. Think deeply about
>>>>>>>>>>> how new APIs relate to existing ones, as well as how you expect them to
>>>>>>>>>>> evolve over time.
>>>>>>>>>>> >> >>
>>>>>>>>>>> >> >> Deprecation Warnings - All deprecation warnings should
>>>>>>>>>>> point to a clear alternative and should never just say that an API is
>>>>>>>>>>> deprecated.
>>>>>>>>>>> >> >>
>>>>>>>>>>> >> >> Updated Docs - Documentation should point to the "best"
>>>>>>>>>>> recommended way of performing a given task. In the cases where we maintain
>>>>>>>>>>> legacy documentation, we should clearly point to newer APIs and suggest to
>>>>>>>>>>> users the "right" way.
>>>>>>>>>>> >> >>
>>>>>>>>>>> >> >> Community Work - Many people learn Spark by reading blogs
>>>>>>>>>>> and other sites such as StackOverflow. However, many of these resources are
>>>>>>>>>>> out of date. Update them, to reduce the cost of eventually removing
>>>>>>>>>>> deprecated APIs.
>>>>>>>>>>> >> >>
>>>>>>>>>>> >> >>
>>>>>>>>>>> >> >> </new policy>
>>>>>>>>>>> >>
>>>>>>>>>>> >>
>>>>>>>>>>> ---------------------------------------------------------------------
>>>>>>>>>>> >> To unsubscribe e-mail: dev-unsubscribe@spark.apache.org
>>>>>>>>>>> >>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> ---------------------------------------------------------------------
>>>>>>>>>>> To unsubscribe e-mail: dev-unsubscribe@spark.apache.org
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>
>>>>>>>> --
>>>>>>>> ---
>>>>>>>> Takeshi Yamamuro
>>>>>>>>
>>>>>>>
>>>>>
>>>>> --
>>>>> <https://databricks.com/sparkaisummit/north-america>
>>>>>
>>>>
>>>
>>> --
>>> Takuya UESHIN
>>>
>>> http://twitter.com/ueshin
>>>
>>>
>>>
>>
>> --
>> John Zhuge
>>
>
>

Re: [VOTE] Amend Spark's Semantic Versioning Policy

Posted by Reynold Xin <rx...@databricks.com>.
+1

On Mon, Mar 09, 2020 at 3:53 PM, John Zhuge < jzhuge@apache.org > wrote:

> 
> +1 (non-binding)
> 
> 
> On Mon, Mar 9, 2020 at 1:32 PM Michael Heuer < heuermh@ gmail. com (
> heuermh@gmail.com ) > wrote:
> 
> 
>> +1 (non-binding)
>> 
>> 
>> I am disappointed however that this only mentions API and not dependencies
>> and transitive dependencies.
>> 
>> 
>> As Spark does not provide separation between its runtime classpath and the
>> classpath used by applications, I believe Spark's dependencies and
>> transitive dependencies should be considered part of the API for this
>> policy.  Breaking dependency upgrades and incompatible dependency versions
>> are the source of much frustration.
>> 
>> 
>>    michael
>> 
>> 
>> 
>> 
>>> On Mar 9, 2020, at 2:16 PM, Takuya UESHIN < ueshin@ happy-camper. st (
>>> ueshin@happy-camper.st ) > wrote:
>>> 
>>> +1 (binding)
>>> 
>>> 
>>> 
>>> On Mon, Mar 9, 2020 at 11:49 AM Xingbo Jiang < jiangxb1987@ gmail. com (
>>> jiangxb1987@gmail.com ) > wrote:
>>> 
>>> 
>>>> +1 (non-binding)
>>>> 
>>>> 
>>>> Cheers,
>>>> 
>>>> 
>>>> Xingbo
>>>> 
>>>> On Mon, Mar 9, 2020 at 9:35 AM Xiao Li < lixiao@ databricks. com (
>>>> lixiao@databricks.com ) > wrote:
>>>> 
>>>> 
>>>>> +1 (binding)
>>>>> 
>>>>> 
>>>>> Xiao
>>>>> 
>>>>> On Mon, Mar 9, 2020 at 8:33 AM Denny Lee < denny. g. lee@ gmail. com (
>>>>> denny.g.lee@gmail.com ) > wrote:
>>>>> 
>>>>> 
>>>>>> +1 (non-binding)
>>>>>> 
>>>>>> 
>>>>>> On Mon, Mar 9, 2020 at 1:59 AM Hyukjin Kwon < gurwls223@ gmail. com (
>>>>>> gurwls223@gmail.com ) > wrote:
>>>>>> 
>>>>>> 
>>>>>>> The proposal itself seems good as the factors to consider, Thanks Michael.
>>>>>>> 
>>>>>>> 
>>>>>>> Several concerns mentioned look good points, in particular:
>>>>>>> 
>>>>>>> > ... assuming that this is for public stable APIs, not APIs that are
>>>>>>> marked as unstable, evolving, etc. ...
>>>>>>> I would like to confirm this. We already have API annotations such as
>>>>>>> Experimental, Unstable, etc. and the implication of each is still
>>>>>>> effective. If it's for stable APIs, it makes sense to me as well.
>>>>>>> 
>>>>>>> > ... can we expand on 'when' an API change can occur ?  Since we are
>>>>>>> proposing to diverge from semver. ...
>>>>>>> 
>>>>>>> I think this is a good point. If we're proposing to divert from semver,
>>>>>>> the delta compared to semver will have to be clarified to avoid different
>>>>>>> personal interpretations of the somewhat general principles.
>>>>>>> 
>>>>>>> > ... can we narrow down on the migration from Apache Spark 2.4.5 to
>>>>>>> Apache Spark 3.0+? ...
>>>>>>> 
>>>>>>> Assuming these concerns will be addressed, +1 (binding).
>>>>>>> 
>>>>>>>  
>>>>>>> On Mon, Mar 9, 2020 at 4:53 PM, Takeshi Yamamuro < linguin. m. s@ gmail. com (
>>>>>>> linguin.m.s@gmail.com ) > wrote:
>>>>>>> 
>>>>>>> 
>>>>>>>> +1 (non-binding)
>>>>>>>> 
>>>>>>>> 
>>>>>>>> Bests,
>>>>>>>> Takeshi
>>>>>>>> 
>>>>>>>> On Mon, Mar 9, 2020 at 4:52 PM Gengliang Wang < gengliang. wang@ databricks.
>>>>>>>> com ( gengliang.wang@databricks.com ) > wrote:
>>>>>>>> 
>>>>>>>> 
>>>>>>>>> +1 (non-binding)
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> Gengliang
>>>>>>>>> 
>>>>>>>>> On Mon, Mar 9, 2020 at 12:22 AM Matei Zaharia < matei. zaharia@ gmail. com
>>>>>>>>> ( matei.zaharia@gmail.com ) > wrote:
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>>> +1 as well.
>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>> Matei
>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>>> On Mar 9, 2020, at 12:05 AM, Wenchen Fan < cloud0fan@ gmail. com (
>>>>>>>>>>> cloud0fan@gmail.com ) > wrote:
>>>>>>>>>>> 
>>>>>>>>>>> +1 (binding), assuming that this is for public stable APIs, not APIs that
>>>>>>>>>>> are marked as unstable, evolving, etc.
>>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>>>>> On Mon, Mar 9, 2020 at 1:10 AM Ismaël Mejía < iemejia@ gmail. com (
>>>>>>>>>>> iemejia@gmail.com ) > wrote:
>>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>>>>>> +1 (non-binding)
>>>>>>>>>>>> 
>>>>>>>>>>>> Michael's section on the trade-offs of maintaining / removing an API are
>>>>>>>>>>>> one of
>>>>>>>>>>>> the best reads I have seeing in this mailing list. Enthusiast +1
>>>>>>>>>>>> 
>>>>>>>>>>>> On Sat, Mar 7, 2020 at 8:28 PM Dongjoon Hyun < dongjoon. hyun@ gmail. com (
>>>>>>>>>>>> dongjoon.hyun@gmail.com ) > wrote:
>>>>>>>>>>>> >
>>>>>>>>>>>> > This new policy has a good indention, but can we narrow down on the
>>>>>>>>>>>> migration from Apache Spark 2.4.5 to Apache Spark 3.0+?
>>>>>>>>>>>> >
>>>>>>>>>>>> > I saw that there already exists a reverting PR to bring back Spark 1.4
>>>>>>>>>>>> and 1.5 APIs based on this AS-IS suggestion.
>>>>>>>>>>>> >
>>>>>>>>>>>> > The AS-IS policy is clearly mentioning that JVM/Scala-level difficulty,
>>>>>>>>>>>> and it's nice.
>>>>>>>>>>>> >
>>>>>>>>>>>> > However, for the other cases, it sounds like `recommending older APIs as
>>>>>>>>>>>> much as possible` due to the following.
>>>>>>>>>>>> >
>>>>>>>>>>>> >      > How long has the API been in Spark?
>>>>>>>>>>>> >
>>>>>>>>>>>> > We had better be more careful when we add a new policy and should aim
>>>>>>>>>>>> not to mislead the users and 3rd party library developers to say "older is
>>>>>>>>>>>> better".
>>>>>>>>>>>> >
>>>>>>>>>>>> > Technically, I'm wondering who will use new APIs in their examples (of
>>>>>>>>>>>> books and StackOverflow) if they need to write an additional warning like
>>>>>>>>>>>> `this only works at 2.4.0+` always .
>>>>>>>>>>>> >
>>>>>>>>>>>> > Bests,
>>>>>>>>>>>> > Dongjoon.
>>>>>>>>>>>> >
>>>>>>>>>>>> > On Fri, Mar 6, 2020 at 7:10 PM Mridul Muralidharan < mridul@ gmail. com (
>>>>>>>>>>>> mridul@gmail.com ) > wrote:
>>>>>>>>>>>> >>
>>>>>>>>>>>> >> I am in broad agreement with the prposal, as any developer, I prefer
>>>>>>>>>>>> >> stable well designed API's :-)
>>>>>>>>>>>> >>
>>>>>>>>>>>> >> Can we tie the proposal to stability guarantees given by spark and
>>>>>>>>>>>> >> reasonable expectation from users ?
>>>>>>>>>>>> >> In my opinion, an unstable or evolving API could change - while an
>>>>>>>>>>>> >> experimental API which has been around for ages should be more
>>>>>>>>>>>> >> conservatively handled.
>>>>>>>>>>>> >> Which brings up the question of how the stability guarantees
>>>>>>>>>>>> >> specified by the annotations interact with the proposal.
>>>>>>>>>>>> >>
>>>>>>>>>>>> >> Also, can we expand on 'when' an API change can occur ?  Since we are
>>>>>>>>>>>> >> proposing to diverge from semver.
>>>>>>>>>>>> >> Patch release ? Minor release ? Only major release ? Based on 'impact'
>>>>>>>>>>>> >> of API ? Stability guarantees ?
>>>>>>>>>>>> >>
>>>>>>>>>>>> >> Regards,
>>>>>>>>>>>> >> Mridul
>>>>>>>>>>>> >>
>>>>>>>>>>>> >>
>>>>>>>>>>>> >>
>>>>>>>>>>>> >> On Fri, Mar 6, 2020 at 7:01 PM Michael Armbrust
>>>>>>>>>>>> <michael@databricks.com> wrote:
>>>>>>>>>>>> >> >
>>>>>>>>>>>> >> > I'll start off the vote with a strong +1 (binding).
>>>>>>>>>>>> >> >
>>>>>>>>>>>> >> > On Fri, Mar 6, 2020 at 1:01 PM Michael Armbrust
>>>>>>>>>>>> <michael@databricks.com> wrote:
>>>>>>>>>>>> >> >>
>>>>>>>>>>>> >> >> I propose to add the following text to Spark's Semantic Versioning
>>>>>>>>>>>> policy and adopt it as the rubric that should be used when deciding to
>>>>>>>>>>>> break APIs (even at major versions such as 3.0).
>>>>>>>>>>>> >> >>
>>>>>>>>>>>> >> >>
>>>>>>>>>>>> >> >> I'll leave the vote open until Tuesday, March 10th at 2pm. As this
>>>>>>>>>>>> is a procedural vote, the measure will pass if there are more favourable
>>>>>>>>>>>> votes than unfavourable ones. PMC votes are binding, but the community is
>>>>>>>>>>>> encouraged to add their voice to the discussion.
>>>>>>>>>>>> >> >>
>>>>>>>>>>>> >> >>
>>>>>>>>>>>> >> >> [ ] +1 - Spark should adopt this policy.
>>>>>>>>>>>> >> >>
>>>>>>>>>>>> >> >> [ ] -1  - Spark should not adopt this policy.
>>>>>>>>>>>> >> >>
>>>>>>>>>>>> >> >>
>>>>>>>>>>>> >> >> <new policy>
>>>>>>>>>>>> >> >>
>>>>>>>>>>>> >> >>
>>>>>>>>>>>> >> >> Considerations When Breaking APIs
>>>>>>>>>>>> >> >>
>>>>>>>>>>>> >> >> The Spark project strives to avoid breaking APIs or silently
>>>>>>>>>>>> changing behavior, even at major versions. While this is not always
>>>>>>>>>>>> possible, the balance of the following factors should be considered before
>>>>>>>>>>>> choosing to break an API.
>>>>>>>>>>>> >> >>
>>>>>>>>>>>> >> >>
>>>>>>>>>>>> >> >> Cost of Breaking an API
>>>>>>>>>>>> >> >>
>>>>>>>>>>>> >> >> Breaking an API almost always has a non-trivial cost to the users of
>>>>>>>>>>>> Spark. A broken API means that Spark programs need to be rewritten before
>>>>>>>>>>>> they can be upgraded. However, there are a few considerations when
>>>>>>>>>>>> thinking about what the cost will be:
>>>>>>>>>>>> >> >>
>>>>>>>>>>>> >> >> Usage - an API that is actively used in many different places, is
>>>>>>>>>>>> always very costly to break. While it is hard to know usage for sure,
>>>>>>>>>>>> there are a bunch of ways that we can estimate:
>>>>>>>>>>>> >> >>
>>>>>>>>>>>> >> >> How long has the API been in Spark?
>>>>>>>>>>>> >> >>
>>>>>>>>>>>> >> >> Is the API common even for basic programs?
>>>>>>>>>>>> >> >>
>>>>>>>>>>>> >> >> How often do we see recent questions in JIRA or mailing lists?
>>>>>>>>>>>> >> >>
>>>>>>>>>>>> >> >> How often does it appear in StackOverflow or blogs?
>>>>>>>>>>>> >> >>
>>>>>>>>>>>> >> >> Behavior after the break - How will a program that works today, work
>>>>>>>>>>>> after the break? The following are listed roughly in order of increasing
>>>>>>>>>>>> severity:
>>>>>>>>>>>> >> >>
>>>>>>>>>>>> >> >> Will there be a compiler or linker error?
>>>>>>>>>>>> >> >>
>>>>>>>>>>>> >> >> Will there be a runtime exception?
>>>>>>>>>>>> >> >>
>>>>>>>>>>>> >> >> Will that exception happen after significant processing has been
>>>>>>>>>>>> done?
>>>>>>>>>>>> >> >>
>>>>>>>>>>>> >> >> Will we silently return different answers? (very hard to debug,
>>>>>>>>>>>> might not even notice!)
>>>>>>>>>>>> >> >>
>>>>>>>>>>>> >> >>
>>>>>>>>>>>> >> >> Cost of Maintaining an API
>>>>>>>>>>>> >> >>
>>>>>>>>>>>> >> >> Of course, the above does not mean that we will never break any
>>>>>>>>>>>> APIs. We must also consider the cost both to the project and to our users
>>>>>>>>>>>> of keeping the API in question.
>>>>>>>>>>>> >> >>
>>>>>>>>>>>> >> >> Project Costs - Every API we have needs to be tested and needs to
>>>>>>>>>>>> keep working as other parts of the project changes. These costs are
>>>>>>>>>>>> significantly exacerbated when external dependencies change (the JVM,
>>>>>>>>>>>> Scala, etc). In some cases, while not completely technically infeasible,
>>>>>>>>>>>> the cost of maintaining a particular API can become too high.
>>>>>>>>>>>> >> >>
>>>>>>>>>>>> >> >> User Costs - APIs also have a cognitive cost to users learning Spark
>>>>>>>>>>>> or trying to understand Spark programs. This cost becomes even higher when
>>>>>>>>>>>> the API in question has confusing or undefined semantics.
>>>>>>>>>>>> >> >>
>>>>>>>>>>>> >> >>
>>>>>>>>>>>> >> >> Alternatives to Breaking an API
>>>>>>>>>>>> >> >>
>>>>>>>>>>>> >> >> In cases where there is a "Bad API", but where the cost of removal
>>>>>>>>>>>> is also high, there are alternatives that should be considered that do not
>>>>>>>>>>>> hurt existing users but do address some of the maintenance costs.
>>>>>>>>>>>> >> >>
>>>>>>>>>>>> >> >>
>>>>>>>>>>>> >> >> Avoid Bad APIs - While this is a bit obvious, it is an important
>>>>>>>>>>>> point. Anytime we are adding a new interface to Spark we should consider
>>>>>>>>>>>> that we might be stuck with this API forever. Think deeply about how new
>>>>>>>>>>>> APIs relate to existing ones, as well as how you expect them to evolve
>>>>>>>>>>>> over time.
>>>>>>>>>>>> >> >>
>>>>>>>>>>>> >> >> Deprecation Warnings - All deprecation warnings should point to a
>>>>>>>>>>>> clear alternative and should never just say that an API is deprecated.
>>>>>>>>>>>> >> >>
>>>>>>>>>>>> >> >> Updated Docs - Documentation should point to the "best" recommended
>>>>>>>>>>>> way of performing a given task. In the cases where we maintain legacy
>>>>>>>>>>>> documentation, we should clearly point to newer APIs and suggest to users
>>>>>>>>>>>> the "right" way.
>>>>>>>>>>>> >> >>
>>>>>>>>>>>> >> >> Community Work - Many people learn Spark by reading blogs and other
>>>>>>>>>>>> sites such as StackOverflow. However, many of these resources are out of
>>>>>>>>>>>> date. Update them, to reduce the cost of eventually removing deprecated
>>>>>>>>>>>> APIs.
>>>>>>>>>>>> >> >>
>>>>>>>>>>>> >> >>
>>>>>>>>>>>> >> >> </new policy>
>>>>>>>>>>>> >>
>>>>>>>>>>>> >> ---------------------------------------------------------------------
>>>>>>>>>>>> >> To unsubscribe e-mail: dev-unsubscribe@spark.apache.org
>>>>>>>>>>>> >>
>>>>>>>>>>>> 
>>>>>>>>>>>> ---------------------------------------------------------------------
>>>>>>>>>>>> To unsubscribe e-mail: dev-unsubscribe@spark.apache.org
>>>>>>>>>>>> 
>>>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>> 
>>>>>>>> 
>>>>>>>> 
>>>>>>>> 
>>>>>>>> --
>>>>>>>> ---
>>>>>>>> Takeshi Yamamuro
>>>>>>>> 
>>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>> 
>>>>>> 
>>>>> 
>>>>> 
>>>>> 
>>>>> 
>>>>> --
>>>>> <https://databricks.com/sparkaisummit/north-america>
>>>>> 
>>>> 
>>>> 
>>> 
>>> 
>>> 
>>> 
>>> --
>>> Takuya UESHIN
>>> 
>>> http://twitter.com/ueshin
>>> 
>> 
>> 
>> 
>> 
> 
> 
> 
> 
> --
> John Zhuge
>
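
To make the annotation point in the discussion above concrete: the rubric is read as applying to stable public APIs, while interfaces carrying the stability annotations keep their weaker guarantees. A minimal Scala sketch follows; the class and methods are hypothetical, the annotations assume org.apache.spark.annotation, and the exact annotation set available varies by Spark version.

    // Hypothetical class, for illustration only.
    import org.apache.spark.annotation.{DeveloperApi, Experimental}

    class FeatureLookup {
      // Unannotated public method: treated as a stable API, so the
      // cost-of-breaking rubric above would apply before changing it.
      def lookup(key: String): Option[String] = None

      // Experimental: may change or disappear in any release.
      @Experimental
      def lookupApprox(key: String): Option[String] = None

      // DeveloperApi: intended for library developers, with weaker guarantees.
      @DeveloperApi
      def internalStats(): Map[String, Long] = Map.empty
    }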

Re: [VOTE] Amend Spark's Semantic Versioning Policy

Posted by John Zhuge <jz...@apache.org>.
+1 (non-binding)

On Mon, Mar 9, 2020 at 1:32 PM Michael Heuer <he...@gmail.com> wrote:

> +1 (non-binding)
>
> I am disappointed, however, that this only mentions APIs and not dependencies
> and transitive dependencies.
>
> As Spark does not provide separation between its runtime classpath and the
> classpath used by applications, I believe Spark's dependencies and
> transitive dependencies should be considered part of the API for this
> policy.  Breaking dependency upgrades and incompatible dependency versions
> are the source of much frustration.
>
>    michael
>
>
> On Mar 9, 2020, at 2:16 PM, Takuya UESHIN <ue...@happy-camper.st> wrote:
>
> +1 (binding)
>
>
> On Mon, Mar 9, 2020 at 11:49 AM Xingbo Jiang <ji...@gmail.com>
> wrote:
>
>> +1 (non-binding)
>>
>> Cheers,
>>
>> Xingbo
>>
>> On Mon, Mar 9, 2020 at 9:35 AM Xiao Li <li...@databricks.com> wrote:
>>
>>> +1 (binding)
>>>
>>> Xiao
>>>
>>> On Mon, Mar 9, 2020 at 8:33 AM Denny Lee <de...@gmail.com> wrote:
>>>
>>>> +1 (non-binding)
>>>>
>>>> On Mon, Mar 9, 2020 at 1:59 AM Hyukjin Kwon <gu...@gmail.com>
>>>> wrote:
>>>>
>>>>> The proposal itself seems good as the factors to consider, Thanks
>>>>> Michael.
>>>>>
>>>>> Several concerns mentioned look good points, in particular:
>>>>>
>>>>> > ... assuming that this is for public stable APIs, not APIs that are
>>>>> marked as unstable, evolving, etc. ...
>>>>> I would like to confirm this. We already have API annotations such as
>>>>> Experimental, Unstable, etc. and the implication of each is still
>>>>> effective. If it's for stable APIs, it makes sense to me as well.
>>>>>
>>>>> > ... can we expand on 'when' an API change can occur ?  Since we are
>>>>> proposing to diverge from semver. ...
>>>>> I think this is a good point. If we're proposing to divert
>>>>> from semver, the delta compared to semver will have to be clarified to
>>>>> avoid different personal interpretations of the somewhat general principles.
>>>>>
>>>>> > ... can we narrow down on the migration from Apache Spark 2.4.5 to
>>>>> Apache Spark 3.0+? ...
>>>>>
>>>>> Assuming these concerns will be addressed, +1 (binding).
>>>>>
>>>>>
>>>>> On Mon, Mar 9, 2020 at 4:53 PM Takeshi Yamamuro <li...@gmail.com>
>>>>> wrote:
>>>>>
>>>>>> +1 (non-binding)
>>>>>>
>>>>>> Bests,
>>>>>> Takeshi
>>>>>>
>>>>>> On Mon, Mar 9, 2020 at 4:52 PM Gengliang Wang <
>>>>>> gengliang.wang@databricks.com> wrote:
>>>>>>
>>>>>>> +1 (non-binding)
>>>>>>>
>>>>>>> Gengliang
>>>>>>>
>>>>>>> On Mon, Mar 9, 2020 at 12:22 AM Matei Zaharia <
>>>>>>> matei.zaharia@gmail.com> wrote:
>>>>>>>
>>>>>>>> +1 as well.
>>>>>>>>
>>>>>>>> Matei
>>>>>>>>
>>>>>>>> On Mar 9, 2020, at 12:05 AM, Wenchen Fan <cl...@gmail.com>
>>>>>>>> wrote:
>>>>>>>>
>>>>>>>> +1 (binding), assuming that this is for public stable APIs, not
>>>>>>>> APIs that are marked as unstable, evolving, etc.
>>>>>>>>
>>>>>>>> On Mon, Mar 9, 2020 at 1:10 AM Ismaël Mejía <ie...@gmail.com>
>>>>>>>> wrote:
>>>>>>>>
>>>>>>>>> +1 (non-binding)
>>>>>>>>>
>>>>>>>>> Michael's section on the trade-offs of maintaining / removing an
>>>>>>>>> API is one of
>>>>>>>>> the best reads I have seen in this mailing list. Enthusiastic +1
>>>>>>>>>
>>>>>>>>> On Sat, Mar 7, 2020 at 8:28 PM Dongjoon Hyun <
>>>>>>>>> dongjoon.hyun@gmail.com> wrote:
>>>>>>>>> >
>>>>>>>>> > This new policy has a good intention, but can we narrow down on
>>>>>>>>> the migration from Apache Spark 2.4.5 to Apache Spark 3.0+?
>>>>>>>>> >
>>>>>>>>> > I saw that there already exists a reverting PR to bring back
>>>>>>>>> Spark 1.4 and 1.5 APIs based on this AS-IS suggestion.
>>>>>>>>> >
>>>>>>>>> > The AS-IS policy clearly mentions the JVM/Scala-level
>>>>>>>>> difficulty, and that's nice.
>>>>>>>>> >
>>>>>>>>> > However, for the other cases, it sounds like `recommending older
>>>>>>>>> APIs as much as possible` due to the following.
>>>>>>>>> >
>>>>>>>>> >      > How long has the API been in Spark?
>>>>>>>>> >
>>>>>>>>> > We had better be more careful when we add a new policy and
>>>>>>>>> should aim not to mislead the users and 3rd party library developers to say
>>>>>>>>> "older is better".
>>>>>>>>> >
>>>>>>>>> > Technically, I'm wondering who will use new APIs in their
>>>>>>>>> examples (of books and StackOverflow) if they need to write an additional
>>>>>>>>> warning like `this only works at 2.4.0+` always .
>>>>>>>>> >
>>>>>>>>> > Bests,
>>>>>>>>> > Dongjoon.
>>>>>>>>> >
>>>>>>>>> > On Fri, Mar 6, 2020 at 7:10 PM Mridul Muralidharan <
>>>>>>>>> mridul@gmail.com> wrote:
>>>>>>>>> >>
>>>>>>>>> >> I am in broad agreement with the proposal; as any developer, I
>>>>>>>>> prefer
>>>>>>>>> >> stable, well-designed APIs :-)
>>>>>>>>> >>
>>>>>>>>> >> Can we tie the proposal to stability guarantees given by spark
>>>>>>>>> and
>>>>>>>>> >> reasonable expectation from users ?
>>>>>>>>> >> In my opinion, an unstable or evolving API could change - while an
>>>>>>>>> >> experimental API which has been around for ages should be more
>>>>>>>>> >> conservatively handled.
>>>>>>>>> >> Which brings up the question of how the stability guarantees
>>>>>>>>> >> specified by the annotations interact with the proposal.
>>>>>>>>> >>
>>>>>>>>> >> Also, can we expand on 'when' an API change can occur ?  Since
>>>>>>>>> we are
>>>>>>>>> >> proposing to diverge from semver.
>>>>>>>>> >> Patch release ? Minor release ? Only major release ? Based on
>>>>>>>>> 'impact'
>>>>>>>>> >> of API ? Stability guarantees ?
>>>>>>>>> >>
>>>>>>>>> >> Regards,
>>>>>>>>> >> Mridul
>>>>>>>>> >>
>>>>>>>>> >>
>>>>>>>>> >>
>>>>>>>>> >> On Fri, Mar 6, 2020 at 7:01 PM Michael Armbrust <
>>>>>>>>> michael@databricks.com> wrote:
>>>>>>>>> >> >
>>>>>>>>> >> > I'll start off the vote with a strong +1 (binding).
>>>>>>>>> >> >
>>>>>>>>> >> > On Fri, Mar 6, 2020 at 1:01 PM Michael Armbrust <
>>>>>>>>> michael@databricks.com> wrote:
>>>>>>>>> >> >>
>>>>>>>>> >> >> I propose to add the following text to Spark's Semantic
>>>>>>>>> Versioning policy and adopt it as the rubric that should be used when
>>>>>>>>> deciding to break APIs (even at major versions such as 3.0).
>>>>>>>>> >> >>
>>>>>>>>> >> >>
>>>>>>>>> >> >> I'll leave the vote open until Tuesday, March 10th at 2pm.
>>>>>>>>> As this is a procedural vote, the measure will pass if there are more
>>>>>>>>> favourable votes than unfavourable ones. PMC votes are binding, but the
>>>>>>>>> community is encouraged to add their voice to the discussion.
>>>>>>>>> >> >>
>>>>>>>>> >> >>
>>>>>>>>> >> >> [ ] +1 - Spark should adopt this policy.
>>>>>>>>> >> >>
>>>>>>>>> >> >> [ ] -1  - Spark should not adopt this policy.
>>>>>>>>> >> >>
>>>>>>>>> >> >>
>>>>>>>>> >> >> <new policy>
>>>>>>>>> >> >>
>>>>>>>>> >> >>
>>>>>>>>> >> >> Considerations When Breaking APIs
>>>>>>>>> >> >>
>>>>>>>>> >> >> The Spark project strives to avoid breaking APIs or silently
>>>>>>>>> changing behavior, even at major versions. While this is not always
>>>>>>>>> possible, the balance of the following factors should be considered before
>>>>>>>>> choosing to break an API.
>>>>>>>>> >> >>
>>>>>>>>> >> >>
>>>>>>>>> >> >> Cost of Breaking an API
>>>>>>>>> >> >>
>>>>>>>>> >> >> Breaking an API almost always has a non-trivial cost to the
>>>>>>>>> users of Spark. A broken API means that Spark programs need to be rewritten
>>>>>>>>> before they can be upgraded. However, there are a few considerations when
>>>>>>>>> thinking about what the cost will be:
>>>>>>>>> >> >>
>>>>>>>>> >> >> Usage - an API that is actively used in many different
>>>>>>>>> places, is always very costly to break. While it is hard to know usage for
>>>>>>>>> sure, there are a bunch of ways that we can estimate:
>>>>>>>>> >> >>
>>>>>>>>> >> >> How long has the API been in Spark?
>>>>>>>>> >> >>
>>>>>>>>> >> >> Is the API common even for basic programs?
>>>>>>>>> >> >>
>>>>>>>>> >> >> How often do we see recent questions in JIRA or mailing
>>>>>>>>> lists?
>>>>>>>>> >> >>
>>>>>>>>> >> >> How often does it appear in StackOverflow or blogs?
>>>>>>>>> >> >>
>>>>>>>>> >> >> Behavior after the break - How will a program that works
>>>>>>>>> today, work after the break? The following are listed roughly in order of
>>>>>>>>> increasing severity:
>>>>>>>>> >> >>
>>>>>>>>> >> >> Will there be a compiler or linker error?
>>>>>>>>> >> >>
>>>>>>>>> >> >> Will there be a runtime exception?
>>>>>>>>> >> >>
>>>>>>>>> >> >> Will that exception happen after significant processing has
>>>>>>>>> been done?
>>>>>>>>> >> >>
>>>>>>>>> >> >> Will we silently return different answers? (very hard to
>>>>>>>>> debug, might not even notice!)
>>>>>>>>> >> >>
>>>>>>>>> >> >>
>>>>>>>>> >> >> Cost of Maintaining an API
>>>>>>>>> >> >>
>>>>>>>>> >> >> Of course, the above does not mean that we will never break
>>>>>>>>> any APIs. We must also consider the cost both to the project and to our
>>>>>>>>> users of keeping the API in question.
>>>>>>>>> >> >>
>>>>>>>>> >> >> Project Costs - Every API we have needs to be tested and
>>>>>>>>> needs to keep working as other parts of the project changes. These costs
>>>>>>>>> are significantly exacerbated when external dependencies change (the JVM,
>>>>>>>>> Scala, etc). In some cases, while not completely technically infeasible,
>>>>>>>>> the cost of maintaining a particular API can become too high.
>>>>>>>>> >> >>
>>>>>>>>> >> >> User Costs - APIs also have a cognitive cost to users
>>>>>>>>> learning Spark or trying to understand Spark programs. This cost becomes
>>>>>>>>> even higher when the API in question has confusing or undefined semantics.
>>>>>>>>> >> >>
>>>>>>>>> >> >>
>>>>>>>>> >> >> Alternatives to Breaking an API
>>>>>>>>> >> >>
>>>>>>>>> >> >> In cases where there is a "Bad API", but where the cost of
>>>>>>>>> removal is also high, there are alternatives that should be considered that
>>>>>>>>> do not hurt existing users but do address some of the maintenance costs.
>>>>>>>>> >> >>
>>>>>>>>> >> >>
>>>>>>>>> >> >> Avoid Bad APIs - While this is a bit obvious, it is an
>>>>>>>>> important point. Anytime we are adding a new interface to Spark we should
>>>>>>>>> consider that we might be stuck with this API forever. Think deeply about
>>>>>>>>> how new APIs relate to existing ones, as well as how you expect them to
>>>>>>>>> evolve over time.
>>>>>>>>> >> >>
>>>>>>>>> >> >> Deprecation Warnings - All deprecation warnings should point
>>>>>>>>> to a clear alternative and should never just say that an API is deprecated.
>>>>>>>>> >> >>
>>>>>>>>> >> >> Updated Docs - Documentation should point to the "best"
>>>>>>>>> recommended way of performing a given task. In the cases where we maintain
>>>>>>>>> legacy documentation, we should clearly point to newer APIs and suggest to
>>>>>>>>> users the "right" way.
>>>>>>>>> >> >>
>>>>>>>>> >> >> Community Work - Many people learn Spark by reading blogs
>>>>>>>>> and other sites such as StackOverflow. However, many of these resources are
>>>>>>>>> out of date. Update them, to reduce the cost of eventually removing
>>>>>>>>> deprecated APIs.
>>>>>>>>> >> >>
>>>>>>>>> >> >>
>>>>>>>>> >> >> </new policy>
>>>>>>>>> >>
>>>>>>>>> >>
>>>>>>>>> ---------------------------------------------------------------------
>>>>>>>>> >> To unsubscribe e-mail: dev-unsubscribe@spark.apache.org
>>>>>>>>> >>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> ---------------------------------------------------------------------
>>>>>>>>> To unsubscribe e-mail: dev-unsubscribe@spark.apache.org
>>>>>>>>>
>>>>>>>>>
>>>>>>>>
>>>>>>
>>>>>> --
>>>>>> ---
>>>>>> Takeshi Yamamuro
>>>>>>
>>>>>
>>>
>>> --
>>> <https://databricks.com/sparkaisummit/north-america>
>>>
>>
>
> --
> Takuya UESHIN
>
> http://twitter.com/ueshin
>
>
>

-- 
John Zhuge
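
On the quoted policy's point that deprecation warnings should point to a clear alternative rather than merely say an API is deprecated, a small Scala sketch; the object and methods are made up for illustration, and only the standard @deprecated annotation is assumed.

    // Hypothetical reader API, used only to illustrate the deprecation guideline.
    object JsonReader {
      // The replacement that the deprecation message points users to.
      def readJson(path: String, multiLine: Boolean = false): Seq[String] = Seq.empty

      // Names the alternative and the release in which the deprecation happened.
      @deprecated("Use readJson(path, multiLine) instead", "3.0.0")
      def read(path: String): Seq[String] = readJson(path)
    }

A warning that only said "read is deprecated" would leave users with no migration path, which is exactly what the policy text discourages.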

Re: [VOTE] Amend Spark's Semantic Versioning Policy

Posted by Michael Heuer <he...@gmail.com>.
+1 (non-binding)

I am disappointed, however, that this only mentions APIs and not dependencies and transitive dependencies.

As Spark does not provide separation between its runtime classpath and the classpath used by applications, I believe Spark's dependencies and transitive dependencies should be considered part of the API for this policy.  Breaking dependency upgrades and incompatible dependency versions are the source of much frustration.

   michael
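
A sketch of what this looks like in practice today, illustrative only: the jar name below is made up, and the userClassPathFirst settings are documented but experimental. An application that needs a library version different from the one on Spark's classpath typically has to shade/relocate it in its build, or ask Spark to prefer the application's jars:

    // Illustrative Scala snippet, not a proposal from this thread.
    import org.apache.spark.SparkConf

    val conf = new SparkConf()
      .setAppName("dependency-isolation-sketch")
      // Prefer the application's jars on executors; the driver-side equivalent
      // (spark.driver.userClassPathFirst) generally has to be passed at submit time.
      .set("spark.executor.userClassPathFirst", "true")
      // Hypothetical user-supplied jar that conflicts with Spark's own version.
      .set("spark.jars", "libs/guava-28.0-jre.jar")

Neither knob removes the underlying issue: Spark's dependencies and transitive dependencies behave as part of its API surface for applications sharing its classpath.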


> On Mar 9, 2020, at 2:16 PM, Takuya UESHIN <ue...@happy-camper.st> wrote:
> 
> +1 (binding)
> 
> 
> On Mon, Mar 9, 2020 at 11:49 AM Xingbo Jiang <jiangxb1987@gmail.com <ma...@gmail.com>> wrote:
> +1 (non-binding)
> 
> Cheers,
> 
> Xingbo
> 
> On Mon, Mar 9, 2020 at 9:35 AM Xiao Li <lixiao@databricks.com <ma...@databricks.com>> wrote:
> +1 (binding)
> 
> Xiao
> 
> On Mon, Mar 9, 2020 at 8:33 AM Denny Lee <denny.g.lee@gmail.com <ma...@gmail.com>> wrote:
> +1 (non-binding)
> 
> On Mon, Mar 9, 2020 at 1:59 AM Hyukjin Kwon <gurwls223@gmail.com <ma...@gmail.com>> wrote:
> The proposal itself seems good as the factors to consider, Thanks Michael.
> 
> Several concerns mentioned look good points, in particular:
> 
> > ... assuming that this is for public stable APIs, not APIs that are marked as unstable, evolving, etc. ...
> I would like to confirm this. We already have API annotations such as Experimental, Unstable, etc. and the implication of each is still effective. If it's for stable APIs, it makes sense to me as well.
> 
> > ... can we expand on 'when' an API change can occur ?  Since we are proposing to diverge from semver. ...
> I think this is a good point. If we're proposing to divert from semver, the delta compared to semver will have to be clarified to avoid different personal interpretations of the somewhat general principles.
> 
> > ... can we narrow down on the migration from Apache Spark 2.4.5 to Apache Spark 3.0+? ...
> 
> Assuming these concerns will be addressed, +1 (binding).
> 
>  
> On Mon, Mar 9, 2020 at 4:53 PM Takeshi Yamamuro <linguin.m.s@gmail.com <ma...@gmail.com>> wrote:
> +1 (non-binding)
> 
> Bests,
> Takeshi
> 
> On Mon, Mar 9, 2020 at 4:52 PM Gengliang Wang <gengliang.wang@databricks.com <ma...@databricks.com>> wrote:
> +1 (non-binding)
> 
> Gengliang
> 
> On Mon, Mar 9, 2020 at 12:22 AM Matei Zaharia <matei.zaharia@gmail.com <ma...@gmail.com>> wrote:
> +1 as well.
> 
> Matei
> 
>> On Mar 9, 2020, at 12:05 AM, Wenchen Fan <cloud0fan@gmail.com <ma...@gmail.com>> wrote:
>> 
>> +1 (binding), assuming that this is for public stable APIs, not APIs that are marked as unstable, evolving, etc.
>> 
>> On Mon, Mar 9, 2020 at 1:10 AM Ismaël Mejía <iemejia@gmail.com <ma...@gmail.com>> wrote:
>> +1 (non-binding)
>> 
>> Michael's section on the trade-offs of maintaining / removing an API is one of
>> the best reads I have seen in this mailing list. Enthusiastic +1
>> 
>> On Sat, Mar 7, 2020 at 8:28 PM Dongjoon Hyun <dongjoon.hyun@gmail.com <ma...@gmail.com>> wrote:
>> >
>> > This new policy has a good intention, but can we narrow down on the migration from Apache Spark 2.4.5 to Apache Spark 3.0+?
>> >
>> > I saw that there already exists a reverting PR to bring back Spark 1.4 and 1.5 APIs based on this AS-IS suggestion.
>> >
>> > The AS-IS policy clearly mentions the JVM/Scala-level difficulty, and that's nice.
>> >
>> > However, for the other cases, it sounds like `recommending older APIs as much as possible` due to the following.
>> >
>> >      > How long has the API been in Spark?
>> >
>> > We had better be more careful when we add a new policy and should aim not to mislead the users and 3rd party library developers to say "older is better".
>> >
>> > Technically, I'm wondering who will use new APIs in their examples (of books and StackOverflow) if they need to write an additional warning like `this only works at 2.4.0+` always .
>> >
>> > Bests,
>> > Dongjoon.
>> >
>> > On Fri, Mar 6, 2020 at 7:10 PM Mridul Muralidharan <mridul@gmail.com <ma...@gmail.com>> wrote:
>> >>
>> >> I am in broad agreement with the proposal; as any developer, I prefer
>> >> stable, well-designed APIs :-)
>> >>
>> >> Can we tie the proposal to stability guarantees given by spark and
>> >> reasonable expectation from users ?
>> >> In my opinion, an unstable or evolving API could change - while an
>> >> experimental API which has been around for ages should be more
>> >> conservatively handled.
>> >> Which brings up the question of how the stability guarantees
>> >> specified by the annotations interact with the proposal.
>> >>
>> >> Also, can we expand on 'when' an API change can occur ?  Since we are
>> >> proposing to diverge from semver.
>> >> Patch release ? Minor release ? Only major release ? Based on 'impact'
>> >> of API ? Stability guarantees ?
>> >>
>> >> Regards,
>> >> Mridul
>> >>
>> >>
>> >>
>> >> On Fri, Mar 6, 2020 at 7:01 PM Michael Armbrust <michael@databricks.com <ma...@databricks.com>> wrote:
>> >> >
>> >> > I'll start off the vote with a strong +1 (binding).
>> >> >
>> >> > On Fri, Mar 6, 2020 at 1:01 PM Michael Armbrust <michael@databricks.com <ma...@databricks.com>> wrote:
>> >> >>
>> >> >> I propose to add the following text to Spark's Semantic Versioning policy and adopt it as the rubric that should be used when deciding to break APIs (even at major versions such as 3.0).
>> >> >>
>> >> >>
>> >> >> I'll leave the vote open until Tuesday, March 10th at 2pm. As this is a procedural vote, the measure will pass if there are more favourable votes than unfavourable ones. PMC votes are binding, but the community is encouraged to add their voice to the discussion.
>> >> >>
>> >> >>
>> >> >> [ ] +1 - Spark should adopt this policy.
>> >> >>
>> >> >> [ ] -1  - Spark should not adopt this policy.
>> >> >>
>> >> >>
>> >> >> <new policy>
>> >> >>
>> >> >>
>> >> >> Considerations When Breaking APIs
>> >> >>
>> >> >> The Spark project strives to avoid breaking APIs or silently changing behavior, even at major versions. While this is not always possible, the balance of the following factors should be considered before choosing to break an API.
>> >> >>
>> >> >>
>> >> >> Cost of Breaking an API
>> >> >>
>> >> >> Breaking an API almost always has a non-trivial cost to the users of Spark. A broken API means that Spark programs need to be rewritten before they can be upgraded. However, there are a few considerations when thinking about what the cost will be:
>> >> >>
>> >> >> Usage - an API that is actively used in many different places, is always very costly to break. While it is hard to know usage for sure, there are a bunch of ways that we can estimate:
>> >> >>
>> >> >> How long has the API been in Spark?
>> >> >>
>> >> >> Is the API common even for basic programs?
>> >> >>
>> >> >> How often do we see recent questions in JIRA or mailing lists?
>> >> >>
>> >> >> How often does it appear in StackOverflow or blogs?
>> >> >>
>> >> >> Behavior after the break - How will a program that works today, work after the break? The following are listed roughly in order of increasing severity:
>> >> >>
>> >> >> Will there be a compiler or linker error?
>> >> >>
>> >> >> Will there be a runtime exception?
>> >> >>
>> >> >> Will that exception happen after significant processing has been done?
>> >> >>
>> >> >> Will we silently return different answers? (very hard to debug, might not even notice!)
>> >> >>
>> >> >>
>> >> >> Cost of Maintaining an API
>> >> >>
>> >> >> Of course, the above does not mean that we will never break any APIs. We must also consider the cost both to the project and to our users of keeping the API in question.
>> >> >>
>> >> >> Project Costs - Every API we have needs to be tested and needs to keep working as other parts of the project changes. These costs are significantly exacerbated when external dependencies change (the JVM, Scala, etc). In some cases, while not completely technically infeasible, the cost of maintaining a particular API can become too high.
>> >> >>
>> >> >> User Costs - APIs also have a cognitive cost to users learning Spark or trying to understand Spark programs. This cost becomes even higher when the API in question has confusing or undefined semantics.
>> >> >>
>> >> >>
>> >> >> Alternatives to Breaking an API
>> >> >>
>> >> >> In cases where there is a "Bad API", but where the cost of removal is also high, there are alternatives that should be considered that do not hurt existing users but do address some of the maintenance costs.
>> >> >>
>> >> >>
>> >> >> Avoid Bad APIs - While this is a bit obvious, it is an important point. Anytime we are adding a new interface to Spark we should consider that we might be stuck with this API forever. Think deeply about how new APIs relate to existing ones, as well as how you expect them to evolve over time.
>> >> >>
>> >> >> Deprecation Warnings - All deprecation warnings should point to a clear alternative and should never just say that an API is deprecated.
>> >> >>
>> >> >> Updated Docs - Documentation should point to the "best" recommended way of performing a given task. In the cases where we maintain legacy documentation, we should clearly point to newer APIs and suggest to users the "right" way.
>> >> >>
>> >> >> Community Work - Many people learn Spark by reading blogs and other sites such as StackOverflow. However, many of these resources are out of date. Update them, to reduce the cost of eventually removing deprecated APIs.
>> >> >>
>> >> >>
>> >> >> </new policy>
>> >>
>> >> ---------------------------------------------------------------------
>> >> To unsubscribe e-mail: dev-unsubscribe@spark.apache.org <ma...@spark.apache.org>
>> >>
>> 
>> ---------------------------------------------------------------------
>> To unsubscribe e-mail: dev-unsubscribe@spark.apache.org <ma...@spark.apache.org>
>> 
> 
> 
> 
> -- 
> ---
> Takeshi Yamamuro
> 
> 
> -- 
>  <https://databricks.com/sparkaisummit/north-america>
> 
> -- 
> Takuya UESHIN
> 
> http://twitter.com/ueshin <http://twitter.com/ueshin>

Re: [VOTE] Amend Spark's Semantic Versioning Policy

Posted by Takuya UESHIN <ue...@happy-camper.st>.
+1 (binding)


On Mon, Mar 9, 2020 at 11:49 AM Xingbo Jiang <ji...@gmail.com> wrote:

> +1 (non-binding)
>
> Cheers,
>
> Xingbo
>
> On Mon, Mar 9, 2020 at 9:35 AM Xiao Li <li...@databricks.com> wrote:
>
>> +1 (binding)
>>
>> Xiao
>>
>> On Mon, Mar 9, 2020 at 8:33 AM Denny Lee <de...@gmail.com> wrote:
>>
>>> +1 (non-binding)
>>>
>>> On Mon, Mar 9, 2020 at 1:59 AM Hyukjin Kwon <gu...@gmail.com> wrote:
>>>
>>>> The proposal itself seems good as the factors to consider, Thanks
>>>> Michael.
>>>>
>>>> Several concerns mentioned look good points, in particular:
>>>>
>>>> > ... assuming that this is for public stable APIs, not APIs that are
>>>> marked as unstable, evolving, etc. ...
>>>> I would like to confirm this. We already have API annotations such as
>>>> Experimental, Unstable, etc. and the implication of each is still
>>>> effective. If it's for stable APIs, it makes sense to me as well.
>>>>
>>>> > ... can we expand on 'when' an API change can occur ?  Since we are
>>>> proposing to diverge from semver. ...
>>>> I think this is a good point. If we're proposing to divert from semver,
>>>> the delta compared to semver will have to be clarified to avoid different
>>>> personal interpretations of the somewhat general principles.
>>>>
>>>> > ... can we narrow down on the migration from Apache Spark 2.4.5 to
>>>> Apache Spark 3.0+? ...
>>>>
>>>> Assuming these concerns will be addressed, +1 (binding).
>>>>
>>>>
>>>> On Mon, Mar 9, 2020 at 4:53 PM Takeshi Yamamuro <li...@gmail.com> wrote:
>>>>
>>>>> +1 (non-binding)
>>>>>
>>>>> Bests,
>>>>> Takeshi
>>>>>
>>>>> On Mon, Mar 9, 2020 at 4:52 PM Gengliang Wang <
>>>>> gengliang.wang@databricks.com> wrote:
>>>>>
>>>>>> +1 (non-binding)
>>>>>>
>>>>>> Gengliang
>>>>>>
>>>>>> On Mon, Mar 9, 2020 at 12:22 AM Matei Zaharia <
>>>>>> matei.zaharia@gmail.com> wrote:
>>>>>>
>>>>>>> +1 as well.
>>>>>>>
>>>>>>> Matei
>>>>>>>
>>>>>>> On Mar 9, 2020, at 12:05 AM, Wenchen Fan <cl...@gmail.com>
>>>>>>> wrote:
>>>>>>>
>>>>>>> +1 (binding), assuming that this is for public stable APIs, not APIs
>>>>>>> that are marked as unstable, evolving, etc.
>>>>>>>
>>>>>>> On Mon, Mar 9, 2020 at 1:10 AM Ismaël Mejía <ie...@gmail.com>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> +1 (non-binding)
>>>>>>>>
>>>>>>>> Michael's section on the trade-offs of maintaining / removing an
>>>>>>>> API is one of
>>>>>>>> the best reads I have seen in this mailing list. Enthusiastic +1
>>>>>>>>
>>>>>>>> On Sat, Mar 7, 2020 at 8:28 PM Dongjoon Hyun <
>>>>>>>> dongjoon.hyun@gmail.com> wrote:
>>>>>>>> >
>>>>>>>> > This new policy has a good intention, but can we narrow down on
>>>>>>>> the migration from Apache Spark 2.4.5 to Apache Spark 3.0+?
>>>>>>>> >
>>>>>>>> > I saw that there already exists a reverting PR to bring back
>>>>>>>> Spark 1.4 and 1.5 APIs based on this AS-IS suggestion.
>>>>>>>> >
>>>>>>>> > The AS-IS policy clearly mentions the JVM/Scala-level
>>>>>>>> difficulty, and that's nice.
>>>>>>>> >
>>>>>>>> > However, for the other cases, it sounds like `recommending older
>>>>>>>> APIs as much as possible` due to the following.
>>>>>>>> >
>>>>>>>> >      > How long has the API been in Spark?
>>>>>>>> >
>>>>>>>> > We had better be more careful when we add a new policy and should
>>>>>>>> aim not to mislead the users and 3rd party library developers to say "older
>>>>>>>> is better".
>>>>>>>> >
>>>>>>>> > Technically, I'm wondering who will use new APIs in their
>>>>>>>> examples (of books and StackOverflow) if they need to write an additional
>>>>>>>> warning like `this only works at 2.4.0+` always .
>>>>>>>> >
>>>>>>>> > Bests,
>>>>>>>> > Dongjoon.
>>>>>>>> >
>>>>>>>> > On Fri, Mar 6, 2020 at 7:10 PM Mridul Muralidharan <
>>>>>>>> mridul@gmail.com> wrote:
>>>>>>>> >>
>>>>>>>> >> I am in broad agreement with the proposal; as any developer, I
>>>>>>>> prefer
>>>>>>>> >> stable, well-designed APIs :-)
>>>>>>>> >>
>>>>>>>> >> Can we tie the proposal to stability guarantees given by spark
>>>>>>>> and
>>>>>>>> >> reasonable expectation from users ?
>>>>>>>> >> In my opinion, an unstable or evolving API could change - while an
>>>>>>>> >> experimental API which has been around for ages should be more
>>>>>>>> >> conservatively handled.
>>>>>>>> >> Which brings up the question of how the stability guarantees
>>>>>>>> >> specified by the annotations interact with the proposal.
>>>>>>>> >>
>>>>>>>> >> Also, can we expand on 'when' an API change can occur ?  Since
>>>>>>>> we are
>>>>>>>> >> proposing to diverge from semver.
>>>>>>>> >> Patch release ? Minor release ? Only major release ? Based on
>>>>>>>> 'impact'
>>>>>>>> >> of API ? Stability guarantees ?
>>>>>>>> >>
>>>>>>>> >> Regards,
>>>>>>>> >> Mridul
>>>>>>>> >>
>>>>>>>> >>
>>>>>>>> >>
>>>>>>>> >> On Fri, Mar 6, 2020 at 7:01 PM Michael Armbrust <
>>>>>>>> michael@databricks.com> wrote:
>>>>>>>> >> >
>>>>>>>> >> > I'll start off the vote with a strong +1 (binding).
>>>>>>>> >> >
>>>>>>>> >> > On Fri, Mar 6, 2020 at 1:01 PM Michael Armbrust <
>>>>>>>> michael@databricks.com> wrote:
>>>>>>>> >> >>
>>>>>>>> >> >> I propose to add the following text to Spark's Semantic
>>>>>>>> Versioning policy and adopt it as the rubric that should be used when
>>>>>>>> deciding to break APIs (even at major versions such as 3.0).
>>>>>>>> >> >>
>>>>>>>> >> >>
>>>>>>>> >> >> I'll leave the vote open until Tuesday, March 10th at 2pm. As
>>>>>>>> this is a procedural vote, the measure will pass if there are more
>>>>>>>> favourable votes than unfavourable ones. PMC votes are binding, but the
>>>>>>>> community is encouraged to add their voice to the discussion.
>>>>>>>> >> >>
>>>>>>>> >> >>
>>>>>>>> >> >> [ ] +1 - Spark should adopt this policy.
>>>>>>>> >> >>
>>>>>>>> >> >> [ ] -1  - Spark should not adopt this policy.
>>>>>>>> >> >>
>>>>>>>> >> >>
>>>>>>>> >> >> <new policy>
>>>>>>>> >> >>
>>>>>>>> >> >>
>>>>>>>> >> >> Considerations When Breaking APIs
>>>>>>>> >> >>
>>>>>>>> >> >> The Spark project strives to avoid breaking APIs or silently
>>>>>>>> changing behavior, even at major versions. While this is not always
>>>>>>>> possible, the balance of the following factors should be considered before
>>>>>>>> choosing to break an API.
>>>>>>>> >> >>
>>>>>>>> >> >>
>>>>>>>> >> >> Cost of Breaking an API
>>>>>>>> >> >>
>>>>>>>> >> >> Breaking an API almost always has a non-trivial cost to the
>>>>>>>> users of Spark. A broken API means that Spark programs need to be rewritten
>>>>>>>> before they can be upgraded. However, there are a few considerations when
>>>>>>>> thinking about what the cost will be:
>>>>>>>> >> >>
>>>>>>>> >> >> Usage - an API that is actively used in many different
>>>>>>>> places, is always very costly to break. While it is hard to know usage for
>>>>>>>> sure, there are a bunch of ways that we can estimate:
>>>>>>>> >> >>
>>>>>>>> >> >> How long has the API been in Spark?
>>>>>>>> >> >>
>>>>>>>> >> >> Is the API common even for basic programs?
>>>>>>>> >> >>
>>>>>>>> >> >> How often do we see recent questions in JIRA or mailing lists?
>>>>>>>> >> >>
>>>>>>>> >> >> How often does it appear in StackOverflow or blogs?
>>>>>>>> >> >>
>>>>>>>> >> >> Behavior after the break - How will a program that works
>>>>>>>> today, work after the break? The following are listed roughly in order of
>>>>>>>> increasing severity:
>>>>>>>> >> >>
>>>>>>>> >> >> Will there be a compiler or linker error?
>>>>>>>> >> >>
>>>>>>>> >> >> Will there be a runtime exception?
>>>>>>>> >> >>
>>>>>>>> >> >> Will that exception happen after significant processing has
>>>>>>>> been done?
>>>>>>>> >> >>
>>>>>>>> >> >> Will we silently return different answers? (very hard to
>>>>>>>> debug, might not even notice!)
>>>>>>>> >> >>
>>>>>>>> >> >>
>>>>>>>> >> >> Cost of Maintaining an API
>>>>>>>> >> >>
>>>>>>>> >> >> Of course, the above does not mean that we will never break
>>>>>>>> any APIs. We must also consider the cost both to the project and to our
>>>>>>>> users of keeping the API in question.
>>>>>>>> >> >>
>>>>>>>> >> >> Project Costs - Every API we have needs to be tested and
>>>>>>>> needs to keep working as other parts of the project changes. These costs
>>>>>>>> are significantly exacerbated when external dependencies change (the JVM,
>>>>>>>> Scala, etc). In some cases, while not completely technically infeasible,
>>>>>>>> the cost of maintaining a particular API can become too high.
>>>>>>>> >> >>
>>>>>>>> >> >> User Costs - APIs also have a cognitive cost to users
>>>>>>>> learning Spark or trying to understand Spark programs. This cost becomes
>>>>>>>> even higher when the API in question has confusing or undefined semantics.
>>>>>>>> >> >>
>>>>>>>> >> >>
>>>>>>>> >> >> Alternatives to Breaking an API
>>>>>>>> >> >>
>>>>>>>> >> >> In cases where there is a "Bad API", but where the cost of
>>>>>>>> removal is also high, there are alternatives that should be considered that
>>>>>>>> do not hurt existing users but do address some of the maintenance costs.
>>>>>>>> >> >>
>>>>>>>> >> >>
>>>>>>>> >> >> Avoid Bad APIs - While this is a bit obvious, it is an
>>>>>>>> important point. Anytime we are adding a new interface to Spark we should
>>>>>>>> consider that we might be stuck with this API forever. Think deeply about
>>>>>>>> how new APIs relate to existing ones, as well as how you expect them to
>>>>>>>> evolve over time.
>>>>>>>> >> >>
>>>>>>>> >> >> Deprecation Warnings - All deprecation warnings should point
>>>>>>>> to a clear alternative and should never just say that an API is deprecated.
>>>>>>>> >> >>
>>>>>>>> >> >> Updated Docs - Documentation should point to the "best"
>>>>>>>> recommended way of performing a given task. In the cases where we maintain
>>>>>>>> legacy documentation, we should clearly point to newer APIs and suggest to
>>>>>>>> users the "right" way.
>>>>>>>> >> >>
>>>>>>>> >> >> Community Work - Many people learn Spark by reading blogs and
>>>>>>>> other sites such as StackOverflow. However, many of these resources are out
>>>>>>>> of date. Update them, to reduce the cost of eventually removing deprecated
>>>>>>>> APIs.
>>>>>>>> >> >>
>>>>>>>> >> >>
>>>>>>>> >> >> </new policy>
>>>>>>>> >>
>>>>>>>> >>
>>>>>>>> ---------------------------------------------------------------------
>>>>>>>> >> To unsubscribe e-mail: dev-unsubscribe@spark.apache.org
>>>>>>>> >>
>>>>>>>>
>>>>>>>>
>>>>>>>> ---------------------------------------------------------------------
>>>>>>>> To unsubscribe e-mail: dev-unsubscribe@spark.apache.org
>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>
>>>>> --
>>>>> ---
>>>>> Takeshi Yamamuro
>>>>>
>>>>
>>
>> --
>> <https://databricks.com/sparkaisummit/north-america>
>>
>

-- 
Takuya UESHIN

http://twitter.com/ueshin
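
As a toy illustration of the "behavior after the break" ladder in the quoted policy (all names below are hypothetical): a renamed method fails loudly at compile time, while a silently changed result is the severe case the policy warns about.

    // Hypothetical v1 API.
    object MathUtil {
      def mean(xs: Seq[Double]): Double = xs.sum / xs.size
    }

    // Hypothetical v2. Renaming mean to average breaks callers at compile time,
    // the mildest outcome in the policy's list. Quietly skipping NaNs instead
    // would keep old programs compiling and running while returning different
    // answers, the outcome the policy ranks as hardest to debug.
    object MathUtilV2 {
      def average(xs: Seq[Double]): Double = {
        val clean = xs.filterNot(_.isNaN)
        clean.sum / clean.size
      }
    }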

Re: [VOTE] Amend Spark's Semantic Versioning Policy

Posted by Xingbo Jiang <ji...@gmail.com>.
+1 (non-binding)

Cheers,

Xingbo

On Mon, Mar 9, 2020 at 9:35 AM Xiao Li <li...@databricks.com> wrote:

> +1 (binding)
>
> Xiao
>
> On Mon, Mar 9, 2020 at 8:33 AM Denny Lee <de...@gmail.com> wrote:
>
>> +1 (non-binding)
>>
>> On Mon, Mar 9, 2020 at 1:59 AM Hyukjin Kwon <gu...@gmail.com> wrote:
>>
>>> The proposal itself seems good as the factors to consider, Thanks
>>> Michael.
>>>
>>> Several concerns mentioned look good points, in particular:
>>>
>>> > ... assuming that this is for public stable APIs, not APIs that are
>>> marked as unstable, evolving, etc. ...
>>> I would like to confirm this. We already have API annotations such as
>>> Experimental, Unstable, etc. and the implication of each is still
>>> effective. If it's for stable APIs, it makes sense to me as well.
>>>
>>> > ... can we expand on 'when' an API change can occur ?  Since we are
>>> proposing to diverge from semver. ...
>>> I think this is a good point. If we're proposing to divert from semver,
>>> the delta compared to semver will have to be clarified to avoid different
>>> personal interpretations of the somewhat general principles.
>>>
>>> > ... can we narrow down on the migration from Apache Spark 2.4.5 to
>>> Apache Spark 3.0+? ...
>>>
>>> Assuming these concerns will be addressed, +1 (binding).
>>>
>>>
>>> On Mon, Mar 9, 2020 at 4:53 PM Takeshi Yamamuro <li...@gmail.com> wrote:
>>>
>>>> +1 (non-binding)
>>>>
>>>> Bests,
>>>> Takeshi
>>>>
>>>> On Mon, Mar 9, 2020 at 4:52 PM Gengliang Wang <
>>>> gengliang.wang@databricks.com> wrote:
>>>>
>>>>> +1 (non-binding)
>>>>>
>>>>> Gengliang
>>>>>
>>>>> On Mon, Mar 9, 2020 at 12:22 AM Matei Zaharia <ma...@gmail.com>
>>>>> wrote:
>>>>>
>>>>>> +1 as well.
>>>>>>
>>>>>> Matei
>>>>>>
>>>>>> On Mar 9, 2020, at 12:05 AM, Wenchen Fan <cl...@gmail.com> wrote:
>>>>>>
>>>>>> +1 (binding), assuming that this is for public stable APIs, not APIs
>>>>>> that are marked as unstable, evolving, etc.
>>>>>>
>>>>>> On Mon, Mar 9, 2020 at 1:10 AM Ismaël Mejía <ie...@gmail.com>
>>>>>> wrote:
>>>>>>
>>>>>>> +1 (non-binding)
>>>>>>>
>>>>>>> Michael's section on the trade-offs of maintaining / removing an API
>>>>>>> is one of
>>>>>>> the best reads I have seen in this mailing list. Enthusiastic +1
>>>>>>>
>>>>>>> On Sat, Mar 7, 2020 at 8:28 PM Dongjoon Hyun <
>>>>>>> dongjoon.hyun@gmail.com> wrote:
>>>>>>> >
>>>>>>> > This new policy has a good intention, but can we narrow down on the
>>>>>>> the migration from Apache Spark 2.4.5 to Apache Spark 3.0+?
>>>>>>> >
>>>>>>> > I saw that there already exists a reverting PR to bring back Spark
>>>>>>> 1.4 and 1.5 APIs based on this AS-IS suggestion.
>>>>>>> >
>>>>>>> > The AS-IS policy clearly mentions the JVM/Scala-level
>>>>>>> difficulty, and that's nice.
>>>>>>> >
>>>>>>> > However, for the other cases, it sounds like `recommending older
>>>>>>> APIs as much as possible` due to the following.
>>>>>>> >
>>>>>>> >      > How long has the API been in Spark?
>>>>>>> >
>>>>>>> > We had better be more careful when we add a new policy and should
>>>>>>> aim not to mislead the users and 3rd party library developers to say "older
>>>>>>> is better".
>>>>>>> >
>>>>>>> > Technically, I'm wondering who will use new APIs in their examples
>>>>>>> (of books and StackOverflow) if they need to write an additional warning
>>>>>>> like `this only works at 2.4.0+` always .
>>>>>>> >
>>>>>>> > Bests,
>>>>>>> > Dongjoon.
>>>>>>> >
>>>>>>> > On Fri, Mar 6, 2020 at 7:10 PM Mridul Muralidharan <
>>>>>>> mridul@gmail.com> wrote:
>>>>>>> >>
>>>>>>> >> I am in broad agreement with the proposal; as any developer, I
>>>>>>> prefer
>>>>>>> >> stable, well-designed APIs :-)
>>>>>>> >>
>>>>>>> >> Can we tie the proposal to stability guarantees given by spark and
>>>>>>> >> reasonable expectation from users ?
>>>>>>> >> In my opinion, an unstable or evolving API could change - while an
>>>>>>> >> experimental API which has been around for ages should be more
>>>>>>> >> conservatively handled.
>>>>>>> >> Which brings up the question of how the stability guarantees
>>>>>>> >> specified by the annotations interact with the proposal.
>>>>>>> >>
>>>>>>> >> Also, can we expand on 'when' an API change can occur ?  Since we
>>>>>>> are
>>>>>>> >> proposing to diverge from semver.
>>>>>>> >> Patch release ? Minor release ? Only major release ? Based on
>>>>>>> 'impact'
>>>>>>> >> of API ? Stability guarantees ?
>>>>>>> >>
>>>>>>> >> Regards,
>>>>>>> >> Mridul
>>>>>>> >>
>>>>>>> >>
>>>>>>> >>
>>>>>>> >> On Fri, Mar 6, 2020 at 7:01 PM Michael Armbrust <
>>>>>>> michael@databricks.com> wrote:
>>>>>>> >> >
>>>>>>> >> > I'll start off the vote with a strong +1 (binding).
>>>>>>> >> >
>>>>>>> >> > On Fri, Mar 6, 2020 at 1:01 PM Michael Armbrust <
>>>>>>> michael@databricks.com> wrote:
>>>>>>> >> >>
>>>>>>> >> >> I propose to add the following text to Spark's Semantic
>>>>>>> Versioning policy and adopt it as the rubric that should be used when
>>>>>>> deciding to break APIs (even at major versions such as 3.0).
>>>>>>> >> >>
>>>>>>> >> >>
>>>>>>> >> >> I'll leave the vote open until Tuesday, March 10th at 2pm. As
>>>>>>> this is a procedural vote, the measure will pass if there are more
>>>>>>> favourable votes than unfavourable ones. PMC votes are binding, but the
>>>>>>> community is encouraged to add their voice to the discussion.
>>>>>>> >> >>
>>>>>>> >> >>
>>>>>>> >> >> [ ] +1 - Spark should adopt this policy.
>>>>>>> >> >>
>>>>>>> >> >> [ ] -1  - Spark should not adopt this policy.
>>>>>>> >> >>
>>>>>>> >> >>
>>>>>>> >> >> <new policy>
>>>>>>> >> >>
>>>>>>> >> >>
>>>>>>> >> >> Considerations When Breaking APIs
>>>>>>> >> >>
>>>>>>> >> >> The Spark project strives to avoid breaking APIs or silently
>>>>>>> changing behavior, even at major versions. While this is not always
>>>>>>> possible, the balance of the following factors should be considered before
>>>>>>> choosing to break an API.
>>>>>>> >> >>
>>>>>>> >> >>
>>>>>>> >> >> Cost of Breaking an API
>>>>>>> >> >>
>>>>>>> >> >> Breaking an API almost always has a non-trivial cost to the
>>>>>>> users of Spark. A broken API means that Spark programs need to be rewritten
>>>>>>> before they can be upgraded. However, there are a few considerations when
>>>>>>> thinking about what the cost will be:
>>>>>>> >> >>
>>>>>>> >> >> Usage - an API that is actively used in many different places,
>>>>>>> is always very costly to break. While it is hard to know usage for sure,
>>>>>>> there are a bunch of ways that we can estimate:
>>>>>>> >> >>
>>>>>>> >> >> How long has the API been in Spark?
>>>>>>> >> >>
>>>>>>> >> >> Is the API common even for basic programs?
>>>>>>> >> >>
>>>>>>> >> >> How often do we see recent questions in JIRA or mailing lists?
>>>>>>> >> >>
>>>>>>> >> >> How often does it appear in StackOverflow or blogs?
>>>>>>> >> >>
>>>>>>> >> >> Behavior after the break - How will a program that works
>>>>>>> today, work after the break? The following are listed roughly in order of
>>>>>>> increasing severity:
>>>>>>> >> >>
>>>>>>> >> >> Will there be a compiler or linker error?
>>>>>>> >> >>
>>>>>>> >> >> Will there be a runtime exception?
>>>>>>> >> >>
>>>>>>> >> >> Will that exception happen after significant processing has
>>>>>>> been done?
>>>>>>> >> >>
>>>>>>> >> >> Will we silently return different answers? (very hard to
>>>>>>> debug, might not even notice!)
>>>>>>> >> >>
>>>>>>> >> >>
>>>>>>> >> >> Cost of Maintaining an API
>>>>>>> >> >>
>>>>>>> >> >> Of course, the above does not mean that we will never break
>>>>>>> any APIs. We must also consider the cost both to the project and to our
>>>>>>> users of keeping the API in question.
>>>>>>> >> >>
>>>>>>> >> >> Project Costs - Every API we have needs to be tested and needs
>>>>>>> to keep working as other parts of the project changes. These costs are
>>>>>>> significantly exacerbated when external dependencies change (the JVM,
>>>>>>> Scala, etc). In some cases, while not completely technically infeasible,
>>>>>>> the cost of maintaining a particular API can become too high.
>>>>>>> >> >>
>>>>>>> >> >> User Costs - APIs also have a cognitive cost to users learning
>>>>>>> Spark or trying to understand Spark programs. This cost becomes even higher
>>>>>>> when the API in question has confusing or undefined semantics.
>>>>>>> >> >>
>>>>>>> >> >>
>>>>>>> >> >> Alternatives to Breaking an API
>>>>>>> >> >>
>>>>>>> >> >> In cases where there is a "Bad API", but where the cost of
>>>>>>> removal is also high, there are alternatives that should be considered that
>>>>>>> do not hurt existing users but do address some of the maintenance costs.
>>>>>>> >> >>
>>>>>>> >> >>
>>>>>>> >> >> Avoid Bad APIs - While this is a bit obvious, it is an
>>>>>>> important point. Anytime we are adding a new interface to Spark we should
>>>>>>> consider that we might be stuck with this API forever. Think deeply about
>>>>>>> how new APIs relate to existing ones, as well as how you expect them to
>>>>>>> evolve over time.
>>>>>>> >> >>
>>>>>>> >> >> Deprecation Warnings - All deprecation warnings should point
>>>>>>> to a clear alternative and should never just say that an API is deprecated.
>>>>>>> >> >>
>>>>>>> >> >> Updated Docs - Documentation should point to the "best"
>>>>>>> recommended way of performing a given task. In the cases where we maintain
>>>>>>> legacy documentation, we should clearly point to newer APIs and suggest to
>>>>>>> users the "right" way.
>>>>>>> >> >>
>>>>>>> >> >> Community Work - Many people learn Spark by reading blogs and
>>>>>>> other sites such as StackOverflow. However, many of these resources are out
>>>>>>> of date. Update them, to reduce the cost of eventually removing deprecated
>>>>>>> APIs.
>>>>>>> >> >>
>>>>>>> >> >>
>>>>>>> >> >> </new policy>
>>>>>>> >>
>>>>>>> >>
>>>>>>> ---------------------------------------------------------------------
>>>>>>> >> To unsubscribe e-mail: dev-unsubscribe@spark.apache.org
>>>>>>> >>
>>>>>>>
>>>>>>> ---------------------------------------------------------------------
>>>>>>> To unsubscribe e-mail: dev-unsubscribe@spark.apache.org
>>>>>>>
>>>>>>>
>>>>>>
>>>>
>>>> --
>>>> ---
>>>> Takeshi Yamamuro
>>>>
>>>
>
> --
> <https://databricks.com/sparkaisummit/north-america>
>

Re: [VOTE] Amend Spark's Semantic Versioning Policy

Posted by Xiao Li <li...@databricks.com>.
+1 (binding)

Xiao

On Mon, Mar 9, 2020 at 8:33 AM Denny Lee <de...@gmail.com> wrote:

> +1 (non-binding)
>
> On Mon, Mar 9, 2020 at 1:59 AM Hyukjin Kwon <gu...@gmail.com> wrote:
>
>> The proposal itself seems good as the factors to consider, Thanks Michael.
>>
>> Several concerns mentioned look good points, in particular:
>>
>> > ... assuming that this is for public stable APIs, not APIs that are
>> marked as unstable, evolving, etc. ...
>> I would like to confirm this. We already have API annotations such as
>> Experimental, Unstable, etc. and the implication of each is still
>> effective. If it's for stable APIs, it makes sense to me as well.
>>
>> > ... can we expand on 'when' an API change can occur ?  Since we are
>> proposing to diverge from semver. ...
>> I think this is a good point. If we're proposing to divert from semver,
>> the delta compared to semver will have to be clarified to avoid different
>> personal interpretations of the somewhat general principles.
>>
>> > ... can we narrow down on the migration from Apache Spark 2.4.5 to
>> Apache Spark 3.0+? ...
>>
>> Assuming these concerns will be addressed, +1 (binding).
>>
>>
>> On Mon, Mar 9, 2020 at 4:53 PM Takeshi Yamamuro <li...@gmail.com> wrote:
>>
>>> +1 (non-binding)
>>>
>>> Bests,
>>> Takeshi
>>>
>>> On Mon, Mar 9, 2020 at 4:52 PM Gengliang Wang <
>>> gengliang.wang@databricks.com> wrote:
>>>
>>>> +1 (non-binding)
>>>>
>>>> Gengliang
>>>>
>>>> On Mon, Mar 9, 2020 at 12:22 AM Matei Zaharia <ma...@gmail.com>
>>>> wrote:
>>>>
>>>>> +1 as well.
>>>>>
>>>>> Matei
>>>>>
>>>>> On Mar 9, 2020, at 12:05 AM, Wenchen Fan <cl...@gmail.com> wrote:
>>>>>
>>>>> +1 (binding), assuming that this is for public stable APIs, not APIs
>>>>> that are marked as unstable, evolving, etc.
>>>>>
>>>>> On Mon, Mar 9, 2020 at 1:10 AM Ismaël Mejía <ie...@gmail.com> wrote:
>>>>>
>>>>>> +1 (non-binding)
>>>>>>
>>>>>> Michael's section on the trade-offs of maintaining / removing an API
>>>>>> is one of
>>>>>> the best reads I have seen in this mailing list. Enthusiastic +1
>>>>>>
>>>>>> On Sat, Mar 7, 2020 at 8:28 PM Dongjoon Hyun <do...@gmail.com>
>>>>>> wrote:
>>>>>> >
>>>>>> > This new policy has a good intention, but can we narrow down on the
>>>>>> migration from Apache Spark 2.4.5 to Apache Spark 3.0+?
>>>>>> >
>>>>>> > I saw that there already exists a reverting PR to bring back Spark
>>>>>> 1.4 and 1.5 APIs based on this AS-IS suggestion.
>>>>>> >
>>>>>> > The AS-IS policy clearly mentions the JVM/Scala-level
>>>>>> difficulty, and that's nice.
>>>>>> >
>>>>>> > However, for the other cases, it sounds like `recommending older
>>>>>> APIs as much as possible` due to the following.
>>>>>> >
>>>>>> >      > How long has the API been in Spark?
>>>>>> >
>>>>>> > We had better be more careful when we add a new policy and should
>>>>>> aim not to mislead the users and 3rd party library developers to say "older
>>>>>> is better".
>>>>>> >
>>>>>> > Technically, I'm wondering who will use new APIs in their examples
>>>>>> (of books and StackOverflow) if they need to write an additional warning
>>>>>> like `this only works at 2.4.0+` always .
>>>>>> >
>>>>>> > Bests,
>>>>>> > Dongjoon.
>>>>>> >
>>>>>> > On Fri, Mar 6, 2020 at 7:10 PM Mridul Muralidharan <
>>>>>> mridul@gmail.com> wrote:
>>>>>> >>
>>>>>> >> I am in broad agreement with the proposal; as any developer, I
>>>>>> prefer
>>>>>> >> stable, well-designed APIs :-)
>>>>>> >>
>>>>>> >> Can we tie the proposal to stability guarantees given by spark and
>>>>>> >> reasonable expectation from users ?
>>>>>> >> In my opinion, an unstable or evolving API could change - while an
>>>>>> >> experimental API which has been around for ages should be more
>>>>>> >> conservatively handled.
>>>>>> >> Which brings up the question of how the stability guarantees
>>>>>> >> specified by the annotations interact with the proposal.
>>>>>> >>
>>>>>> >> Also, can we expand on 'when' an API change can occur ?  Since we
>>>>>> are
>>>>>> >> proposing to diverge from semver.
>>>>>> >> Patch release ? Minor release ? Only major release ? Based on
>>>>>> 'impact'
>>>>>> >> of API ? Stability guarantees ?
>>>>>> >>
>>>>>> >> Regards,
>>>>>> >> Mridul
>>>>>> >>
>>>>>> >>
>>>>>> >>
>>>>>> >> On Fri, Mar 6, 2020 at 7:01 PM Michael Armbrust <
>>>>>> michael@databricks.com> wrote:
>>>>>> >> >
>>>>>> >> > I'll start off the vote with a strong +1 (binding).
>>>>>> >> >
>>>>>> >> > On Fri, Mar 6, 2020 at 1:01 PM Michael Armbrust <
>>>>>> michael@databricks.com> wrote:
>>>>>> >> >>
>>>>>> >> >> I propose to add the following text to Spark's Semantic
>>>>>> Versioning policy and adopt it as the rubric that should be used when
>>>>>> deciding to break APIs (even at major versions such as 3.0).
>>>>>> >> >>
>>>>>> >> >>
>>>>>> >> >> I'll leave the vote open until Tuesday, March 10th at 2pm. As
>>>>>> this is a procedural vote, the measure will pass if there are more
>>>>>> favourable votes than unfavourable ones. PMC votes are binding, but the
>>>>>> community is encouraged to add their voice to the discussion.
>>>>>> >> >>
>>>>>> >> >>
>>>>>> >> >> [ ] +1 - Spark should adopt this policy.
>>>>>> >> >>
>>>>>> >> >> [ ] -1  - Spark should not adopt this policy.
>>>>>> >> >>
>>>>>> >> >>
>>>>>> >> >> <new policy>
>>>>>> >> >>
>>>>>> >> >>
>>>>>> >> >> Considerations When Breaking APIs
>>>>>> >> >>
>>>>>> >> >> The Spark project strives to avoid breaking APIs or silently
>>>>>> changing behavior, even at major versions. While this is not always
>>>>>> possible, the balance of the following factors should be considered before
>>>>>> choosing to break an API.
>>>>>> >> >>
>>>>>> >> >>
>>>>>> >> >> Cost of Breaking an API
>>>>>> >> >>
>>>>>> >> >> Breaking an API almost always has a non-trivial cost to the
>>>>>> users of Spark. A broken API means that Spark programs need to be rewritten
>>>>>> before they can be upgraded. However, there are a few considerations when
>>>>>> thinking about what the cost will be:
>>>>>> >> >>
>>>>>> >> >> Usage - an API that is actively used in many different places,
>>>>>> is always very costly to break. While it is hard to know usage for sure,
>>>>>> there are a bunch of ways that we can estimate:
>>>>>> >> >>
>>>>>> >> >> How long has the API been in Spark?
>>>>>> >> >>
>>>>>> >> >> Is the API common even for basic programs?
>>>>>> >> >>
>>>>>> >> >> How often do we see recent questions in JIRA or mailing lists?
>>>>>> >> >>
>>>>>> >> >> How often does it appear in StackOverflow or blogs?
>>>>>> >> >>
>>>>>> >> >> Behavior after the break - How will a program that works today,
>>>>>> work after the break? The following are listed roughly in order of
>>>>>> increasing severity:
>>>>>> >> >>
>>>>>> >> >> Will there be a compiler or linker error?
>>>>>> >> >>
>>>>>> >> >> Will there be a runtime exception?
>>>>>> >> >>
>>>>>> >> >> Will that exception happen after significant processing has
>>>>>> been done?
>>>>>> >> >>
>>>>>> >> >> Will we silently return different answers? (very hard to debug,
>>>>>> might not even notice!)
>>>>>> >> >>
>>>>>> >> >>
>>>>>> >> >> Cost of Maintaining an API
>>>>>> >> >>
>>>>>> >> >> Of course, the above does not mean that we will never break any
>>>>>> APIs. We must also consider the cost both to the project and to our users
>>>>>> of keeping the API in question.
>>>>>> >> >>
>>>>>> >> >> Project Costs - Every API we have needs to be tested and needs
>>>>>> to keep working as other parts of the project changes. These costs are
>>>>>> significantly exacerbated when external dependencies change (the JVM,
>>>>>> Scala, etc). In some cases, while not completely technically infeasible,
>>>>>> the cost of maintaining a particular API can become too high.
>>>>>> >> >>
>>>>>> >> >> User Costs - APIs also have a cognitive cost to users learning
>>>>>> Spark or trying to understand Spark programs. This cost becomes even higher
>>>>>> when the API in question has confusing or undefined semantics.
>>>>>> >> >>
>>>>>> >> >>
>>>>>> >> >> Alternatives to Breaking an API
>>>>>> >> >>
>>>>>> >> >> In cases where there is a "Bad API", but where the cost of
>>>>>> removal is also high, there are alternatives that should be considered that
>>>>>> do not hurt existing users but do address some of the maintenance costs.
>>>>>> >> >>
>>>>>> >> >>
>>>>>> >> >> Avoid Bad APIs - While this is a bit obvious, it is an
>>>>>> important point. Anytime we are adding a new interface to Spark we should
>>>>>> consider that we might be stuck with this API forever. Think deeply about
>>>>>> how new APIs relate to existing ones, as well as how you expect them to
>>>>>> evolve over time.
>>>>>> >> >>
>>>>>> >> >> Deprecation Warnings - All deprecation warnings should point to
>>>>>> a clear alternative and should never just say that an API is deprecated.
>>>>>> >> >>
>>>>>> >> >> Updated Docs - Documentation should point to the "best"
>>>>>> recommended way of performing a given task. In the cases where we maintain
>>>>>> legacy documentation, we should clearly point to newer APIs and suggest to
>>>>>> users the "right" way.
>>>>>> >> >>
>>>>>> >> >> Community Work - Many people learn Spark by reading blogs and
>>>>>> other sites such as StackOverflow. However, many of these resources are out
>>>>>> of date. Update them, to reduce the cost of eventually removing deprecated
>>>>>> APIs.
>>>>>> >> >>
>>>>>> >> >>
>>>>>> >> >> </new policy>
>>>>>> >>
>>>>>> >>
>>>>>> ---------------------------------------------------------------------
>>>>>> >> To unsubscribe e-mail: dev-unsubscribe@spark.apache.org
>>>>>> >>
>>>>>>
>>>>>> ---------------------------------------------------------------------
>>>>>> To unsubscribe e-mail: dev-unsubscribe@spark.apache.org
>>>>>>
>>>>>>
>>>>>
>>>
>>> --
>>> ---
>>> Takeshi Yamamuro
>>>
>>


Re: [VOTE] Amend Spark's Semantic Versioning Policy

Posted by Denny Lee <de...@gmail.com>.
+1 (non-binding)

On Mon, Mar 9, 2020 at 1:59 AM Hyukjin Kwon <gu...@gmail.com> wrote:

> The proposal itself seems good as a set of factors to consider. Thanks, Michael.
>
> Several of the concerns mentioned look like good points, in particular:
>
> > ... assuming that this is for public stable APIs, not APIs that are
> marked as unstable, evolving, etc. ...
> I would like to confirm this. We already have API annotations such as
> Experimental, Unstable, etc., and the implications of each still apply.
> If it's for stable APIs, it makes sense to me as well.
>
> > ... can we expand on 'when' an API change can occur ?  Since we are
> proposing to diverge from semver. ...
> I think this is a good point. If we're proposing to diverge from semver,
> the delta compared to semver will have to be clarified to avoid differing
> personal interpretations of the somewhat general principles.
>
> > ... can we narrow down on the migration from Apache Spark 2.4.5 to
> Apache Spark 3.0+? ...
>
> Assuming these concerns will be addressed, +1 (binding).
>
>
> On Mon, Mar 9, 2020 at 4:53 PM, Takeshi Yamamuro <li...@gmail.com> wrote:
>
>> +1 (non-binding)
>>
>> Bests,
>> Takeshi
>>
>> On Mon, Mar 9, 2020 at 4:52 PM Gengliang Wang <
>> gengliang.wang@databricks.com> wrote:
>>
>>> +1 (non-binding)
>>>
>>> Gengliang
>>>
>>> On Mon, Mar 9, 2020 at 12:22 AM Matei Zaharia <ma...@gmail.com>
>>> wrote:
>>>
>>>> +1 as well.
>>>>
>>>> Matei
>>>>
>>>> On Mar 9, 2020, at 12:05 AM, Wenchen Fan <cl...@gmail.com> wrote:
>>>>
>>>> +1 (binding), assuming that this is for public stable APIs, not APIs
>>>> that are marked as unstable, evolving, etc.
>>>>
>>>> On Mon, Mar 9, 2020 at 1:10 AM Ismaël Mejía <ie...@gmail.com> wrote:
>>>>
>>>>> +1 (non-binding)
>>>>>
>>>>> Michael's section on the trade-offs of maintaining / removing an API
>>>>> is one of
>>>>> the best reads I have seen on this mailing list. Enthusiastic +1
>>>>>
>>>>> On Sat, Mar 7, 2020 at 8:28 PM Dongjoon Hyun <do...@gmail.com>
>>>>> wrote:
>>>>> >
>>>>> > This new policy has a good intention, but can we narrow down on the
>>>>> migration from Apache Spark 2.4.5 to Apache Spark 3.0+?
>>>>> >
>>>>> > I saw that there already exists a reverting PR to bring back Spark
>>>>> 1.4 and 1.5 APIs based on this AS-IS suggestion.
>>>>> >
>>>>> > The AS-IS policy is clearly mentioning that JVM/Scala-level
>>>>> difficulty, and it's nice.
>>>>> >
>>>>> > However, for the other cases, it sounds like `recommending older
>>>>> APIs as much as possible` due to the following.
>>>>> >
>>>>> >      > How long has the API been in Spark?
>>>>> >
>>>>> > We had better be more careful when we add a new policy and should
>>>>> aim not to mislead the users and 3rd party library developers to say "older
>>>>> is better".
>>>>> >
>>>>> > Technically, I'm wondering who will use new APIs in their examples
>>>>> (of books and StackOverflow) if they need to write an additional warning
>>>>> like `this only works at 2.4.0+` always .
>>>>> >
>>>>> > Bests,
>>>>> > Dongjoon.
>>>>> >
>>>>> > On Fri, Mar 6, 2020 at 7:10 PM Mridul Muralidharan <mr...@gmail.com>
>>>>> wrote:
>>>>> >>
>>>>> >> I am in broad agreement with the proposal; like any developer, I prefer
>>>>> >> stable, well-designed APIs :-)
>>>>> >>
>>>>> >> Can we tie the proposal to the stability guarantees given by Spark and
>>>>> >> reasonable expectations from users ?
>>>>> >> In my opinion, an unstable or evolving API could change - while an
>>>>> >> experimental API which has been around for ages should be more
>>>>> >> conservatively handled.
>>>>> >> Which raises the question of how the stability guarantees
>>>>> >> specified by annotations interact with the proposal.
>>>>> >>
>>>>> >> Also, can we expand on 'when' an API change can occur ?  Since we
>>>>> are
>>>>> >> proposing to diverge from semver.
>>>>> >> Patch release ? Minor release ? Only major release ? Based on
>>>>> 'impact'
>>>>> >> of API ? Stability guarantees ?
>>>>> >>
>>>>> >> Regards,
>>>>> >> Mridul
>>>>> >>
>>>>> >>
>>>>> >>
>>>>> >> On Fri, Mar 6, 2020 at 7:01 PM Michael Armbrust <
>>>>> michael@databricks.com> wrote:
>>>>> >> >
>>>>> >> > I'll start off the vote with a strong +1 (binding).
>>>>> >> >
>>>>> >> > On Fri, Mar 6, 2020 at 1:01 PM Michael Armbrust <
>>>>> michael@databricks.com> wrote:
>>>>> >> >>
>>>>> >> >> I propose to add the following text to Spark's Semantic
>>>>> Versioning policy and adopt it as the rubric that should be used when
>>>>> deciding to break APIs (even at major versions such as 3.0).
>>>>> >> >>
>>>>> >> >>
>>>>> >> >> I'll leave the vote open until Tuesday, March 10th at 2pm. As
>>>>> this is a procedural vote, the measure will pass if there are more
>>>>> favourable votes than unfavourable ones. PMC votes are binding, but the
>>>>> community is encouraged to add their voice to the discussion.
>>>>> >> >>
>>>>> >> >>
>>>>> >> >> [ ] +1 - Spark should adopt this policy.
>>>>> >> >>
>>>>> >> >> [ ] -1  - Spark should not adopt this policy.
>>>>> >> >>
>>>>> >> >>
>>>>> >> >> <new policy>
>>>>> >> >>
>>>>> >> >>
>>>>> >> >> Considerations When Breaking APIs
>>>>> >> >>
>>>>> >> >> The Spark project strives to avoid breaking APIs or silently
>>>>> changing behavior, even at major versions. While this is not always
>>>>> possible, the balance of the following factors should be considered before
>>>>> choosing to break an API.
>>>>> >> >>
>>>>> >> >>
>>>>> >> >> Cost of Breaking an API
>>>>> >> >>
>>>>> >> >> Breaking an API almost always has a non-trivial cost to the
>>>>> users of Spark. A broken API means that Spark programs need to be rewritten
>>>>> before they can be upgraded. However, there are a few considerations when
>>>>> thinking about what the cost will be:
>>>>> >> >>
>>>>> >> >> Usage - an API that is actively used in many different places,
>>>>> is always very costly to break. While it is hard to know usage for sure,
>>>>> there are a bunch of ways that we can estimate:
>>>>> >> >>
>>>>> >> >> How long has the API been in Spark?
>>>>> >> >>
>>>>> >> >> Is the API common even for basic programs?
>>>>> >> >>
>>>>> >> >> How often do we see recent questions in JIRA or mailing lists?
>>>>> >> >>
>>>>> >> >> How often does it appear in StackOverflow or blogs?
>>>>> >> >>
>>>>> >> >> Behavior after the break - How will a program that works today,
>>>>> work after the break? The following are listed roughly in order of
>>>>> increasing severity:
>>>>> >> >>
>>>>> >> >> Will there be a compiler or linker error?
>>>>> >> >>
>>>>> >> >> Will there be a runtime exception?
>>>>> >> >>
>>>>> >> >> Will that exception happen after significant processing has been
>>>>> done?
>>>>> >> >>
>>>>> >> >> Will we silently return different answers? (very hard to debug,
>>>>> might not even notice!)
>>>>> >> >>
>>>>> >> >>
>>>>> >> >> Cost of Maintaining an API
>>>>> >> >>
>>>>> >> >> Of course, the above does not mean that we will never break any
>>>>> APIs. We must also consider the cost both to the project and to our users
>>>>> of keeping the API in question.
>>>>> >> >>
>>>>> >> >> Project Costs - Every API we have needs to be tested and needs
>>>>> to keep working as other parts of the project changes. These costs are
>>>>> significantly exacerbated when external dependencies change (the JVM,
>>>>> Scala, etc). In some cases, while not completely technically infeasible,
>>>>> the cost of maintaining a particular API can become too high.
>>>>> >> >>
>>>>> >> >> User Costs - APIs also have a cognitive cost to users learning
>>>>> Spark or trying to understand Spark programs. This cost becomes even higher
>>>>> when the API in question has confusing or undefined semantics.
>>>>> >> >>
>>>>> >> >>
>>>>> >> >> Alternatives to Breaking an API
>>>>> >> >>
>>>>> >> >> In cases where there is a "Bad API", but where the cost of
>>>>> removal is also high, there are alternatives that should be considered that
>>>>> do not hurt existing users but do address some of the maintenance costs.
>>>>> >> >>
>>>>> >> >>
>>>>> >> >> Avoid Bad APIs - While this is a bit obvious, it is an important
>>>>> point. Anytime we are adding a new interface to Spark we should consider
>>>>> that we might be stuck with this API forever. Think deeply about how new
>>>>> APIs relate to existing ones, as well as how you expect them to evolve over
>>>>> time.
>>>>> >> >>
>>>>> >> >> Deprecation Warnings - All deprecation warnings should point to
>>>>> a clear alternative and should never just say that an API is deprecated.
>>>>> >> >>
>>>>> >> >> Updated Docs - Documentation should point to the "best"
>>>>> recommended way of performing a given task. In the cases where we maintain
>>>>> legacy documentation, we should clearly point to newer APIs and suggest to
>>>>> users the "right" way.
>>>>> >> >>
>>>>> >> >> Community Work - Many people learn Spark by reading blogs and
>>>>> other sites such as StackOverflow. However, many of these resources are out
>>>>> of date. Update them, to reduce the cost of eventually removing deprecated
>>>>> APIs.
>>>>> >> >>
>>>>> >> >>
>>>>> >> >> </new policy>
>>>>> >>
>>>>> >>
>>>>> ---------------------------------------------------------------------
>>>>> >> To unsubscribe e-mail: dev-unsubscribe@spark.apache.org
>>>>> >>
>>>>>
>>>>> ---------------------------------------------------------------------
>>>>> To unsubscribe e-mail: dev-unsubscribe@spark.apache.org
>>>>>
>>>>>
>>>>
>>
>> --
>> ---
>> Takeshi Yamamuro
>>
>

Re: [VOTE] Amend Spark's Semantic Versioning Policy

Posted by Hyukjin Kwon <gu...@gmail.com>.
The proposal itself seems good as a set of factors to consider. Thanks, Michael.

Several of the concerns mentioned look like good points, in particular:

> ... assuming that this is for public stable APIs, not APIs that are
marked as unstable, evolving, etc. ...
I would like to confirm this. We already have API annotations such as
Experimental, Unstable, etc., and the implications of each still apply.
If it's for stable APIs, it makes sense to me as well.
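
For concreteness, a minimal sketch of how these markers, together with a
deprecation that names a clear alternative, could look on a Spark-style
interface (the annotation names assume the org.apache.spark.annotation
package shipped with recent releases; the classes themselves are hypothetical
and only for illustration):

    import org.apache.spark.annotation.{Evolving, Stable}

    // Hypothetical user-facing interface, used only to illustrate the markers.
    @Stable
    abstract class WidgetReader {
      // Stable entry point: the compatibility considerations above apply in full.
      def load(path: String): Seq[String]

      // Deprecated, but the message points to a clear alternative.
      @deprecated("Use load(path) instead", "3.0.0")
      def read(path: String): Seq[String] = load(path)
    }

    // A newer interface that is still allowed to change between feature releases.
    @Evolving
    abstract class WidgetStreamReader {
      def loadStream(path: String): Iterator[String]
    }

Read this way, the rubric would apply in full to @Stable surfaces, while
@Evolving or @Unstable ones keep some room to change, which seems to be the
shared assumption in this thread.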

> ... can we expand on 'when' an API change can occur ?  Since we are
proposing to diverge from semver. ...
I think this is a good point. If we're proposing to diverge from semver, the
delta compared to semver will have to be clarified to avoid differing
personal interpretations of the somewhat general principles.

> ... can we narrow down on the migration from Apache Spark 2.4.5 to Apache
Spark 3.0+? ...

Assuming these concerns will be addressed, +1 (binding).


On Mon, Mar 9, 2020 at 4:53 PM, Takeshi Yamamuro <li...@gmail.com> wrote:

> +1 (non-binding)
>
> Bests,
> Takeshi
>
> On Mon, Mar 9, 2020 at 4:52 PM Gengliang Wang <
> gengliang.wang@databricks.com> wrote:
>
>> +1 (non-binding)
>>
>> Gengliang
>>
>> On Mon, Mar 9, 2020 at 12:22 AM Matei Zaharia <ma...@gmail.com>
>> wrote:
>>
>>> +1 as well.
>>>
>>> Matei
>>>
>>> On Mar 9, 2020, at 12:05 AM, Wenchen Fan <cl...@gmail.com> wrote:
>>>
>>> +1 (binding), assuming that this is for public stable APIs, not APIs
>>> that are marked as unstable, evolving, etc.
>>>
>>> On Mon, Mar 9, 2020 at 1:10 AM Ismaël Mejía <ie...@gmail.com> wrote:
>>>
>>>> +1 (non-binding)
>>>>
>>>> Michael's section on the trade-offs of maintaining / removing an API
>>>> is one of
>>>> the best reads I have seen on this mailing list. Enthusiastic +1
>>>>
>>>> On Sat, Mar 7, 2020 at 8:28 PM Dongjoon Hyun <do...@gmail.com>
>>>> wrote:
>>>> >
>>>> > This new policy has a good intention, but can we narrow down on the
>>>> migration from Apache Spark 2.4.5 to Apache Spark 3.0+?
>>>> >
>>>> > I saw that there already exists a reverting PR to bring back Spark
>>>> 1.4 and 1.5 APIs based on this AS-IS suggestion.
>>>> >
>>>> > The AS-IS policy is clearly mentioning that JVM/Scala-level
>>>> difficulty, and it's nice.
>>>> >
>>>> > However, for the other cases, it sounds like `recommending older APIs
>>>> as much as possible` due to the following.
>>>> >
>>>> >      > How long has the API been in Spark?
>>>> >
>>>> > We had better be more careful when we add a new policy and should aim
>>>> not to mislead the users and 3rd party library developers to say "older is
>>>> better".
>>>> >
>>>> > Technically, I'm wondering who will use new APIs in their examples
>>>> (of books and StackOverflow) if they need to write an additional warning
>>>> like `this only works at 2.4.0+` always .
>>>> >
>>>> > Bests,
>>>> > Dongjoon.
>>>> >
>>>> > On Fri, Mar 6, 2020 at 7:10 PM Mridul Muralidharan <mr...@gmail.com>
>>>> wrote:
>>>> >>
>>>> >> I am in broad agreement with the proposal; like any developer, I prefer
>>>> >> stable, well-designed APIs :-)
>>>> >>
>>>> >> Can we tie the proposal to the stability guarantees given by Spark and
>>>> >> reasonable expectations from users ?
>>>> >> In my opinion, an unstable or evolving API could change - while an
>>>> >> experimental API which has been around for ages should be more
>>>> >> conservatively handled.
>>>> >> Which raises the question of how the stability guarantees
>>>> >> specified by annotations interact with the proposal.
>>>> >>
>>>> >> Also, can we expand on 'when' an API change can occur ?  Since we are
>>>> >> proposing to diverge from semver.
>>>> >> Patch release ? Minor release ? Only major release ? Based on
>>>> 'impact'
>>>> >> of API ? Stability guarantees ?
>>>> >>
>>>> >> Regards,
>>>> >> Mridul
>>>> >>
>>>> >>
>>>> >>
>>>> >> On Fri, Mar 6, 2020 at 7:01 PM Michael Armbrust <
>>>> michael@databricks.com> wrote:
>>>> >> >
>>>> >> > I'll start off the vote with a strong +1 (binding).
>>>> >> >
>>>> >> > On Fri, Mar 6, 2020 at 1:01 PM Michael Armbrust <
>>>> michael@databricks.com> wrote:
>>>> >> >>
>>>> >> >> I propose to add the following text to Spark's Semantic
>>>> Versioning policy and adopt it as the rubric that should be used when
>>>> deciding to break APIs (even at major versions such as 3.0).
>>>> >> >>
>>>> >> >>
>>>> >> >> I'll leave the vote open until Tuesday, March 10th at 2pm. As
>>>> this is a procedural vote, the measure will pass if there are more
>>>> favourable votes than unfavourable ones. PMC votes are binding, but the
>>>> community is encouraged to add their voice to the discussion.
>>>> >> >>
>>>> >> >>
>>>> >> >> [ ] +1 - Spark should adopt this policy.
>>>> >> >>
>>>> >> >> [ ] -1  - Spark should not adopt this policy.
>>>> >> >>
>>>> >> >>
>>>> >> >> <new policy>
>>>> >> >>
>>>> >> >>
>>>> >> >> Considerations When Breaking APIs
>>>> >> >>
>>>> >> >> The Spark project strives to avoid breaking APIs or silently
>>>> changing behavior, even at major versions. While this is not always
>>>> possible, the balance of the following factors should be considered before
>>>> choosing to break an API.
>>>> >> >>
>>>> >> >>
>>>> >> >> Cost of Breaking an API
>>>> >> >>
>>>> >> >> Breaking an API almost always has a non-trivial cost to the users
>>>> of Spark. A broken API means that Spark programs need to be rewritten
>>>> before they can be upgraded. However, there are a few considerations when
>>>> thinking about what the cost will be:
>>>> >> >>
>>>> >> >> Usage - an API that is actively used in many different places, is
>>>> always very costly to break. While it is hard to know usage for sure, there
>>>> are a bunch of ways that we can estimate:
>>>> >> >>
>>>> >> >> How long has the API been in Spark?
>>>> >> >>
>>>> >> >> Is the API common even for basic programs?
>>>> >> >>
>>>> >> >> How often do we see recent questions in JIRA or mailing lists?
>>>> >> >>
>>>> >> >> How often does it appear in StackOverflow or blogs?
>>>> >> >>
>>>> >> >> Behavior after the break - How will a program that works today,
>>>> work after the break? The following are listed roughly in order of
>>>> increasing severity:
>>>> >> >>
>>>> >> >> Will there be a compiler or linker error?
>>>> >> >>
>>>> >> >> Will there be a runtime exception?
>>>> >> >>
>>>> >> >> Will that exception happen after significant processing has been
>>>> done?
>>>> >> >>
>>>> >> >> Will we silently return different answers? (very hard to debug,
>>>> might not even notice!)
>>>> >> >>
>>>> >> >>
>>>> >> >> Cost of Maintaining an API
>>>> >> >>
>>>> >> >> Of course, the above does not mean that we will never break any
>>>> APIs. We must also consider the cost both to the project and to our users
>>>> of keeping the API in question.
>>>> >> >>
>>>> >> >> Project Costs - Every API we have needs to be tested and needs to
>>>> keep working as other parts of the project changes. These costs are
>>>> significantly exacerbated when external dependencies change (the JVM,
>>>> Scala, etc). In some cases, while not completely technically infeasible,
>>>> the cost of maintaining a particular API can become too high.
>>>> >> >>
>>>> >> >> User Costs - APIs also have a cognitive cost to users learning
>>>> Spark or trying to understand Spark programs. This cost becomes even higher
>>>> when the API in question has confusing or undefined semantics.
>>>> >> >>
>>>> >> >>
>>>> >> >> Alternatives to Breaking an API
>>>> >> >>
>>>> >> >> In cases where there is a "Bad API", but where the cost of
>>>> removal is also high, there are alternatives that should be considered that
>>>> do not hurt existing users but do address some of the maintenance costs.
>>>> >> >>
>>>> >> >>
>>>> >> >> Avoid Bad APIs - While this is a bit obvious, it is an important
>>>> point. Anytime we are adding a new interface to Spark we should consider
>>>> that we might be stuck with this API forever. Think deeply about how new
>>>> APIs relate to existing ones, as well as how you expect them to evolve over
>>>> time.
>>>> >> >>
>>>> >> >> Deprecation Warnings - All deprecation warnings should point to a
>>>> clear alternative and should never just say that an API is deprecated.
>>>> >> >>
>>>> >> >> Updated Docs - Documentation should point to the "best"
>>>> recommended way of performing a given task. In the cases where we maintain
>>>> legacy documentation, we should clearly point to newer APIs and suggest to
>>>> users the "right" way.
>>>> >> >>
>>>> >> >> Community Work - Many people learn Spark by reading blogs and
>>>> other sites such as StackOverflow. However, many of these resources are out
>>>> of date. Update them, to reduce the cost of eventually removing deprecated
>>>> APIs.
>>>> >> >>
>>>> >> >>
>>>> >> >> </new policy>
>>>> >>
>>>> >> ---------------------------------------------------------------------
>>>> >> To unsubscribe e-mail: dev-unsubscribe@spark.apache.org
>>>> >>
>>>>
>>>> ---------------------------------------------------------------------
>>>> To unsubscribe e-mail: dev-unsubscribe@spark.apache.org
>>>>
>>>>
>>>
>
> --
> ---
> Takeshi Yamamuro
>

Re: [VOTE] Amend Spark's Semantic Versioning Policy

Posted by Takeshi Yamamuro <li...@gmail.com>.
+1 (non-binding)

Bests,
Takeshi

On Mon, Mar 9, 2020 at 4:52 PM Gengliang Wang <ge...@databricks.com>
wrote:

> +1 (non-binding)
>
> Gengliang
>
> On Mon, Mar 9, 2020 at 12:22 AM Matei Zaharia <ma...@gmail.com>
> wrote:
>
>> +1 as well.
>>
>> Matei
>>
>> On Mar 9, 2020, at 12:05 AM, Wenchen Fan <cl...@gmail.com> wrote:
>>
>> +1 (binding), assuming that this is for public stable APIs, not APIs that
>> are marked as unstable, evolving, etc.
>>
>> On Mon, Mar 9, 2020 at 1:10 AM Ismaël Mejía <ie...@gmail.com> wrote:
>>
>>> +1 (non-binding)
>>>
>>> Michael's section on the trade-offs of maintaining / removing an API is
>>> one of
>>> the best reads I have seen on this mailing list. Enthusiastic +1
>>>
>>> On Sat, Mar 7, 2020 at 8:28 PM Dongjoon Hyun <do...@gmail.com>
>>> wrote:
>>> >
>>> > This new policy has a good intention, but can we narrow down on the
>>> migration from Apache Spark 2.4.5 to Apache Spark 3.0+?
>>> >
>>> > I saw that there already exists a reverting PR to bring back Spark 1.4
>>> and 1.5 APIs based on this AS-IS suggestion.
>>> >
>>> > The AS-IS policy is clearly mentioning that JVM/Scala-level
>>> difficulty, and it's nice.
>>> >
>>> > However, for the other cases, it sounds like `recommending older APIs
>>> as much as possible` due to the following.
>>> >
>>> >      > How long has the API been in Spark?
>>> >
>>> > We had better be more careful when we add a new policy and should aim
>>> not to mislead the users and 3rd party library developers to say "older is
>>> better".
>>> >
>>> > Technically, I'm wondering who will use new APIs in their examples (of
>>> books and StackOverflow) if they need to write an additional warning like
>>> `this only works at 2.4.0+` always .
>>> >
>>> > Bests,
>>> > Dongjoon.
>>> >
>>> > On Fri, Mar 6, 2020 at 7:10 PM Mridul Muralidharan <mr...@gmail.com>
>>> wrote:
>>> >>
>>> >> I am in broad agreement with the proposal; like any developer, I prefer
>>> >> stable, well-designed APIs :-)
>>> >>
>>> >> Can we tie the proposal to the stability guarantees given by Spark and
>>> >> reasonable expectations from users ?
>>> >> In my opinion, an unstable or evolving API could change - while an
>>> >> experimental API which has been around for ages should be more
>>> >> conservatively handled.
>>> >> Which raises the question of how the stability guarantees
>>> >> specified by annotations interact with the proposal.
>>> >>
>>> >> Also, can we expand on 'when' an API change can occur ?  Since we are
>>> >> proposing to diverge from semver.
>>> >> Patch release ? Minor release ? Only major release ? Based on 'impact'
>>> >> of API ? Stability guarantees ?
>>> >>
>>> >> Regards,
>>> >> Mridul
>>> >>
>>> >>
>>> >>
>>> >> On Fri, Mar 6, 2020 at 7:01 PM Michael Armbrust <
>>> michael@databricks.com> wrote:
>>> >> >
>>> >> > I'll start off the vote with a strong +1 (binding).
>>> >> >
>>> >> > On Fri, Mar 6, 2020 at 1:01 PM Michael Armbrust <
>>> michael@databricks.com> wrote:
>>> >> >>
>>> >> >> I propose to add the following text to Spark's Semantic Versioning
>>> policy and adopt it as the rubric that should be used when deciding to
>>> break APIs (even at major versions such as 3.0).
>>> >> >>
>>> >> >>
>>> >> >> I'll leave the vote open until Tuesday, March 10th at 2pm. As this
>>> is a procedural vote, the measure will pass if there are more favourable
>>> votes than unfavourable ones. PMC votes are binding, but the community is
>>> encouraged to add their voice to the discussion.
>>> >> >>
>>> >> >>
>>> >> >> [ ] +1 - Spark should adopt this policy.
>>> >> >>
>>> >> >> [ ] -1  - Spark should not adopt this policy.
>>> >> >>
>>> >> >>
>>> >> >> <new policy>
>>> >> >>
>>> >> >>
>>> >> >> Considerations When Breaking APIs
>>> >> >>
>>> >> >> The Spark project strives to avoid breaking APIs or silently
>>> changing behavior, even at major versions. While this is not always
>>> possible, the balance of the following factors should be considered before
>>> choosing to break an API.
>>> >> >>
>>> >> >>
>>> >> >> Cost of Breaking an API
>>> >> >>
>>> >> >> Breaking an API almost always has a non-trivial cost to the users
>>> of Spark. A broken API means that Spark programs need to be rewritten
>>> before they can be upgraded. However, there are a few considerations when
>>> thinking about what the cost will be:
>>> >> >>
>>> >> >> Usage - an API that is actively used in many different places, is
>>> always very costly to break. While it is hard to know usage for sure, there
>>> are a bunch of ways that we can estimate:
>>> >> >>
>>> >> >> How long has the API been in Spark?
>>> >> >>
>>> >> >> Is the API common even for basic programs?
>>> >> >>
>>> >> >> How often do we see recent questions in JIRA or mailing lists?
>>> >> >>
>>> >> >> How often does it appear in StackOverflow or blogs?
>>> >> >>
>>> >> >> Behavior after the break - How will a program that works today,
>>> work after the break? The following are listed roughly in order of
>>> increasing severity:
>>> >> >>
>>> >> >> Will there be a compiler or linker error?
>>> >> >>
>>> >> >> Will there be a runtime exception?
>>> >> >>
>>> >> >> Will that exception happen after significant processing has been
>>> done?
>>> >> >>
>>> >> >> Will we silently return different answers? (very hard to debug,
>>> might not even notice!)
>>> >> >>
>>> >> >>
>>> >> >> Cost of Maintaining an API
>>> >> >>
>>> >> >> Of course, the above does not mean that we will never break any
>>> APIs. We must also consider the cost both to the project and to our users
>>> of keeping the API in question.
>>> >> >>
>>> >> >> Project Costs - Every API we have needs to be tested and needs to
>>> keep working as other parts of the project changes. These costs are
>>> significantly exacerbated when external dependencies change (the JVM,
>>> Scala, etc). In some cases, while not completely technically infeasible,
>>> the cost of maintaining a particular API can become too high.
>>> >> >>
>>> >> >> User Costs - APIs also have a cognitive cost to users learning
>>> Spark or trying to understand Spark programs. This cost becomes even higher
>>> when the API in question has confusing or undefined semantics.
>>> >> >>
>>> >> >>
>>> >> >> Alternatives to Breaking an API
>>> >> >>
>>> >> >> In cases where there is a "Bad API", but where the cost of removal
>>> is also high, there are alternatives that should be considered that do not
>>> hurt existing users but do address some of the maintenance costs.
>>> >> >>
>>> >> >>
>>> >> >> Avoid Bad APIs - While this is a bit obvious, it is an important
>>> point. Anytime we are adding a new interface to Spark we should consider
>>> that we might be stuck with this API forever. Think deeply about how new
>>> APIs relate to existing ones, as well as how you expect them to evolve over
>>> time.
>>> >> >>
>>> >> >> Deprecation Warnings - All deprecation warnings should point to a
>>> clear alternative and should never just say that an API is deprecated.
>>> >> >>
>>> >> >> Updated Docs - Documentation should point to the "best"
>>> recommended way of performing a given task. In the cases where we maintain
>>> legacy documentation, we should clearly point to newer APIs and suggest to
>>> users the "right" way.
>>> >> >>
>>> >> >> Community Work - Many people learn Spark by reading blogs and
>>> other sites such as StackOverflow. However, many of these resources are out
>>> of date. Update them, to reduce the cost of eventually removing deprecated
>>> APIs.
>>> >> >>
>>> >> >>
>>> >> >> </new policy>
>>> >>
>>> >> ---------------------------------------------------------------------
>>> >> To unsubscribe e-mail: dev-unsubscribe@spark.apache.org
>>> >>
>>>
>>> ---------------------------------------------------------------------
>>> To unsubscribe e-mail: dev-unsubscribe@spark.apache.org
>>>
>>>
>>

-- 
---
Takeshi Yamamuro

Re: [VOTE] Amend Spark's Semantic Versioning Policy

Posted by Gengliang Wang <ge...@databricks.com>.
+1 (non-binding)

Gengliang

On Mon, Mar 9, 2020 at 12:22 AM Matei Zaharia <ma...@gmail.com>
wrote:

> +1 as well.
>
> Matei
>
> On Mar 9, 2020, at 12:05 AM, Wenchen Fan <cl...@gmail.com> wrote:
>
> +1 (binding), assuming that this is for public stable APIs, not APIs that
> are marked as unstable, evolving, etc.
>
> On Mon, Mar 9, 2020 at 1:10 AM Ismaël Mejía <ie...@gmail.com> wrote:
>
>> +1 (non-binding)
>>
>> Michael's section on the trade-offs of maintaining / removing an API is
>> one of
>> the best reads I have seen on this mailing list. Enthusiastic +1
>>
>> On Sat, Mar 7, 2020 at 8:28 PM Dongjoon Hyun <do...@gmail.com>
>> wrote:
>> >
>> > This new policy has a good intention, but can we narrow down on the
>> migration from Apache Spark 2.4.5 to Apache Spark 3.0+?
>> >
>> > I saw that there already exists a reverting PR to bring back Spark 1.4
>> and 1.5 APIs based on this AS-IS suggestion.
>> >
>> > The AS-IS policy is clearly mentioning that JVM/Scala-level difficulty,
>> and it's nice.
>> >
>> > However, for the other cases, it sounds like `recommending older APIs
>> as much as possible` due to the following.
>> >
>> >      > How long has the API been in Spark?
>> >
>> > We had better be more careful when we add a new policy and should aim
>> not to mislead the users and 3rd party library developers to say "older is
>> better".
>> >
>> > Technically, I'm wondering who will use new APIs in their examples (of
>> books and StackOverflow) if they need to write an additional warning like
>> `this only works at 2.4.0+` always .
>> >
>> > Bests,
>> > Dongjoon.
>> >
>> > On Fri, Mar 6, 2020 at 7:10 PM Mridul Muralidharan <mr...@gmail.com>
>> wrote:
>> >>
>> >> I am in broad agreement with the proposal; like any developer, I prefer
>> >> stable, well-designed APIs :-)
>> >>
>> >> Can we tie the proposal to the stability guarantees given by Spark and
>> >> reasonable expectations from users ?
>> >> In my opinion, an unstable or evolving API could change - while an
>> >> experimental API which has been around for ages should be more
>> >> conservatively handled.
>> >> Which raises the question of how the stability guarantees
>> >> specified by annotations interact with the proposal.
>> >>
>> >> Also, can we expand on 'when' an API change can occur ?  Since we are
>> >> proposing to diverge from semver.
>> >> Patch release ? Minor release ? Only major release ? Based on 'impact'
>> >> of API ? Stability guarantees ?
>> >>
>> >> Regards,
>> >> Mridul
>> >>
>> >>
>> >>
>> >> On Fri, Mar 6, 2020 at 7:01 PM Michael Armbrust <
>> michael@databricks.com> wrote:
>> >> >
>> >> > I'll start off the vote with a strong +1 (binding).
>> >> >
>> >> > On Fri, Mar 6, 2020 at 1:01 PM Michael Armbrust <
>> michael@databricks.com> wrote:
>> >> >>
>> >> >> I propose to add the following text to Spark's Semantic Versioning
>> policy and adopt it as the rubric that should be used when deciding to
>> break APIs (even at major versions such as 3.0).
>> >> >>
>> >> >>
>> >> >> I'll leave the vote open until Tuesday, March 10th at 2pm. As this
>> is a procedural vote, the measure will pass if there are more favourable
>> votes than unfavourable ones. PMC votes are binding, but the community is
>> encouraged to add their voice to the discussion.
>> >> >>
>> >> >>
>> >> >> [ ] +1 - Spark should adopt this policy.
>> >> >>
>> >> >> [ ] -1  - Spark should not adopt this policy.
>> >> >>
>> >> >>
>> >> >> <new policy>
>> >> >>
>> >> >>
>> >> >> Considerations When Breaking APIs
>> >> >>
>> >> >> The Spark project strives to avoid breaking APIs or silently
>> changing behavior, even at major versions. While this is not always
>> possible, the balance of the following factors should be considered before
>> choosing to break an API.
>> >> >>
>> >> >>
>> >> >> Cost of Breaking an API
>> >> >>
>> >> >> Breaking an API almost always has a non-trivial cost to the users
>> of Spark. A broken API means that Spark programs need to be rewritten
>> before they can be upgraded. However, there are a few considerations when
>> thinking about what the cost will be:
>> >> >>
>> >> >> Usage - an API that is actively used in many different places, is
>> always very costly to break. While it is hard to know usage for sure, there
>> are a bunch of ways that we can estimate:
>> >> >>
>> >> >> How long has the API been in Spark?
>> >> >>
>> >> >> Is the API common even for basic programs?
>> >> >>
>> >> >> How often do we see recent questions in JIRA or mailing lists?
>> >> >>
>> >> >> How often does it appear in StackOverflow or blogs?
>> >> >>
>> >> >> Behavior after the break - How will a program that works today,
>> work after the break? The following are listed roughly in order of
>> increasing severity:
>> >> >>
>> >> >> Will there be a compiler or linker error?
>> >> >>
>> >> >> Will there be a runtime exception?
>> >> >>
>> >> >> Will that exception happen after significant processing has been
>> done?
>> >> >>
>> >> >> Will we silently return different answers? (very hard to debug,
>> might not even notice!)
>> >> >>
>> >> >>
>> >> >> Cost of Maintaining an API
>> >> >>
>> >> >> Of course, the above does not mean that we will never break any
>> APIs. We must also consider the cost both to the project and to our users
>> of keeping the API in question.
>> >> >>
>> >> >> Project Costs - Every API we have needs to be tested and needs to
>> keep working as other parts of the project changes. These costs are
>> significantly exacerbated when external dependencies change (the JVM,
>> Scala, etc). In some cases, while not completely technically infeasible,
>> the cost of maintaining a particular API can become too high.
>> >> >>
>> >> >> User Costs - APIs also have a cognitive cost to users learning
>> Spark or trying to understand Spark programs. This cost becomes even higher
>> when the API in question has confusing or undefined semantics.
>> >> >>
>> >> >>
>> >> >> Alternatives to Breaking an API
>> >> >>
>> >> >> In cases where there is a "Bad API", but where the cost of removal
>> is also high, there are alternatives that should be considered that do not
>> hurt existing users but do address some of the maintenance costs.
>> >> >>
>> >> >>
>> >> >> Avoid Bad APIs - While this is a bit obvious, it is an important
>> point. Anytime we are adding a new interface to Spark we should consider
>> that we might be stuck with this API forever. Think deeply about how new
>> APIs relate to existing ones, as well as how you expect them to evolve over
>> time.
>> >> >>
>> >> >> Deprecation Warnings - All deprecation warnings should point to a
>> clear alternative and should never just say that an API is deprecated.
>> >> >>
>> >> >> Updated Docs - Documentation should point to the "best" recommended
>> way of performing a given task. In the cases where we maintain legacy
>> documentation, we should clearly point to newer APIs and suggest to users
>> the "right" way.
>> >> >>
>> >> >> Community Work - Many people learn Spark by reading blogs and other
>> sites such as StackOverflow. However, many of these resources are out of
>> date. Update them, to reduce the cost of eventually removing deprecated
>> APIs.
>> >> >>
>> >> >>
>> >> >> </new policy>
>> >>
>> >> ---------------------------------------------------------------------
>> >> To unsubscribe e-mail: dev-unsubscribe@spark.apache.org
>> >>
>>
>> ---------------------------------------------------------------------
>> To unsubscribe e-mail: dev-unsubscribe@spark.apache.org
>>
>>
>

Re: [VOTE] Amend Spark's Semantic Versioning Policy

Posted by Matei Zaharia <ma...@gmail.com>.
+1 as well.

Matei

> On Mar 9, 2020, at 12:05 AM, Wenchen Fan <cl...@gmail.com> wrote:
> 
> +1 (binding), assuming that this is for public stable APIs, not APIs that are marked as unstable, evolving, etc.
> 
> On Mon, Mar 9, 2020 at 1:10 AM Ismaël Mejía <iemejia@gmail.com <ma...@gmail.com>> wrote:
> +1 (non-binding)
> 
> Michael's section on the trade-offs of maintaining / removing an API is one of
> the best reads I have seen on this mailing list. Enthusiastic +1
> 
> On Sat, Mar 7, 2020 at 8:28 PM Dongjoon Hyun <dongjoon.hyun@gmail.com <ma...@gmail.com>> wrote:
> >
> > This new policy has a good intention, but can we narrow down on the migration from Apache Spark 2.4.5 to Apache Spark 3.0+?
> >
> > I saw that there already exists a reverting PR to bring back Spark 1.4 and 1.5 APIs based on this AS-IS suggestion.
> >
> > The AS-IS policy is clearly mentioning that JVM/Scala-level difficulty, and it's nice.
> >
> > However, for the other cases, it sounds like `recommending older APIs as much as possible` due to the following.
> >
> >      > How long has the API been in Spark?
> >
> > We had better be more careful when we add a new policy and should aim not to mislead the users and 3rd party library developers to say "older is better".
> >
> > Technically, I'm wondering who will use new APIs in their examples (of books and StackOverflow) if they need to write an additional warning like `this only works at 2.4.0+` always .
> >
> > Bests,
> > Dongjoon.
> >
> > On Fri, Mar 6, 2020 at 7:10 PM Mridul Muralidharan <mridul@gmail.com <ma...@gmail.com>> wrote:
> >>
> >> I am in broad agreement with the prposal, as any developer, I prefer
> >> stable well designed API's :-)
> >>
> >> Can we tie the proposal to stability guarantees given by spark and
> >> reasonable expectation from users ?
> >> In my opinion, an unstable or evolving could change - while an
> >> experimental api which has been around for ages should be more
> >> conservatively handled.
> >> Which brings in question what are the stability guarantees as
> >> specified by annotations interacting with the proposal.
> >>
> >> Also, can we expand on 'when' an API change can occur ?  Since we are
> >> proposing to diverge from semver.
> >> Patch release ? Minor release ? Only major release ? Based on 'impact'
> >> of API ? Stability guarantees ?
> >>
> >> Regards,
> >> Mridul
> >>
> >>
> >>
> >> On Fri, Mar 6, 2020 at 7:01 PM Michael Armbrust <michael@databricks.com <ma...@databricks.com>> wrote:
> >> >
> >> > I'll start off the vote with a strong +1 (binding).
> >> >
> >> > On Fri, Mar 6, 2020 at 1:01 PM Michael Armbrust <michael@databricks.com <ma...@databricks.com>> wrote:
> >> >>
> >> >> I propose to add the following text to Spark's Semantic Versioning policy and adopt it as the rubric that should be used when deciding to break APIs (even at major versions such as 3.0).
> >> >>
> >> >>
> >> >> I'll leave the vote open until Tuesday, March 10th at 2pm. As this is a procedural vote, the measure will pass if there are more favourable votes than unfavourable ones. PMC votes are binding, but the community is encouraged to add their voice to the discussion.
> >> >>
> >> >>
> >> >> [ ] +1 - Spark should adopt this policy.
> >> >>
> >> >> [ ] -1  - Spark should not adopt this policy.
> >> >>
> >> >>
> >> >> <new policy>
> >> >>
> >> >>
> >> >> Considerations When Breaking APIs
> >> >>
> >> >> The Spark project strives to avoid breaking APIs or silently changing behavior, even at major versions. While this is not always possible, the balance of the following factors should be considered before choosing to break an API.
> >> >>
> >> >>
> >> >> Cost of Breaking an API
> >> >>
> >> >> Breaking an API almost always has a non-trivial cost to the users of Spark. A broken API means that Spark programs need to be rewritten before they can be upgraded. However, there are a few considerations when thinking about what the cost will be:
> >> >>
> >> >> Usage - an API that is actively used in many different places, is always very costly to break. While it is hard to know usage for sure, there are a bunch of ways that we can estimate:
> >> >>
> >> >> How long has the API been in Spark?
> >> >>
> >> >> Is the API common even for basic programs?
> >> >>
> >> >> How often do we see recent questions in JIRA or mailing lists?
> >> >>
> >> >> How often does it appear in StackOverflow or blogs?
> >> >>
> >> >> Behavior after the break - How will a program that works today, work after the break? The following are listed roughly in order of increasing severity:
> >> >>
> >> >> Will there be a compiler or linker error?
> >> >>
> >> >> Will there be a runtime exception?
> >> >>
> >> >> Will that exception happen after significant processing has been done?
> >> >>
> >> >> Will we silently return different answers? (very hard to debug, might not even notice!)
> >> >>
> >> >>
> >> >> Cost of Maintaining an API
> >> >>
> >> >> Of course, the above does not mean that we will never break any APIs. We must also consider the cost both to the project and to our users of keeping the API in question.
> >> >>
> >> >> Project Costs - Every API we have needs to be tested and needs to keep working as other parts of the project changes. These costs are significantly exacerbated when external dependencies change (the JVM, Scala, etc). In some cases, while not completely technically infeasible, the cost of maintaining a particular API can become too high.
> >> >>
> >> >> User Costs - APIs also have a cognitive cost to users learning Spark or trying to understand Spark programs. This cost becomes even higher when the API in question has confusing or undefined semantics.
> >> >>
> >> >>
> >> >> Alternatives to Breaking an API
> >> >>
> >> >> In cases where there is a "Bad API", but where the cost of removal is also high, there are alternatives that should be considered that do not hurt existing users but do address some of the maintenance costs.
> >> >>
> >> >>
> >> >> Avoid Bad APIs - While this is a bit obvious, it is an important point. Anytime we are adding a new interface to Spark we should consider that we might be stuck with this API forever. Think deeply about how new APIs relate to existing ones, as well as how you expect them to evolve over time.
> >> >>
> >> >> Deprecation Warnings - All deprecation warnings should point to a clear alternative and should never just say that an API is deprecated.
> >> >>
> >> >> Updated Docs - Documentation should point to the "best" recommended way of performing a given task. In the cases where we maintain legacy documentation, we should clearly point to newer APIs and suggest to users the "right" way.
> >> >>
> >> >> Community Work - Many people learn Spark by reading blogs and other sites such as StackOverflow. However, many of these resources are out of date. Update them, to reduce the cost of eventually removing deprecated APIs.
> >> >>
> >> >>
> >> >> </new policy>
> >>
> >> ---------------------------------------------------------------------
> >> To unsubscribe e-mail: dev-unsubscribe@spark.apache.org <ma...@spark.apache.org>
> >>
> 
> ---------------------------------------------------------------------
> To unsubscribe e-mail: dev-unsubscribe@spark.apache.org <ma...@spark.apache.org>
> 


Re: [VOTE] Amend Spark's Semantic Versioning Policy

Posted by Wenchen Fan <cl...@gmail.com>.
+1 (binding), assuming that this is for public stable APIs, not APIs that
are marked as unstable, evolving, etc.

On Mon, Mar 9, 2020 at 1:10 AM Ismaël Mejía <ie...@gmail.com> wrote:

> +1 (non-binding)
>
> Michael's section on the trade-offs of maintaining / removing an API is
> one of
> the best reads I have seen on this mailing list. Enthusiastic +1
>
> On Sat, Mar 7, 2020 at 8:28 PM Dongjoon Hyun <do...@gmail.com>
> wrote:
> >
> > This new policy has a good intention, but can we narrow down on the
> migration from Apache Spark 2.4.5 to Apache Spark 3.0+?
> >
> > I saw that there already exists a reverting PR to bring back Spark 1.4
> and 1.5 APIs based on this AS-IS suggestion.
> >
> > The AS-IS policy is clearly mentioning that JVM/Scala-level difficulty,
> and it's nice.
> >
> > However, for the other cases, it sounds like `recommending older APIs as
> much as possible` due to the following.
> >
> >      > How long has the API been in Spark?
> >
> > We had better be more careful when we add a new policy and should aim
> not to mislead the users and 3rd party library developers to say "older is
> better".
> >
> > Technically, I'm wondering who will use new APIs in their examples (of
> books and StackOverflow) if they need to write an additional warning like
> `this only works at 2.4.0+` always .
> >
> > Bests,
> > Dongjoon.
> >
> > On Fri, Mar 6, 2020 at 7:10 PM Mridul Muralidharan <mr...@gmail.com>
> wrote:
> >>
> >> I am in broad agreement with the proposal; like any developer, I prefer
> >> stable, well-designed APIs :-)
> >>
> >> Can we tie the proposal to the stability guarantees given by Spark and
> >> reasonable expectations from users ?
> >> In my opinion, an unstable or evolving API could change - while an
> >> experimental API which has been around for ages should be more
> >> conservatively handled.
> >> Which raises the question of how the stability guarantees
> >> specified by annotations interact with the proposal.
> >>
> >> Also, can we expand on 'when' an API change can occur ?  Since we are
> >> proposing to diverge from semver.
> >> Patch release ? Minor release ? Only major release ? Based on 'impact'
> >> of API ? Stability guarantees ?
> >>
> >> Regards,
> >> Mridul
> >>
> >>
> >>
> >> On Fri, Mar 6, 2020 at 7:01 PM Michael Armbrust <mi...@databricks.com>
> wrote:
> >> >
> >> > I'll start off the vote with a strong +1 (binding).
> >> >
> >> > On Fri, Mar 6, 2020 at 1:01 PM Michael Armbrust <
> michael@databricks.com> wrote:
> >> >>
> >> >> I propose to add the following text to Spark's Semantic Versioning
> policy and adopt it as the rubric that should be used when deciding to
> break APIs (even at major versions such as 3.0).
> >> >>
> >> >>
> >> >> I'll leave the vote open until Tuesday, March 10th at 2pm. As this
> is a procedural vote, the measure will pass if there are more favourable
> votes than unfavourable ones. PMC votes are binding, but the community is
> encouraged to add their voice to the discussion.
> >> >>
> >> >>
> >> >> [ ] +1 - Spark should adopt this policy.
> >> >>
> >> >> [ ] -1  - Spark should not adopt this policy.
> >> >>
> >> >>
> >> >> <new policy>
> >> >>
> >> >>
> >> >> Considerations When Breaking APIs
> >> >>
> >> >> The Spark project strives to avoid breaking APIs or silently
> changing behavior, even at major versions. While this is not always
> possible, the balance of the following factors should be considered before
> choosing to break an API.
> >> >>
> >> >>
> >> >> Cost of Breaking an API
> >> >>
> >> >> Breaking an API almost always has a non-trivial cost to the users of
> Spark. A broken API means that Spark programs need to be rewritten before
> they can be upgraded. However, there are a few considerations when thinking
> about what the cost will be:
> >> >>
> >> >> Usage - an API that is actively used in many different places, is
> always very costly to break. While it is hard to know usage for sure, there
> are a bunch of ways that we can estimate:
> >> >>
> >> >> How long has the API been in Spark?
> >> >>
> >> >> Is the API common even for basic programs?
> >> >>
> >> >> How often do we see recent questions in JIRA or mailing lists?
> >> >>
> >> >> How often does it appear in StackOverflow or blogs?
> >> >>
> >> >> Behavior after the break - How will a program that works today, work
> after the break? The following are listed roughly in order of increasing
> severity:
> >> >>
> >> >> Will there be a compiler or linker error?
> >> >>
> >> >> Will there be a runtime exception?
> >> >>
> >> >> Will that exception happen after significant processing has been
> done?
> >> >>
> >> >> Will we silently return different answers? (very hard to debug,
> might not even notice!)
> >> >>
> >> >>
> >> >> Cost of Maintaining an API
> >> >>
> >> >> Of course, the above does not mean that we will never break any
> APIs. We must also consider the cost both to the project and to our users
> of keeping the API in question.
> >> >>
> >> >> Project Costs - Every API we have needs to be tested and needs to
> keep working as other parts of the project changes. These costs are
> significantly exacerbated when external dependencies change (the JVM,
> Scala, etc). In some cases, while not completely technically infeasible,
> the cost of maintaining a particular API can become too high.
> >> >>
> >> >> User Costs - APIs also have a cognitive cost to users learning Spark
> or trying to understand Spark programs. This cost becomes even higher when
> the API in question has confusing or undefined semantics.
> >> >>
> >> >>
> >> >> Alternatives to Breaking an API
> >> >>
> >> >> In cases where there is a "Bad API", but where the cost of removal
> is also high, there are alternatives that should be considered that do not
> hurt existing users but do address some of the maintenance costs.
> >> >>
> >> >>
> >> >> Avoid Bad APIs - While this is a bit obvious, it is an important
> point. Anytime we are adding a new interface to Spark we should consider
> that we might be stuck with this API forever. Think deeply about how new
> APIs relate to existing ones, as well as how you expect them to evolve over
> time.
> >> >>
> >> >> Deprecation Warnings - All deprecation warnings should point to a
> clear alternative and should never just say that an API is deprecated.
> >> >>
> >> >> Updated Docs - Documentation should point to the "best" recommended
> way of performing a given task. In the cases where we maintain legacy
> documentation, we should clearly point to newer APIs and suggest to users
> the "right" way.
> >> >>
> >> >> Community Work - Many people learn Spark by reading blogs and other
> sites such as StackOverflow. However, many of these resources are out of
> date. Update them, to reduce the cost of eventually removing deprecated
> APIs.
> >> >>
> >> >>
> >> >> </new policy>
> >>
> >> ---------------------------------------------------------------------
> >> To unsubscribe e-mail: dev-unsubscribe@spark.apache.org
> >>
>
> ---------------------------------------------------------------------
> To unsubscribe e-mail: dev-unsubscribe@spark.apache.org
>
>

Re: [VOTE] Amend Spark's Semantic Versioning Policy

Posted by Ismaël Mejía <ie...@gmail.com>.
+1 (non-binding)

Michael's section on the trade-offs of maintaining / removing an API is one of
the best reads I have seen on this mailing list. Enthusiastic +1

On Sat, Mar 7, 2020 at 8:28 PM Dongjoon Hyun <do...@gmail.com> wrote:
>
> This new policy has a good intention, but can we narrow down on the migration from Apache Spark 2.4.5 to Apache Spark 3.0+?
>
> I saw that there already exists a reverting PR to bring back Spark 1.4 and 1.5 APIs based on this AS-IS suggestion.
>
> The AS-IS policy is clearly mentioning that JVM/Scala-level difficulty, and it's nice.
>
> However, for the other cases, it sounds like `recommending older APIs as much as possible` due to the following.
>
>      > How long has the API been in Spark?
>
> We had better be more careful when we add a new policy and should aim not to mislead the users and 3rd party library developers to say "older is better".
>
> Technically, I'm wondering who will use new APIs in their examples (of books and StackOverflow) if they need to write an additional warning like `this only works at 2.4.0+` always .
>
> Bests,
> Dongjoon.
>
> On Fri, Mar 6, 2020 at 7:10 PM Mridul Muralidharan <mr...@gmail.com> wrote:
>>
>> I am in broad agreement with the proposal; like any developer, I prefer
>> stable, well-designed APIs :-)
>>
>> Can we tie the proposal to the stability guarantees given by Spark and
>> reasonable expectations from users ?
>> In my opinion, an unstable or evolving API could change - while an
>> experimental API which has been around for ages should be more
>> conservatively handled.
>> Which raises the question of how the stability guarantees
>> specified by annotations interact with the proposal.
>>
>> Also, can we expand on 'when' an API change can occur ?  Since we are
>> proposing to diverge from semver.
>> Patch release ? Minor release ? Only major release ? Based on 'impact'
>> of API ? Stability guarantees ?
>>
>> Regards,
>> Mridul
>>
>>
>>
>> On Fri, Mar 6, 2020 at 7:01 PM Michael Armbrust <mi...@databricks.com> wrote:
>> >
>> > I'll start off the vote with a strong +1 (binding).
>> >
>> > On Fri, Mar 6, 2020 at 1:01 PM Michael Armbrust <mi...@databricks.com> wrote:
>> >>
>> >> I propose to add the following text to Spark's Semantic Versioning policy and adopt it as the rubric that should be used when deciding to break APIs (even at major versions such as 3.0).
>> >>
>> >>
>> >> I'll leave the vote open until Tuesday, March 10th at 2pm. As this is a procedural vote, the measure will pass if there are more favourable votes than unfavourable ones. PMC votes are binding, but the community is encouraged to add their voice to the discussion.
>> >>
>> >>
>> >> [ ] +1 - Spark should adopt this policy.
>> >>
>> >> [ ] -1  - Spark should not adopt this policy.
>> >>
>> >>
>> >> <new policy>
>> >>
>> >>
>> >> Considerations When Breaking APIs
>> >>
>> >> The Spark project strives to avoid breaking APIs or silently changing behavior, even at major versions. While this is not always possible, the balance of the following factors should be considered before choosing to break an API.
>> >>
>> >>
>> >> Cost of Breaking an API
>> >>
>> >> Breaking an API almost always has a non-trivial cost to the users of Spark. A broken API means that Spark programs need to be rewritten before they can be upgraded. However, there are a few considerations when thinking about what the cost will be:
>> >>
>> >> Usage - an API that is actively used in many different places, is always very costly to break. While it is hard to know usage for sure, there are a bunch of ways that we can estimate:
>> >>
>> >> How long has the API been in Spark?
>> >>
>> >> Is the API common even for basic programs?
>> >>
>> >> How often do we see recent questions in JIRA or mailing lists?
>> >>
>> >> How often does it appear in StackOverflow or blogs?
>> >>
>> >> Behavior after the break - How will a program that works today, work after the break? The following are listed roughly in order of increasing severity:
>> >>
>> >> Will there be a compiler or linker error?
>> >>
>> >> Will there be a runtime exception?
>> >>
>> >> Will that exception happen after significant processing has been done?
>> >>
>> >> Will we silently return different answers? (very hard to debug, might not even notice!)
>> >>
>> >>
>> >> Cost of Maintaining an API
>> >>
>> >> Of course, the above does not mean that we will never break any APIs. We must also consider the cost both to the project and to our users of keeping the API in question.
>> >>
>> >> Project Costs - Every API we have needs to be tested and needs to keep working as other parts of the project changes. These costs are significantly exacerbated when external dependencies change (the JVM, Scala, etc). In some cases, while not completely technically infeasible, the cost of maintaining a particular API can become too high.
>> >>
>> >> User Costs - APIs also have a cognitive cost to users learning Spark or trying to understand Spark programs. This cost becomes even higher when the API in question has confusing or undefined semantics.
>> >>
>> >>
>> >> Alternatives to Breaking an API
>> >>
>> >> In cases where there is a "Bad API", but where the cost of removal is also high, there are alternatives that should be considered that do not hurt existing users but do address some of the maintenance costs.
>> >>
>> >>
>> >> Avoid Bad APIs - While this is a bit obvious, it is an important point. Anytime we are adding a new interface to Spark we should consider that we might be stuck with this API forever. Think deeply about how new APIs relate to existing ones, as well as how you expect them to evolve over time.
>> >>
>> >> Deprecation Warnings - All deprecation warnings should point to a clear alternative and should never just say that an API is deprecated.
>> >>
>> >> Updated Docs - Documentation should point to the "best" recommended way of performing a given task. In the cases where we maintain legacy documentation, we should clearly point to newer APIs and suggest to users the "right" way.
>> >>
>> >> Community Work - Many people learn Spark by reading blogs and other sites such as StackOverflow. However, many of these resources are out of date. Update them, to reduce the cost of eventually removing deprecated APIs.
>> >>
>> >>
>> >> </new policy>
>>
>> ---------------------------------------------------------------------
>> To unsubscribe e-mail: dev-unsubscribe@spark.apache.org
>>

---------------------------------------------------------------------
To unsubscribe e-mail: dev-unsubscribe@spark.apache.org


Re: [VOTE] Amend Spark's Semantic Versioning Policy

Posted by Dongjoon Hyun <do...@gmail.com>.
This new policy has a good intention, but can we narrow this down to the
migration from Apache Spark 2.4.5 to Apache Spark 3.0+?

I saw that there is already a PR reverting changes to bring back Spark 1.4
and 1.5 APIs, based on this AS-IS suggestion.

The AS-IS policy clearly mentions the JVM/Scala-level difficulty, which is
nice.

However, for the other cases, it sounds like it is `recommending older APIs
as much as possible` because of the following:

     > How long has the API been in Spark?

We had better be more careful when we add a new policy and should aim not
to mislead users and 3rd-party library developers into thinking "older is
better".

Technically, I'm wondering who will use new APIs in their examples (in
books and on StackOverflow) if they always need to add a warning like
`this only works at 2.4.0+`.

Bests,
Dongjoon.

On Fri, Mar 6, 2020 at 7:10 PM Mridul Muralidharan <mr...@gmail.com> wrote:

> I am in broad agreement with the proposal; like any developer, I prefer
> stable, well-designed APIs :-)
>
> Can we tie the proposal to the stability guarantees given by Spark and
> to reasonable expectations from users?
> In my opinion, an unstable or evolving API could change, while an
> experimental API which has been around for ages should be handled more
> conservatively.
> Which raises the question of how the stability guarantees specified by
> annotations interact with the proposal.
>
> Also, can we expand on 'when' an API change can occur, since we are
> proposing to diverge from semver?
> Patch release? Minor release? Only major release? Based on the 'impact'
> of the API? Stability guarantees?
>
> Regards,
> Mridul
>
>
>
> On Fri, Mar 6, 2020 at 7:01 PM Michael Armbrust <mi...@databricks.com>
> wrote:
> >
> > I'll start off the vote with a strong +1 (binding).
> >
> > On Fri, Mar 6, 2020 at 1:01 PM Michael Armbrust <mi...@databricks.com>
> wrote:
> >>
> >> I propose to add the following text to Spark's Semantic Versioning
> policy and adopt it as the rubric that should be used when deciding to
> break APIs (even at major versions such as 3.0).
> >>
> >>
> >> I'll leave the vote open until Tuesday, March 10th at 2pm. As this is a
> procedural vote, the measure will pass if there are more favourable votes
> than unfavourable ones. PMC votes are binding, but the community is
> encouraged to add their voice to the discussion.
> >>
> >>
> >> [ ] +1 - Spark should adopt this policy.
> >>
> >> [ ] -1  - Spark should not adopt this policy.
> >>
> >>
> >> <new policy>
> >>
> >>
> >> Considerations When Breaking APIs
> >>
> >> The Spark project strives to avoid breaking APIs or silently changing
> behavior, even at major versions. While this is not always possible, the
> balance of the following factors should be considered before choosing to
> break an API.
> >>
> >>
> >> Cost of Breaking an API
> >>
> >> Breaking an API almost always has a non-trivial cost to the users of
> Spark. A broken API means that Spark programs need to be rewritten before
> they can be upgraded. However, there are a few considerations when thinking
> about what the cost will be:
> >>
> >> Usage - an API that is actively used in many different places, is
> always very costly to break. While it is hard to know usage for sure, there
> are a bunch of ways that we can estimate:
> >>
> >> How long has the API been in Spark?
> >>
> >> Is the API common even for basic programs?
> >>
> >> How often do we see recent questions in JIRA or mailing lists?
> >>
> >> How often does it appear in StackOverflow or blogs?
> >>
> >> Behavior after the break - How will a program that works today, work
> after the break? The following are listed roughly in order of increasing
> severity:
> >>
> >> Will there be a compiler or linker error?
> >>
> >> Will there be a runtime exception?
> >>
> >> Will that exception happen after significant processing has been done?
> >>
> >> Will we silently return different answers? (very hard to debug, might
> not even notice!)
> >>
> >>
> >> Cost of Maintaining an API
> >>
> >> Of course, the above does not mean that we will never break any APIs.
> We must also consider the cost both to the project and to our users of
> keeping the API in question.
> >>
> >> Project Costs - Every API we have needs to be tested and needs to keep
> working as other parts of the project changes. These costs are
> significantly exacerbated when external dependencies change (the JVM,
> Scala, etc). In some cases, while not completely technically infeasible,
> the cost of maintaining a particular API can become too high.
> >>
> >> User Costs - APIs also have a cognitive cost to users learning Spark or
> trying to understand Spark programs. This cost becomes even higher when the
> API in question has confusing or undefined semantics.
> >>
> >>
> >> Alternatives to Breaking an API
> >>
> >> In cases where there is a "Bad API", but where the cost of removal is
> also high, there are alternatives that should be considered that do not
> hurt existing users but do address some of the maintenance costs.
> >>
> >>
> >> Avoid Bad APIs - While this is a bit obvious, it is an important point.
> Anytime we are adding a new interface to Spark we should consider that we
> might be stuck with this API forever. Think deeply about how new APIs
> relate to existing ones, as well as how you expect them to evolve over time.
> >>
> >> Deprecation Warnings - All deprecation warnings should point to a clear
> alternative and should never just say that an API is deprecated.
> >>
> >> Updated Docs - Documentation should point to the "best" recommended way
> of performing a given task. In the cases where we maintain legacy
> documentation, we should clearly point to newer APIs and suggest to users
> the "right" way.
> >>
> >> Community Work - Many people learn Spark by reading blogs and other
> sites such as StackOverflow. However, many of these resources are out of
> date. Update them, to reduce the cost of eventually removing deprecated
> APIs.
> >>
> >>
> >> </new policy>
>
> ---------------------------------------------------------------------
> To unsubscribe e-mail: dev-unsubscribe@spark.apache.org
>
>

Re: [VOTE] Amend Spark's Semantic Versioning Policy

Posted by Mridul Muralidharan <mr...@gmail.com>.
I am in broad agreement with the proposal; like any developer, I prefer
stable, well-designed APIs :-)

Can we tie the proposal to the stability guarantees given by Spark and
to reasonable expectations from users?
In my opinion, an unstable or evolving API could change, while an
experimental API which has been around for ages should be handled more
conservatively.
Which raises the question of how the stability guarantees specified by
annotations interact with the proposal.

Also, can we expand on 'when' an API change can occur, since we are
proposing to diverge from semver?
Patch release? Minor release? Only major release? Based on the 'impact'
of the API? Stability guarantees?

Regards,
Mridul



On Fri, Mar 6, 2020 at 7:01 PM Michael Armbrust <mi...@databricks.com> wrote:
>
> I'll start off the vote with a strong +1 (binding).
>
> On Fri, Mar 6, 2020 at 1:01 PM Michael Armbrust <mi...@databricks.com> wrote:
>>
>> I propose to add the following text to Spark's Semantic Versioning policy and adopt it as the rubric that should be used when deciding to break APIs (even at major versions such as 3.0).
>>
>>
>> I'll leave the vote open until Tuesday, March 10th at 2pm. As this is a procedural vote, the measure will pass if there are more favourable votes than unfavourable ones. PMC votes are binding, but the community is encouraged to add their voice to the discussion.
>>
>>
>> [ ] +1 - Spark should adopt this policy.
>>
>> [ ] -1  - Spark should not adopt this policy.
>>
>>
>> <new policy>
>>
>>
>> Considerations When Breaking APIs
>>
>> The Spark project strives to avoid breaking APIs or silently changing behavior, even at major versions. While this is not always possible, the balance of the following factors should be considered before choosing to break an API.
>>
>>
>> Cost of Breaking an API
>>
>> Breaking an API almost always has a non-trivial cost to the users of Spark. A broken API means that Spark programs need to be rewritten before they can be upgraded. However, there are a few considerations when thinking about what the cost will be:
>>
>> Usage - an API that is actively used in many different places, is always very costly to break. While it is hard to know usage for sure, there are a bunch of ways that we can estimate:
>>
>> How long has the API been in Spark?
>>
>> Is the API common even for basic programs?
>>
>> How often do we see recent questions in JIRA or mailing lists?
>>
>> How often does it appear in StackOverflow or blogs?
>>
>> Behavior after the break - How will a program that works today, work after the break? The following are listed roughly in order of increasing severity:
>>
>> Will there be a compiler or linker error?
>>
>> Will there be a runtime exception?
>>
>> Will that exception happen after significant processing has been done?
>>
>> Will we silently return different answers? (very hard to debug, might not even notice!)
>>
>>
>> Cost of Maintaining an API
>>
>> Of course, the above does not mean that we will never break any APIs. We must also consider the cost both to the project and to our users of keeping the API in question.
>>
>> Project Costs - Every API we have needs to be tested and needs to keep working as other parts of the project changes. These costs are significantly exacerbated when external dependencies change (the JVM, Scala, etc). In some cases, while not completely technically infeasible, the cost of maintaining a particular API can become too high.
>>
>> User Costs - APIs also have a cognitive cost to users learning Spark or trying to understand Spark programs. This cost becomes even higher when the API in question has confusing or undefined semantics.
>>
>>
>> Alternatives to Breaking an API
>>
>> In cases where there is a "Bad API", but where the cost of removal is also high, there are alternatives that should be considered that do not hurt existing users but do address some of the maintenance costs.
>>
>>
>> Avoid Bad APIs - While this is a bit obvious, it is an important point. Anytime we are adding a new interface to Spark we should consider that we might be stuck with this API forever. Think deeply about how new APIs relate to existing ones, as well as how you expect them to evolve over time.
>>
>> Deprecation Warnings - All deprecation warnings should point to a clear alternative and should never just say that an API is deprecated.
>>
>> Updated Docs - Documentation should point to the "best" recommended way of performing a given task. In the cases where we maintain legacy documentation, we should clearly point to newer APIs and suggest to users the "right" way.
>>
>> Community Work - Many people learn Spark by reading blogs and other sites such as StackOverflow. However, many of these resources are out of date. Update them, to reduce the cost of eventually removing deprecated APIs.
>>
>>
>> </new policy>

---------------------------------------------------------------------
To unsubscribe e-mail: dev-unsubscribe@spark.apache.org


Re: [VOTE] Amend Spark's Semantic Versioning Policy

Posted by Michael Armbrust <mi...@databricks.com>.
I'll start off the vote with a strong +1 (binding).

On Fri, Mar 6, 2020 at 1:01 PM Michael Armbrust <mi...@databricks.com>
wrote:

> I propose to add the following text to Spark's Semantic Versioning policy
> <https://spark.apache.org/versioning-policy.html> and adopt it as the
> rubric that should be used when deciding to break APIs (even at major
> versions such as 3.0).
>
>
> I'll leave the vote open until Tuesday, March 10th at 2pm. As this is a procedural
> vote <https://www.apache.org/foundation/voting.html>, the measure will
> pass if there are more favourable votes than unfavourable ones. PMC votes
> are binding, but the community is encouraged to add their voice to the
> discussion.
>
>
> [ ] +1 - Spark should adopt this policy.
>
> [ ] -1  - Spark should not adopt this policy.
>
>
> <new policy>
>
>
> Considerations When Breaking APIs
>
> The Spark project strives to avoid breaking APIs or silently changing
> behavior, even at major versions. While this is not always possible, the
> balance of the following factors should be considered before choosing to
> break an API.
>
> Cost of Breaking an API
>
> Breaking an API almost always has a non-trivial cost to the users of
> Spark. A broken API means that Spark programs need to be rewritten before
> they can be upgraded. However, there are a few considerations when thinking
> about what the cost will be:
>
>    -
>
>    Usage - an API that is actively used in many different places, is
>    always very costly to break. While it is hard to know usage for sure, there
>    are a bunch of ways that we can estimate:
>    -
>
>       How long has the API been in Spark?
>       -
>
>       Is the API common even for basic programs?
>       -
>
>       How often do we see recent questions in JIRA or mailing lists?
>       -
>
>       How often does it appear in StackOverflow or blogs?
>       -
>
>    Behavior after the break - How will a program that works today, work
>    after the break? The following are listed roughly in order of increasing
>    severity:
>    -
>
>       Will there be a compiler or linker error?
>       -
>
>       Will there be a runtime exception?
>       -
>
>       Will that exception happen after significant processing has been
>       done?
>       -
>
>       Will we silently return different answers? (very hard to debug,
>       might not even notice!)
>
>
> Cost of Maintaining an API
>
> Of course, the above does not mean that we will never break any APIs. We
> must also consider the cost both to the project and to our users of keeping
> the API in question.
>
>    -
>
>    Project Costs - Every API we have needs to be tested and needs to keep
>    working as other parts of the project changes. These costs are
>    significantly exacerbated when external dependencies change (the JVM,
>    Scala, etc). In some cases, while not completely technically infeasible,
>    the cost of maintaining a particular API can become too high.
>    -
>
>    User Costs - APIs also have a cognitive cost to users learning Spark
>    or trying to understand Spark programs. This cost becomes even higher when
>    the API in question has confusing or undefined semantics.
>
>
> Alternatives to Breaking an API
>
> In cases where there is a "Bad API", but where the cost of removal is also
> high, there are alternatives that should be considered that do not hurt
> existing users but do address some of the maintenance costs.
>
>
>    -
>
>    Avoid Bad APIs - While this is a bit obvious, it is an important
>    point. Anytime we are adding a new interface to Spark we should consider
>    that we might be stuck with this API forever. Think deeply about how
>    new APIs relate to existing ones, as well as how you expect them to evolve
>    over time.
>    -
>
>    Deprecation Warnings - All deprecation warnings should point to a
>    clear alternative and should never just say that an API is deprecated.
>    -
>
>    Updated Docs - Documentation should point to the "best" recommended
>    way of performing a given task. In the cases where we maintain legacy
>    documentation, we should clearly point to newer APIs and suggest to users
>    the "right" way.
>    -
>
>    Community Work - Many people learn Spark by reading blogs and other
>    sites such as StackOverflow. However, many of these resources are out of
>    date. Update them, to reduce the cost of eventually removing deprecated
>    APIs.
>
>
> </new policy>
>

Re: [VOTE] Amend Spark's Semantic Versioning Policy

Posted by Michael Armbrust <mi...@databricks.com>.
Thank you for the discussion, everyone! This vote passes. I'll work to get
this posted on the website.

+1
Michael Armbrust
Sean Owen
Jules Damji
大啊
Ismaël Mejía
Wenchen Fan
Matei Zaharia
Gengliang Wang
Takeshi Yamamuro
Denny Lee
Xiao Li
Xingbo Jiang
Takuya UESHIN
Michael Heuer
John Zhuge
Reynold Xin
Burak Yavuz
Holden Karau
Dongjoon Hyun

To respond to some of the questions on the interpretation of this policy:


> Also, can we expand on 'when' an API change can occur, since we are
> proposing to diverge from semver?
> Patch release? Minor release? Only major release? Based on the 'impact' of
> the API? Stability guarantees?


This is an addition to the existing semver policy. We still do not break
stable APIs at major versions.

This new policy has a good intention, but can we narrow this down to the
> migration from Apache Spark 2.4.5 to Apache Spark 3.0+?


I do not think that we should apply this policy to the 3.0 release any
differently than we will for future releases. There is nothing special
about 3.0 that means unnecessary breakages will not be costly to our users.

If I had to summarize the policy in one sentence, it would be "Think about
users before you break APIs!". As I mentioned in my original email, I think
in many cases this did not happen in the lead-up to this release. Rather,
the reasoning in some cases was that "This is a major release, so we can
break things".

Given that we all agree major versions are not sufficient justification to
break APIs, I think it's reasonable to revisit and discuss, on a
case-by-case basis, some of the more commonly used broken APIs in the
context of this rubric.

We had better be more careful when we add a new policy and should aim not
> to mislead users and 3rd-party library developers into thinking "older is
> better".


Nothing in the policy says "older is better". It only says that age is one
factor to consider when trying to reason about usage. If an API has been
around for a long time, it's possible (but not always true) that it will
have more usage than a newer API. If usage is low, and the cost to keep it
is high, get rid of it even if it's very old.

The policy also explicitly calls out updating docs to recommend the
new/"correct" way of doing things. If you can convince all the users of
Spark to switch, then you can remove any API you want in the future :)
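
To make the "clear alternative" point concrete, here is a minimal sketch in
Scala of a deprecation that names its replacement instead of only saying that
the API is deprecated. The API names here are hypothetical, not real Spark
APIs:

    object ExampleApi {
      /** New, recommended entry point; the docs and the warning below point here. */
      def loadTable(name: String): Seq[String] = Seq(name)

      /** Old entry point, kept for compatibility but steering callers to loadTable. */
      @deprecated("Use loadTable(name) instead; table(name) will be removed in a future release.", "3.0.0")
      def table(name: String): Seq[String] = loadTable(name)
    }

Compiling a call to ExampleApi.table then emits a warning that already tells
the user what to migrate to, which is the behavior the policy asks for.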

Is this only applying to stable APIs?


This is not explicitly called out, but I would argue you should still think
about users, even when breaking experimental APIs. The bar is certainly
lower here, since we explicitly called out that these APIs might change. That
said, I would still go through the exercise and decide whether the benefits
outweigh the costs before doing it. (Very similar to the discussion before
our 1.0, before any promises of stability had been made).
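
For reference, a rough sketch of how these stability levels are expressed as
annotations. I am assuming the Experimental, Evolving, and Stable annotations
from org.apache.spark.annotation (the spark-tags module) here; the class names
are made up for illustration, and the exact annotation set and its semantics
are defined by Spark itself:

    import org.apache.spark.annotation.{Evolving, Experimental, Stable}

    // Covered by the full compatibility expectations discussed in this thread.
    @Stable
    class StableReader

    // Expected to settle down eventually, but may still change between releases.
    @Evolving
    class EvolvingReader

    // Explicitly subject to change; the bar for breaking it is lower, though
    // the cost to existing users should still be weighed.
    @Experimental
    class ExperimentalReader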

The way I read this, the proposal isn't really saying we can't break APIs on
> major releases; it's just saying to spend more time making sure it's worth it.


I agree with this interpretation!

Michael

On Tue, Mar 10, 2020 at 10:59 AM Tom Graves <tg...@yahoo.com> wrote:

> Overall this makes sense to me, but I have the same questions as others on the thread.
>
> Is this only applying to stable APIs?
> How are we going to apply this to 3.0?
>
> The way I read this, the proposal isn't really saying we can't break APIs on
> major releases; it's just saying to spend more time making sure it's worth it.
> Tom
>
> On Friday, March 6, 2020, 08:59:03 PM CST, Michael Armbrust <
> michael@databricks.com> wrote:
>
>
> I propose to add the following text to Spark's Semantic Versioning policy
> <https://spark.apache.org/versioning-policy.html> and adopt it as the
> rubric that should be used when deciding to break APIs (even at major
> versions such as 3.0).
>
>
> I'll leave the vote open until Tuesday, March 10th at 2pm. As this is a procedural
> vote <https://www.apache.org/foundation/voting.html>, the measure will
> pass if there are more favourable votes than unfavourable ones. PMC votes
> are binding, but the community is encouraged to add their voice to the
> discussion.
>
>
> [ ] +1 - Spark should adopt this policy.
>
> [ ] -1  - Spark should not adopt this policy.
>
>
> <new policy>
>
>
> Considerations When Breaking APIs
>
> The Spark project strives to avoid breaking APIs or silently changing
> behavior, even at major versions. While this is not always possible, the
> balance of the following factors should be considered before choosing to
> break an API.
>
> Cost of Breaking an API
>
> Breaking an API almost always has a non-trivial cost to the users of
> Spark. A broken API means that Spark programs need to be rewritten before
> they can be upgraded. However, there are a few considerations when thinking
> about what the cost will be:
>
>    -
>
>    Usage - an API that is actively used in many different places, is
>    always very costly to break. While it is hard to know usage for sure, there
>    are a bunch of ways that we can estimate:
>    -
>
>       How long has the API been in Spark?
>       -
>
>       Is the API common even for basic programs?
>       -
>
>       How often do we see recent questions in JIRA or mailing lists?
>       -
>
>       How often does it appear in StackOverflow or blogs?
>       -
>
>    Behavior after the break - How will a program that works today, work
>    after the break? The following are listed roughly in order of increasing
>    severity:
>    -
>
>       Will there be a compiler or linker error?
>       -
>
>       Will there be a runtime exception?
>       -
>
>       Will that exception happen after significant processing has been
>       done?
>       -
>
>       Will we silently return different answers? (very hard to debug,
>       might not even notice!)
>
>
> Cost of Maintaining an API
>
> Of course, the above does not mean that we will never break any APIs. We
> must also consider the cost both to the project and to our users of keeping
> the API in question.
>
>    -
>
>    Project Costs - Every API we have needs to be tested and needs to keep
>    working as other parts of the project changes. These costs are
>    significantly exacerbated when external dependencies change (the JVM,
>    Scala, etc). In some cases, while not completely technically infeasible,
>    the cost of maintaining a particular API can become too high.
>    -
>
>    User Costs - APIs also have a cognitive cost to users learning Spark
>    or trying to understand Spark programs. This cost becomes even higher when
>    the API in question has confusing or undefined semantics.
>
>
> Alternatives to Breaking an API
>
> In cases where there is a "Bad API", but where the cost of removal is also
> high, there are alternatives that should be considered that do not hurt
> existing users but do address some of the maintenance costs.
>
>
>    -
>
>    Avoid Bad APIs - While this is a bit obvious, it is an important
>    point. Anytime we are adding a new interface to Spark we should consider
>    that we might be stuck with this API forever. Think deeply about how
>    new APIs relate to existing ones, as well as how you expect them to evolve
>    over time.
>    -
>
>    Deprecation Warnings - All deprecation warnings should point to a
>    clear alternative and should never just say that an API is deprecated.
>    -
>
>    Updated Docs - Documentation should point to the "best" recommended
>    way of performing a given task. In the cases where we maintain legacy
>    documentation, we should clearly point to newer APIs and suggest to users
>    the "right" way.
>    -
>
>    Community Work - Many people learn Spark by reading blogs and other
>    sites such as StackOverflow. However, many of these resources are out of
>    date. Update them, to reduce the cost of eventually removing deprecated
>    APIs.
>
>
> </new policy>
>

Re: [VOTE] Amend Spark's Semantic Versioning Policy

Posted by Tom Graves <tg...@yahoo.com.INVALID>.
Overall this makes sense to me, but I have the same questions as others on the thread.
Is this only applying to stable APIs? How are we going to apply this to 3.0?
The way I read this, the proposal isn't really saying we can't break APIs on major releases; it's just saying to spend more time making sure it's worth it.

Tom
    On Friday, March 6, 2020, 08:59:03 PM CST, Michael Armbrust <mi...@databricks.com> wrote:  
 
 
I propose to add the following text to Spark's Semantic Versioning policy and adopt it as the rubric that should be used when deciding to break APIs (even at major versions such as 3.0).




I'll leave the vote open until Tuesday, March 10th at 2pm. As this is a procedural vote, the measure will pass if there are more favourable votes than unfavourable ones. PMC votes are binding, but the community is encouraged to add their voice to the discussion.




[ ] +1 - Spark should adopt this policy.

[ ] -1  - Spark should not adopt this policy.




<new policy>




Considerations When Breaking APIs

The Spark project strives to avoid breaking APIs or silently changing behavior, even at major versions. While this is not always possible, the balance of the following factors should be considered before choosing to break an API.


Cost of Breaking an API

Breaking an API almost always has a non-trivial cost to the users of Spark. A broken API means that Spark programs need to be rewritten before they can be upgraded. However, there are a few considerations when thinking about what the cost will be:
   
   - Usage - an API that is actively used in many different places, is always very costly to break. While it is hard to know usage for sure, there are a bunch of ways that we can estimate:
      - How long has the API been in Spark?
      - Is the API common even for basic programs?
      - How often do we see recent questions in JIRA or mailing lists?
      - How often does it appear in StackOverflow or blogs?
   - Behavior after the break - How will a program that works today, work after the break? The following are listed roughly in order of increasing severity:
      - Will there be a compiler or linker error?
      - Will there be a runtime exception?
      - Will that exception happen after significant processing has been done?
      - Will we silently return different answers? (very hard to debug, might not even notice!)


Cost of Maintaining an API

Of course, the above does not mean that we will never break any APIs. We must also consider the cost both to the project and to our users of keeping the API in question.

   - Project Costs - Every API we have needs to be tested and needs to keep working as other parts of the project changes. These costs are significantly exacerbated when external dependencies change (the JVM, Scala, etc). In some cases, while not completely technically infeasible, the cost of maintaining a particular API can become too high.
   - User Costs - APIs also have a cognitive cost to users learning Spark or trying to understand Spark programs. This cost becomes even higher when the API in question has confusing or undefined semantics.


Alternatives to Breaking an API

In cases where there is a "Bad API", but where the cost of removal is also high, there are alternatives that should be considered that do not hurt existing users but do address some of the maintenance costs.

   - Avoid Bad APIs - While this is a bit obvious, it is an important point. Anytime we are adding a new interface to Spark we should consider that we might be stuck with this API forever. Think deeply about how new APIs relate to existing ones, as well as how you expect them to evolve over time.
   - Deprecation Warnings - All deprecation warnings should point to a clear alternative and should never just say that an API is deprecated.
   - Updated Docs - Documentation should point to the "best" recommended way of performing a given task. In the cases where we maintain legacy documentation, we should clearly point to newer APIs and suggest to users the "right" way.
   - Community Work - Many people learn Spark by reading blogs and other sites such as StackOverflow. However, many of these resources are out of date. Update them, to reduce the cost of eventually removing deprecated APIs.


</new policy>