Posted to dev@spark.apache.org by Dongjoon Hyun <do...@gmail.com> on 2020/03/06 05:08:22 UTC

Re: [Proposal] Modification to Spark's Semantic Versioning Policy

Hi, All.

There is an on-going PR from Xiao referencing this email.

https://github.com/apache/spark/pull/27821

Bests,
Dongjoon.

On Fri, Feb 28, 2020 at 11:20 AM Sean Owen <sr...@gmail.com> wrote:

> On Fri, Feb 28, 2020 at 12:03 PM Holden Karau <ho...@pigscanfly.ca>
> wrote:
> >>     1. Could you estimate how many revert commits are required in
> >> `branch-3.0` for the new rubric?
>
> Fair question about what actual change this implies for 3.0. So far it
> seems like some targeted, quite reasonable reverts. I don't think
> anyone's suggesting reverting loads of changes.
>
>
> >>     2. Are you going to revert all removed test cases for the
> >> deprecated ones?
> > This is a good point, making sure we keep the tests as well is important
> > (worse than removing a deprecated API is shipping it broken).
>
> (I'd say, yes of course! which seems consistent with what is happening now)
>
>
> >>     3. Does this cause any delay for the Apache Spark 3.0.0 release?
> >>         (I believe it was previously scheduled for June, before Spark
> >> Summit 2020)
> >
> > I think if we need to delay to make a better release this is ok,
> > especially given our current preview releases being available to gather
> > community feedback.
>
> Of course these things block 3.0 -- all the more reason to keep it
> specific and targeted -- but nothing so far seems inconsistent with
> finishing in a month or two.
>
>
> >> Although there was a discussion already, I want to make sure about the
> >> following tough parts.
> >>     4. We are not going to add Scala 2.11 APIs, right?
> > I hope not.
> >>
> >>     5. We are not going to support Python 2.x in Apache Spark 3.1+,
> >> right?
> > I think doing that would be bad; it's already end-of-lifed elsewhere.
>
> Yeah this is an important subtext -- the valuable principles here
> could be interpreted in many different ways depending on how much you
> weight the value of keeping APIs for compatibility against the value of
> simplifying Spark and pushing users to newer APIs more forcibly. They're all
> judgment calls, based on necessarily limited data about the universe
> of users. We can only go on rare direct user feedback, on feedback
> perhaps from vendors as proxies for a subset of users, and the general
> good faith judgment of committers who have lived Spark for years.
>
> My specific interpretation is that the standard is (correctly)
> tightening going forward, and retroactively a bit for 3.0. But, I do
> not think anyone is advocating for the logical extreme of, for
> example, maintaining Scala 2.11 compatibility indefinitely. I think
> that falls out readily from the rubric here: maintaining 2.11
> compatibility is really quite painful if you ever support 2.13 too,
> for example.
>

Re: [Proposal] Modification to Spark's Semantic Versioning Policy

Posted by Jungtaek Lim <ka...@gmail.com>.
Xiao, thanks for the proposal and willingness to lead the effort!

I feel that it's still a bit different from what I've proposed. What I'm
proposing is closer to enforcing discussion whenever a change introduces a
new public API or a breaking change. It's good that we added the section
"Does this PR introduce any user-facing change?" to the PR template (I'm
not 100% sure it's being used as intended), but it doesn't enforce
anything; PRs containing breaking changes are reviewed and merged the same
as any other PR. Technically such a PR can be merged within a couple of
hours, reviewed by only one committer, which doesn't seem to be enough to
decide it's good to go, IMHO.

I believe a regular digest would be one step forward, as someone could
notice a change and jump in for a post-hoc review. One thing I'm a bit
afraid of with post-hoc review is that it's not easy to raise concerns
about things that are already merged, especially if we have to revert. It
makes both sides defensive: people hesitate to do post-review, and authors
try to defend the change they already made. I'm a big +1 on taking one step
further, but given that we are revisiting the policy, it would be nice if
we also revisit the policy on changes to the public API.
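
(For illustration only, not an agreed mechanism: one way to make such a
policy enforced rather than merely advisory is a binary-compatibility gate
in CI. Spark already runs MiMa checks, so a rough sbt sketch could look
like the following; the plugin version, artifact coordinates, and the
excluded method are placeholders.)

    // project/plugins.sbt (sketch)
    addSbtPlugin("com.typesafe" % "sbt-mima-plugin" % "0.6.1")

    // build.sbt (sketch)
    import com.typesafe.tools.mima.core._

    // Compare the current build against the previously released artifacts.
    mimaPreviousArtifacts := Set("org.apache.spark" %% "spark-sql" % "2.4.5")

    // Every intentional break has to be listed explicitly, which gives
    // reviewers a natural place to require a pointer to a dev@ discussion.
    mimaBinaryIssueFilters ++= Seq(
      ProblemFilters.exclude[DirectMissingMethodProblem](
        "org.apache.spark.sql.functions.toDegrees"))

    // CI then runs `sbt mimaReportBinaryIssues`, which fails on any change
    // to the public binary API that is not covered by a filter.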


Re: [Proposal] Modification to Spark's Semantic Versioning Policy

Posted by Dongjoon Hyun <do...@gmail.com>.
Thank you all, especially for the audit efforts.

Until now, the whole community has been working together in the same
direction under the existing policy. That is always good.

Since it seems we are considering a new direction, I created an umbrella
JIRA to track all activities.

      https://issues.apache.org/jira/browse/SPARK-31085
      Amend Spark's Semantic Versioning Policy

As we know, a community-wide directional change always has a huge impact
on daily PR reviews and regular releases. So, we had better treat each
reverting PR as a normal, independent PR instead of a follow-up.
Specifically, I believe we need the following.

    1. Use new JIRA IDs instead of treating these as simple reverts or
follow-ups, because we are not adding everything back blindly. For example,
            https://issues.apache.org/jira/browse/SPARK-31089
            "Add back ImageSchema.readImages in Spark 3.0"
        was created and closed as 'Won't Do' after weighing the trade-offs.
        We need a JIRA-issue-level history of this kind of request and
decision.

    2. Sometimes, as described by Michael, reverting is insufficient.
        We need to provide more fine-grained deprecation for users'
safety, case by case.

    3. Given the timeline, a newly added API should have test coverage in
the same PR from the beginning.
        This is required because the whole reverting effort aims to give
back a working API (a sketch of such a test follows below).
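
(To illustrate item 3 -- a minimal, illustrative sketch, not the actual
Spark test suite: a re-added deprecated API such as functions.toDegrees can
be covered in the same PR by checking it against its replacement,
functions.degrees.)

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.functions
    import org.scalatest.funsuite.AnyFunSuite

    class ReAddedApiSuite extends AnyFunSuite {
      test("toDegrees matches its replacement, degrees") {
        val spark = SparkSession.builder().master("local[1]")
          .appName("re-added-api-check").getOrCreate()
        import spark.implicits._
        val df = Seq(0.0, math.Pi / 2, math.Pi).toDF("rad")
        // The deprecated alias and the current API must agree.
        val legacy = df.select(functions.toDegrees($"rad")).as[Double].collect().toSeq
        val current = df.select(functions.degrees($"rad")).as[Double].collect().toSeq
        assert(legacy == current)
        spark.stop()
      }
    }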

I believe we are having a good discussion in this thread.
We are making a big change in Apache Spark's history.
Please be part of it by replying, voting, and reviewing.

Thanks,
Dongjoon.



Re: [Proposal] Modification to Spark's Semantic Versioning Policy

Posted by Takeshi Yamamuro <li...@gmail.com>.
Yea, +1 on Jungtaek's suggestion; having the same strict policy for adding
new APIs looks nice.

> When we make API changes (e.g., adding new APIs or changing existing
> APIs), we should regularly publish them on the dev list. I am willing to
> lead this effort, work with my colleagues to summarize all the merged
> commits [especially the API changes], and then send the *bi-weekly
> digest* to the dev list.

This digest looks very helpful for the community, thanks, Xiao!

Bests,
Takeshi


-- 
---
Takeshi Yamamuro

Re: [Proposal] Modification to Spark's Semantic Versioning Policy

Posted by Xiao Li <ga...@gmail.com>.
I want to publicly thank *Ruifeng Zheng* for his work listing all the
signature differences in Core, SQL and Hive that we made in this upcoming
release. For details, please read the files attached to SPARK-30982
<https://issues.apache.org/jira/browse/SPARK-30982>. I went over these
files and submitted the following PRs to add back the SparkSQL APIs whose
maintenance costs are low, based on my own experience in SparkSQL
development (a sketch of what adding one back can look like follows the
list):

   - https://github.com/apache/spark/pull/27821
      - functions.toDegrees/toRadians
      - functions.approxCountDistinct
      - functions.monotonicallyIncreasingId
      - Column.!==
      - Dataset.explode
      - Dataset.registerTempTable
      - SQLContext.getOrCreate, setActive, clearActive, constructors
   - https://github.com/apache/spark/pull/27815
      - HiveContext
      - createExternalTable APIs
   - https://github.com/apache/spark/pull/27839
      - SQLContext.applySchema
      - SQLContext.parquetFile
      - SQLContext.jsonFile
      - SQLContext.jsonRDD
      - SQLContext.load
      - SQLContext.jdbc
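
(A sketch of what adding one of these back can look like -- illustrative
only, not the actual Spark source, where these methods live on
org.apache.spark.sql.functions itself: the API comes back as a thin
@deprecated forwarder to the current API, so existing user code keeps
compiling while the compiler still points people at the replacement.)

    import org.apache.spark.sql.{Column, functions}

    // Hypothetical compat object, for illustration only.
    object LegacyFunctions {
      @deprecated("Use degrees", "2.1.0")
      def toDegrees(e: Column): Column = functions.degrees(e)

      @deprecated("Use approx_count_distinct", "2.1.0")
      def approxCountDistinct(e: Column): Column =
        functions.approx_count_distinct(e)
    }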

If you think these APIs should not be added back, let me know and we can
discuss the items further. In general, I think we should provide more
evidence and discuss these APIs publicly before dropping them in the first
place.

+1 on Jungtaek's comments. When we make API changes (e.g., adding new APIs
or changing existing APIs), we should regularly publish them on the dev
list. I am willing to lead this effort, work with my colleagues to
summarize all the merged commits [especially the API changes], and then
send a *bi-weekly digest* to the dev list. If you are willing to join this
working group and help build these digests, feel free to send me a note
[lixiao@databricks.com].
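
(A rough sketch of how such a digest could be assembled -- the path
filters and the two-week window below are assumptions, not an agreed
process.)

    import scala.sys.process._

    object ApiChangeDigest {
      def main(args: Array[String]): Unit = {
        // Files that define user-facing APIs; extend the list as needed.
        val apiPaths = Seq(
          "sql/core/src/main/scala/org/apache/spark/sql/functions.scala",
          "sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala")
        // Merged commits from the last two weeks that touch those files.
        val cmd = Seq("git", "log", "--since=2.weeks", "--oneline", "--") ++ apiPaths
        println("API-related commits in the last two weeks:")
        println(cmd.!!)
      }
    }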

Cheers,

Xiao





Re: [Proposal] Modification to Spark's Semantic Versioning Policy

Posted by Jungtaek Lim <ka...@gmail.com>.
+1 for Sean as well.

Moreover, as I voiced on the previous thread, if we want to be strict
about retaining public APIs, what we really need to do along with this is
to have a similarly strict (or stricter) policy for adding public APIs. If
we don't apply the policy symmetrically, the problem gets worse: it's
still not that hard to add a public API (it only requires a normal
review), but once the API is added and released it's going to be really
hard to remove.

If we consider adding and deprecating/removing public APIs as "critical"
changes for the project, IMHO it would give better visibility and a more
open discussion if we made them go through the dev@ mailing list instead
of directly filing a PR. As there are so many PRs being submitted, it's
nearly impossible to look into all of them - it would require us to
"watch" the repo and receive tons of mail. Compared to the volume of
GitHub PRs, the dev@ mailing list is not that crowded, so there is less
chance of missing critical changes, and they are not quickly decided by
only a couple of committers.

These suggestions would slow down development - which may make us realize
we want to "classify/mark" user-facing public APIs separately from others
(those merely exposed as public) and apply all the policies only to the
former; for the latter we don't need to guarantee anything.
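
(A sketch of the "classify/mark" idea -- the annotation names below are
illustrative, not a proposal for concrete names; Spark already ships
related markers such as @DeveloperApi and @Experimental in
org.apache.spark.annotation.)

    import scala.annotation.StaticAnnotation

    /** Covered by the versioning policy: removal requires deprecation
     *  first and, per this thread, a dev@ discussion. */
    class UserFacing extends StaticAnnotation

    /** Public only for implementation reasons: no compatibility guarantee. */
    class InternalApi extends StaticAnnotation

    // Usage on a hypothetical helper that happens to be public:
    object Cleanups {
      @InternalApi
      def pruneTempViews(): Unit = ()
    }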



Re: [Proposal] Modification to Spark's Semantic Versioning Policy

Posted by Dongjoon Hyun <do...@gmail.com>.
+1 for Sean's concerns and questions.

Bests,
Dongjoon.

On Fri, Mar 6, 2020 at 3:14 PM Sean Owen <sr...@gmail.com> wrote:

> This thread established some good general principles, illustrated by a few
> good examples. It didn't draw specific conclusions about what to add back,
> which is why it wasn't at all controversial. What it means in specific
> cases is where there may be disagreement, and that harder question hasn't
> been addressed.
>
> The reverts I have seen so far seemed like the obvious ones, but yes, there
> are several more going on now, some pretty broad. I am not even sure what
> all of them are. In addition to below,
> https://github.com/apache/spark/pull/27839. Would it be too much overhead
> to post to this thread any changes that one believes are endorsed by these
> principles and perhaps a more strict interpretation of them now? It's
> important enough that we should get any data points or input now. (We're
> obviously not going to debate each one.) A draft PR, or several, actually
> sounds like a good vehicle for that -- as long as people know about them!
>
> Also, is there any usage data available to share? Many arguments turn
> on 'commonly used', but can we know that more concretely?
>
> Otherwise I think we'll back into implementing personal interpretations of
> general principles, which is arguably the issue in the first place, even
> when everyone believes in good faith in the same principles.
>
>
>
> On Fri, Mar 6, 2020 at 1:08 PM Dongjoon Hyun <do...@gmail.com>
> wrote:
>
>> Hi, All.
>>
>> Recently, reverting PRs seem to have started spreading like the *well-known*
>> virus.
>> Can we finalize this first, before making unofficial personal decisions?
>> Technically, this thread was not a vote, and our website doesn't have a
>> clear policy yet.
>>
>> https://github.com/apache/spark/pull/27821
>> [SPARK-25908][SQL][FOLLOW-UP] Add Back Multiple Removed APIs
>>     ==> This technically reverts most of SPARK-25908.
>>
>> https://github.com/apache/spark/pull/27835
>> Revert "[SPARK-25457][SQL] IntegralDivide returns data type of the
>> operands"
>>
>> https://github.com/apache/spark/pull/27834
>> Revert [SPARK-24640][SQL] Return `NULL` from `size(NULL)` by default
>>
>> Bests,
>> Dongjoon.
>>

Re: [Proposal] Modification to Spark's Semantic Versioning Policy

Posted by Sean Owen <sr...@gmail.com>.
This thread established some good general principles, illustrated by a few
good examples. It didn't draw specific conclusions about what to add back,
which is why it wasn't at all controversial. What it means in specific
cases is where there may be disagreement, and that harder question hasn't
been addressed.

The reverts I have seen so far seemed like obvious ones, but yes, there
are several more going on now, some pretty broad. I am not even sure what
all of them are. In addition to those below, there is
https://github.com/apache/spark/pull/27839. Would it be too much overhead
to post to this thread any changes that one believes are endorsed by these
principles, and perhaps a stricter interpretation of them, now? It's
important enough that we should get any data points or input now. (We're
obviously not going to debate each one.) A draft PR, or several, actually
sounds like a good vehicle for that -- as long as people know about them!

Also, is there any usage data available to share? Many arguments turn
on 'commonly used', but can we know that more concretely?

Otherwise I think we'll back into implementing personal interpretations of
general principles, which is arguably the issue in the first place, even
when everyone believes in good faith in the same principles.



On Fri, Mar 6, 2020 at 1:08 PM Dongjoon Hyun <do...@gmail.com>
wrote:

> Hi, All.
>
> Recently, reverting PRs seems to be spreading like the *well-known*
> virus.
> Can we finalize this first, before making unofficial personal decisions?
> Technically, this thread was not a vote, and our website doesn't have a
> clear policy yet.
>
> https://github.com/apache/spark/pull/27821
> [SPARK-25908][SQL][FOLLOW-UP] Add Back Multiple Removed APIs
>     ==> This technically reverts most of SPARK-25908.
>
> https://github.com/apache/spark/pull/27835
> Revert "[SPARK-25457][SQL] IntegralDivide returns data type of the
> operands"
>
> https://github.com/apache/spark/pull/27834
> Revert [SPARK-24640][SQL] Return `NULL` from `size(NULL)` by default
>
> Bests,
> Dongjoon.
>
> On Thu, Mar 5, 2020 at 9:08 PM Dongjoon Hyun <do...@gmail.com>
> wrote:
>
>> Hi, All.
>>
>> There is a on-going Xiao's PR referencing this email.
>>
>> https://github.com/apache/spark/pull/27821
>>
>> Bests,
>> Dongjoon.
>>
>> On Fri, Feb 28, 2020 at 11:20 AM Sean Owen <sr...@gmail.com> wrote:
>>
>>> On Fri, Feb 28, 2020 at 12:03 PM Holden Karau <ho...@pigscanfly.ca>
>>> wrote:
>>> >>     1. Could you estimate how many revert commits are required in
>>> `branch-3.0` for new rubric?
>>>
>>> Fair question about what actual change this implies for 3.0? so far it
>>> seems like some targeted, quite reasonable reverts. I don't think
>>> anyone's suggesting reverting loads of changes.
>>>
>>>
>>> >>     2. Are you going to revert all removed test cases for the
>>> deprecated ones?
>>> > This is a good point, making sure we keep the tests as well is
>>> important (worse than removing a deprecated API is shipping it broken),.
>>>
>>> (I'd say, yes of course! which seems consistent with what is happening
>>> now)
>>>
>>>
>>> >>     3. Does it make any delay for Apache Spark 3.0.0 release?
>>> >>         (I believe it was previously scheduled on June before Spark
>>> Summit 2020)
>>> >
>>> > I think if we need to delay to make a better release this is ok,
>>> especially given our current preview releases being available to gather
>>> community feedback.
>>>
>>> Of course these things block 3.0 -- all the more reason to keep it
>>> specific and targeted -- but nothing so far seems inconsistent with
>>> finishing in a month or two.
>>>
>>>
>>> >> Although there was a discussion already, I want to make the following
>>> tough parts sure.
>>> >>     4. We are not going to add Scala 2.11 API, right?
>>> > I hope not.
>>> >>
>>> >>     5. We are not going to support Python 2.x in Apache Spark 3.1+,
>>> right?
>>> > I think doing that would be bad, it's already end of lifed elsewhere.
>>>
>>> Yeah this is an important subtext -- the valuable principles here
>>> could be interpreted in many different ways depending on how much you
>>> weight value of keeping APIs for compatibility vs value in simplifying
>>> Spark and pushing users to newer APIs more forcibly. They're all
>>> judgment calls, based on necessarily limited data about the universe
>>> of users. We can only go on rare direct user feedback, on feedback
>>> perhaps from vendors as proxies for a subset of users, and the general
>>> good faith judgment of committers who have lived Spark for years.
>>>
>>> My specific interpretation is that the standard is (correctly)
>>> tightening going forward, and retroactively a bit for 3.0. But, I do
>>> not think anyone is advocating for the logical extreme of, for
>>> example, maintaining Scala 2.11 compatibility indefinitely. I think
>>> that falls out readily from the rubric here: maintaining 2.11
>>> compatibility is really quite painful if you ever support 2.13 too,
>>> for example.
>>>
>>

Re: [Proposal] Modification to Spark's Semantic Versioning Policy

Posted by Dongjoon Hyun <do...@gmail.com>.
Hi, All.

Recently, reverting PRs seems to be spreading like the *well-known*
virus.
Can we finalize this first, before making unofficial personal decisions?
Technically, this thread was not a vote, and our website doesn't have a
clear policy yet.

https://github.com/apache/spark/pull/27821
[SPARK-25908][SQL][FOLLOW-UP] Add Back Multiple Removed APIs
    ==> This technically reverts most of SPARK-25908.

https://github.com/apache/spark/pull/27835
Revert "[SPARK-25457][SQL] IntegralDivide returns data type of the operands"

https://github.com/apache/spark/pull/27834
Revert [SPARK-24640][SQL] Return `NULL` from `size(NULL)` by default
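
For anyone skimming this thread who is not familiar with the last two
changes, here is a minimal spark-shell sketch of the user-visible
difference. It is illustrative only: it assumes the pre-change and
post-change semantics described in the JIRA titles, and the defaults that
ultimately ship in 3.0.0 may differ.

    // Illustrative sketch only; assumes a local SparkSession.
    import org.apache.spark.sql.SparkSession
    val spark = SparkSession.builder().appName("sketch").master("local[*]").getOrCreate()

    // SPARK-24640: what does size() return for a NULL collection?
    spark.sql("SELECT size(CAST(NULL AS ARRAY<INT>)) AS s").show()
    // legacy behavior:      -1
    // SPARK-24640 behavior: null

    // SPARK-25457: what type does integral division (div) return?
    spark.sql("SELECT 4 div 2 AS q").printSchema()
    // legacy behavior:      q: long    (always LONG)
    // SPARK-25457 behavior: q: integer (type of the operands)

Whichever defaults win, the sketch is only meant to make the user-visible
difference concrete, not to argue for either side.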

Bests,
Dongjoon.

On Thu, Mar 5, 2020 at 9:08 PM Dongjoon Hyun <do...@gmail.com>
wrote:

> Hi, All.
>
> There is a on-going Xiao's PR referencing this email.
>
> https://github.com/apache/spark/pull/27821
>
> Bests,
> Dongjoon.
>
> On Fri, Feb 28, 2020 at 11:20 AM Sean Owen <sr...@gmail.com> wrote:
>
>> On Fri, Feb 28, 2020 at 12:03 PM Holden Karau <ho...@pigscanfly.ca>
>> wrote:
>> >>     1. Could you estimate how many revert commits are required in
>> `branch-3.0` for new rubric?
>>
>> Fair question about what actual change this implies for 3.0? so far it
>> seems like some targeted, quite reasonable reverts. I don't think
>> anyone's suggesting reverting loads of changes.
>>
>>
>> >>     2. Are you going to revert all removed test cases for the
>> deprecated ones?
>> > This is a good point, making sure we keep the tests as well is
>> important (worse than removing a deprecated API is shipping it broken),.
>>
>> (I'd say, yes of course! which seems consistent with what is happening
>> now)
>>
>>
>> >>     3. Does it make any delay for Apache Spark 3.0.0 release?
>> >>         (I believe it was previously scheduled on June before Spark
>> Summit 2020)
>> >
>> > I think if we need to delay to make a better release this is ok,
>> especially given our current preview releases being available to gather
>> community feedback.
>>
>> Of course these things block 3.0 -- all the more reason to keep it
>> specific and targeted -- but nothing so far seems inconsistent with
>> finishing in a month or two.
>>
>>
>> >> Although there was a discussion already, I want to make the following
>> tough parts sure.
>> >>     4. We are not going to add Scala 2.11 API, right?
>> > I hope not.
>> >>
>> >>     5. We are not going to support Python 2.x in Apache Spark 3.1+,
>> right?
>> > I think doing that would be bad, it's already end of lifed elsewhere.
>>
>> Yeah this is an important subtext -- the valuable principles here
>> could be interpreted in many different ways depending on how much you
>> weight value of keeping APIs for compatibility vs value in simplifying
>> Spark and pushing users to newer APIs more forcibly. They're all
>> judgment calls, based on necessarily limited data about the universe
>> of users. We can only go on rare direct user feedback, on feedback
>> perhaps from vendors as proxies for a subset of users, and the general
>> good faith judgment of committers who have lived Spark for years.
>>
>> My specific interpretation is that the standard is (correctly)
>> tightening going forward, and retroactively a bit for 3.0. But, I do
>> not think anyone is advocating for the logical extreme of, for
>> example, maintaining Scala 2.11 compatibility indefinitely. I think
>> that falls out readily from the rubric here: maintaining 2.11
>> compatibility is really quite painful if you ever support 2.13 too,
>> for example.
>>
>