You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@iceberg.apache.org by Kyle Bendickson <ky...@tabular.io> on 2021/11/03 23:13:20 UTC

Standard practices around PRs against multiple Spark versions

I submitted a PR to fix a Spark bug today, applying the same changes to all
eligible Spark versions.

Jack mentioned that he thought the practice going forward was to fix /
apply changes on the latest Spark version in one PR, and then open a second
PR to backport the fixes (presumably to minimize review overhead).

Do we have a standard / preference on that? Jack mentioned he wasn't
certain, so I thought I'd ask here.

Seems like a good practice but hoping to get some clarification :)

-- 
Best,
Kyle Bendickson
Github: @kbendick

Re: Standard practices around PRs against multiple Spark versions

Posted by Kyle Bendickson <ky...@tabular.io>.
+1 as well. I too have found most places that have a bigger PR template,
its utility goes down as it gets larger.

And smaller patches are typically how we get new contributors. Or they want
to submit one patch in an area that some of the regular contributors might
not work with much, but they don’t otherwise have that much time to be
active contributors.

The template should be friendly for new people who possibly just want to
fix one thing that affects them - often those patches are very valuable.

One thing that might be nice in the future would be having a bot that opens
issues for the backport, possibly via slash commands. So the committer who
merged it in can possibly put `/icebergbot backport issue spark 3.1, 3.0`.

I know that’s likely more work than we have bandwidth for at present, but
putting the idea out as we work through the issues surrounding slicing our
tooling for many versions.

- Kyle

On Fri, Nov 5, 2021 at 2:59 PM Jack Ye <ye...@gmail.com> wrote:

> definitely +1 for having documentations to point to instead of laying
> everything out in the PR template!
>
> -Jack
>
> On Fri, Nov 5, 2021 at 2:55 PM Ryan Blue <bl...@tabular.io> wrote:
>
>> I agree with Jack about keeping changes targeted at one Spark version.
>> That makes it possible to revert changes to one Spark version rather than
>> manually rolling back. And it also keeps PRs smaller and more focused. I
>> think it wastes more contributor time to make updates to a PR in 3 or 4
>> places and keep them in sync.
>>
>> For the template for issues and PRs, I'm okay with that, but let's not
>> get carried away. Let's document style, standards, and practices and link
>> to that doc rather than starting off with a wall of text. The templates for
>> other projects, like Parquet, have so much content that isn't useful that I
>> think they're encouraging people with small commits to abandon the effort.
>>
>> Ryan
>>
>> On Fri, Nov 5, 2021 at 1:06 PM Kyle Bendickson <ky...@tabular.io> wrote:
>>
>>> I like Yufei's idea of adding a template.
>>>
>>> As per the concern about making changes separate and having things fall
>>> through the cracks, I do share that concern.
>>>
>>> On a related note though, I've already observed that a few times with
>>> simple things that maybe people were working on before the split up. These
>>> will go away with time, though could happen in the future when we add a new
>>> version.
>>>
>>> I agree a PR template would go a long way in ensuring that people are at
>>> least aware.
>>>
>>> It would be great if more folks who maintain forks could chime in as
>>> well, as this has ramifications for that as well.
>>>
>>> But overall, especially for larger PRs, I think it's a great idea.
>>>
>>> - Kyle
>>>
>>> On Fri, Nov 5, 2021 at 10:34 AM Yufei Gu <fl...@gmail.com> wrote:
>>>
>>>> It is a good practice to create a template for Github Issues and PRs
>>>> <https://docs.github.com/en/communities/using-templates-to-encourage-useful-issues-and-pull-requests/creating-a-pull-request-template-for-your-repository>. It
>>>> will make a PR or issue easier to read overall. We can also enforce/suggest
>>>> properties like affected versions and fixed versions.
>>>>
>>>> Best,
>>>> Yufei
>>>>
>>>>
>>>> On Wed, Nov 3, 2021 at 6:02 PM Wing Yew Poon
>>>> <wy...@cloudera.com.invalid> wrote:
>>>>
>>>>> I wasn't aware that we were standardizing on such a practice. I don't
>>>>> have a strong opinion on making changes one Spark version at a time or all
>>>>> at once. I think committers who do reviews regularly should decide. My only
>>>>> concern with making changes one version at a time is follow-through on the
>>>>> part of the contributor. We want to ensure that a change/fix applicable to
>>>>> multiple versions gets into all of them. Reviewer bandwidth is incurred
>>>>> either way. (For simple changes/fixes, perhaps all at once does save
>>>>> reviewer bandwidth, so we may want to be flexible.)
>>>>>
>>>>> On Wed, Nov 3, 2021 at 5:52 PM Jack Ye <ye...@gmail.com> wrote:
>>>>>
>>>>>> Thanks for bringing this up Kyle!
>>>>>>
>>>>>> My personal view is the following:
>>>>>> 1. For new features, it should be very clear that we always implement
>>>>>> them against the latest version. At the same time, I suggest we create an
>>>>>> issue to track backport, so that if anyone is interested in backport he/she
>>>>>> can work on it separately. We can tag these issues based on title names
>>>>>> (e.g. "Backport: xxx" as title), and these are also good issues for new
>>>>>> contributors to work on because there is already reference implementation
>>>>>> in a newer version.
>>>>>> 2. For bug fixes, I understand sometimes it's just a one line fix and
>>>>>> people will try to just fix across versions. My take is that we should try
>>>>>> to advocate for fixing 1 version and open an issue for other versions
>>>>>> although it does not really need to be enforced as strictly. Sometimes even
>>>>>> a 1 line fix has serious implications for different versions and might
>>>>>> break stuff unintentionally. it's better that people that have production
>>>>>> dependency on the specific version carefully review and test changes before
>>>>>> merging.
>>>>>>
>>>>>> About enforcement strategy, I suggest we start to create a template
>>>>>> for Github Issues and PRs
>>>>>> <https://docs.github.com/en/communities/using-templates-to-encourage-useful-issues-and-pull-requests/creating-a-pull-request-template-for-your-repository>,
>>>>>> where we state the guidelines related to engine versions, as well as
>>>>>> Iceberg's preferred code style, naming convention, title convention, etc.
>>>>>> to make new contributors a bit easier to submit changes without too much
>>>>>> rewrite. Currently I observe that every time there is a new contributor, we
>>>>>> need to state all the guidelines through PR review, which causes quite a
>>>>>> lot of time spent on rewriting the code and also reduces the motivation for
>>>>>> people to continue work on the PR.
>>>>>>
>>>>>> Best,
>>>>>> Jack Ye
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>> On Wed, Nov 3, 2021 at 4:13 PM Kyle Bendickson <ky...@tabular.io>
>>>>>> wrote:
>>>>>>
>>>>>>> I submitted a PR to fix a Spark bug today, applying the same changes
>>>>>>> to all eligible Spark versions.
>>>>>>>
>>>>>>> Jack mentioned that he thought the practice going forward was to fix
>>>>>>> / apply changes on the latest Spark version in one PR, and then open a
>>>>>>> second PR to backport the fixes (presumably to minimize review overhead).
>>>>>>>
>>>>>>> Do we have a standard / preference on that? Jack mentioned he wasn't
>>>>>>> certain, so I thought I'd ask here.
>>>>>>>
>>>>>>> Seems like a good practice but hoping to get some clarification :)
>>>>>>>
>>>>>>> --
>>>>>>> Best,
>>>>>>> Kyle Bendickson
>>>>>>> Github: @kbendick
>>>>>>>
>>>>>>
>>
>> --
>> Ryan Blue
>> Tabular
>>
>

Re: Standard practices around PRs against multiple Spark versions

Posted by Jack Ye <ye...@gmail.com>.
definitely +1 for having documentations to point to instead of laying
everything out in the PR template!

-Jack

On Fri, Nov 5, 2021 at 2:55 PM Ryan Blue <bl...@tabular.io> wrote:

> I agree with Jack about keeping changes targeted at one Spark version.
> That makes it possible to revert changes to one Spark version rather than
> manually rolling back. And it also keeps PRs smaller and more focused. I
> think it wastes more contributor time to make updates to a PR in 3 or 4
> places and keep them in sync.
>
> For the template for issues and PRs, I'm okay with that, but let's not get
> carried away. Let's document style, standards, and practices and link to
> that doc rather than starting off with a wall of text. The templates for
> other projects, like Parquet, have so much content that isn't useful that I
> think they're encouraging people with small commits to abandon the effort.
>
> Ryan
>
> On Fri, Nov 5, 2021 at 1:06 PM Kyle Bendickson <ky...@tabular.io> wrote:
>
>> I like Yufei's idea of adding a template.
>>
>> As per the concern about making changes separate and having things fall
>> through the cracks, I do share that concern.
>>
>> On a related note though, I've already observed that a few times with
>> simple things that maybe people were working on before the split up. These
>> will go away with time, though could happen in the future when we add a new
>> version.
>>
>> I agree a PR template would go a long way in ensuring that people are at
>> least aware.
>>
>> It would be great if more folks who maintain forks could chime in as
>> well, as this has ramifications for that as well.
>>
>> But overall, especially for larger PRs, I think it's a great idea.
>>
>> - Kyle
>>
>> On Fri, Nov 5, 2021 at 10:34 AM Yufei Gu <fl...@gmail.com> wrote:
>>
>>> It is a good practice to create a template for Github Issues and PRs
>>> <https://docs.github.com/en/communities/using-templates-to-encourage-useful-issues-and-pull-requests/creating-a-pull-request-template-for-your-repository>. It
>>> will make a PR or issue easier to read overall. We can also enforce/suggest
>>> properties like affected versions and fixed versions.
>>>
>>> Best,
>>> Yufei
>>>
>>>
>>> On Wed, Nov 3, 2021 at 6:02 PM Wing Yew Poon <wy...@cloudera.com.invalid>
>>> wrote:
>>>
>>>> I wasn't aware that we were standardizing on such a practice. I don't
>>>> have a strong opinion on making changes one Spark version at a time or all
>>>> at once. I think committers who do reviews regularly should decide. My only
>>>> concern with making changes one version at a time is follow-through on the
>>>> part of the contributor. We want to ensure that a change/fix applicable to
>>>> multiple versions gets into all of them. Reviewer bandwidth is incurred
>>>> either way. (For simple changes/fixes, perhaps all at once does save
>>>> reviewer bandwidth, so we may want to be flexible.)
>>>>
>>>> On Wed, Nov 3, 2021 at 5:52 PM Jack Ye <ye...@gmail.com> wrote:
>>>>
>>>>> Thanks for bringing this up Kyle!
>>>>>
>>>>> My personal view is the following:
>>>>> 1. For new features, it should be very clear that we always implement
>>>>> them against the latest version. At the same time, I suggest we create an
>>>>> issue to track backport, so that if anyone is interested in backport he/she
>>>>> can work on it separately. We can tag these issues based on title names
>>>>> (e.g. "Backport: xxx" as title), and these are also good issues for new
>>>>> contributors to work on because there is already reference implementation
>>>>> in a newer version.
>>>>> 2. For bug fixes, I understand sometimes it's just a one line fix and
>>>>> people will try to just fix across versions. My take is that we should try
>>>>> to advocate for fixing 1 version and open an issue for other versions
>>>>> although it does not really need to be enforced as strictly. Sometimes even
>>>>> a 1 line fix has serious implications for different versions and might
>>>>> break stuff unintentionally. it's better that people that have production
>>>>> dependency on the specific version carefully review and test changes before
>>>>> merging.
>>>>>
>>>>> About enforcement strategy, I suggest we start to create a template
>>>>> for Github Issues and PRs
>>>>> <https://docs.github.com/en/communities/using-templates-to-encourage-useful-issues-and-pull-requests/creating-a-pull-request-template-for-your-repository>,
>>>>> where we state the guidelines related to engine versions, as well as
>>>>> Iceberg's preferred code style, naming convention, title convention, etc.
>>>>> to make new contributors a bit easier to submit changes without too much
>>>>> rewrite. Currently I observe that every time there is a new contributor, we
>>>>> need to state all the guidelines through PR review, which causes quite a
>>>>> lot of time spent on rewriting the code and also reduces the motivation for
>>>>> people to continue work on the PR.
>>>>>
>>>>> Best,
>>>>> Jack Ye
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>> On Wed, Nov 3, 2021 at 4:13 PM Kyle Bendickson <ky...@tabular.io>
>>>>> wrote:
>>>>>
>>>>>> I submitted a PR to fix a Spark bug today, applying the same changes
>>>>>> to all eligible Spark versions.
>>>>>>
>>>>>> Jack mentioned that he thought the practice going forward was to fix
>>>>>> / apply changes on the latest Spark version in one PR, and then open a
>>>>>> second PR to backport the fixes (presumably to minimize review overhead).
>>>>>>
>>>>>> Do we have a standard / preference on that? Jack mentioned he wasn't
>>>>>> certain, so I thought I'd ask here.
>>>>>>
>>>>>> Seems like a good practice but hoping to get some clarification :)
>>>>>>
>>>>>> --
>>>>>> Best,
>>>>>> Kyle Bendickson
>>>>>> Github: @kbendick
>>>>>>
>>>>>
>
> --
> Ryan Blue
> Tabular
>

Re: Standard practices around PRs against multiple Spark versions

Posted by Ryan Blue <bl...@tabular.io>.
I agree with Jack about keeping changes targeted at one Spark version. That
makes it possible to revert changes to one Spark version rather than
manually rolling back. And it also keeps PRs smaller and more focused. I
think it wastes more contributor time to make updates to a PR in 3 or 4
places and keep them in sync.

For the template for issues and PRs, I'm okay with that, but let's not get
carried away. Let's document style, standards, and practices and link to
that doc rather than starting off with a wall of text. The templates for
other projects, like Parquet, have so much content that isn't useful that I
think they're encouraging people with small commits to abandon the effort.

Ryan

On Fri, Nov 5, 2021 at 1:06 PM Kyle Bendickson <ky...@tabular.io> wrote:

> I like Yufei's idea of adding a template.
>
> As per the concern about making changes separate and having things fall
> through the cracks, I do share that concern.
>
> On a related note though, I've already observed that a few times with
> simple things that maybe people were working on before the split up. These
> will go away with time, though could happen in the future when we add a new
> version.
>
> I agree a PR template would go a long way in ensuring that people are at
> least aware.
>
> It would be great if more folks who maintain forks could chime in as well,
> as this has ramifications for that as well.
>
> But overall, especially for larger PRs, I think it's a great idea.
>
> - Kyle
>
> On Fri, Nov 5, 2021 at 10:34 AM Yufei Gu <fl...@gmail.com> wrote:
>
>> It is a good practice to create a template for Github Issues and PRs
>> <https://docs.github.com/en/communities/using-templates-to-encourage-useful-issues-and-pull-requests/creating-a-pull-request-template-for-your-repository>. It
>> will make a PR or issue easier to read overall. We can also enforce/suggest
>> properties like affected versions and fixed versions.
>>
>> Best,
>> Yufei
>>
>>
>> On Wed, Nov 3, 2021 at 6:02 PM Wing Yew Poon <wy...@cloudera.com.invalid>
>> wrote:
>>
>>> I wasn't aware that we were standardizing on such a practice. I don't
>>> have a strong opinion on making changes one Spark version at a time or all
>>> at once. I think committers who do reviews regularly should decide. My only
>>> concern with making changes one version at a time is follow-through on the
>>> part of the contributor. We want to ensure that a change/fix applicable to
>>> multiple versions gets into all of them. Reviewer bandwidth is incurred
>>> either way. (For simple changes/fixes, perhaps all at once does save
>>> reviewer bandwidth, so we may want to be flexible.)
>>>
>>> On Wed, Nov 3, 2021 at 5:52 PM Jack Ye <ye...@gmail.com> wrote:
>>>
>>>> Thanks for bringing this up Kyle!
>>>>
>>>> My personal view is the following:
>>>> 1. For new features, it should be very clear that we always implement
>>>> them against the latest version. At the same time, I suggest we create an
>>>> issue to track backport, so that if anyone is interested in backport he/she
>>>> can work on it separately. We can tag these issues based on title names
>>>> (e.g. "Backport: xxx" as title), and these are also good issues for new
>>>> contributors to work on because there is already reference implementation
>>>> in a newer version.
>>>> 2. For bug fixes, I understand sometimes it's just a one line fix and
>>>> people will try to just fix across versions. My take is that we should try
>>>> to advocate for fixing 1 version and open an issue for other versions
>>>> although it does not really need to be enforced as strictly. Sometimes even
>>>> a 1 line fix has serious implications for different versions and might
>>>> break stuff unintentionally. it's better that people that have production
>>>> dependency on the specific version carefully review and test changes before
>>>> merging.
>>>>
>>>> About enforcement strategy, I suggest we start to create a template
>>>> for Github Issues and PRs
>>>> <https://docs.github.com/en/communities/using-templates-to-encourage-useful-issues-and-pull-requests/creating-a-pull-request-template-for-your-repository>,
>>>> where we state the guidelines related to engine versions, as well as
>>>> Iceberg's preferred code style, naming convention, title convention, etc.
>>>> to make new contributors a bit easier to submit changes without too much
>>>> rewrite. Currently I observe that every time there is a new contributor, we
>>>> need to state all the guidelines through PR review, which causes quite a
>>>> lot of time spent on rewriting the code and also reduces the motivation for
>>>> people to continue work on the PR.
>>>>
>>>> Best,
>>>> Jack Ye
>>>>
>>>>
>>>>
>>>>
>>>>
>>>> On Wed, Nov 3, 2021 at 4:13 PM Kyle Bendickson <ky...@tabular.io> wrote:
>>>>
>>>>> I submitted a PR to fix a Spark bug today, applying the same changes
>>>>> to all eligible Spark versions.
>>>>>
>>>>> Jack mentioned that he thought the practice going forward was to fix /
>>>>> apply changes on the latest Spark version in one PR, and then open a second
>>>>> PR to backport the fixes (presumably to minimize review overhead).
>>>>>
>>>>> Do we have a standard / preference on that? Jack mentioned he wasn't
>>>>> certain, so I thought I'd ask here.
>>>>>
>>>>> Seems like a good practice but hoping to get some clarification :)
>>>>>
>>>>> --
>>>>> Best,
>>>>> Kyle Bendickson
>>>>> Github: @kbendick
>>>>>
>>>>

-- 
Ryan Blue
Tabular

Re: Standard practices around PRs against multiple Spark versions

Posted by Kyle Bendickson <ky...@tabular.io>.
I like Yufei's idea of adding a template.

As per the concern about making changes separate and having things fall
through the cracks, I do share that concern.

On a related note though, I've already observed that a few times with
simple things that maybe people were working on before the split up. These
will go away with time, though could happen in the future when we add a new
version.

I agree a PR template would go a long way in ensuring that people are at
least aware.

It would be great if more folks who maintain forks could chime in as well,
as this has ramifications for that as well.

But overall, especially for larger PRs, I think it's a great idea.

- Kyle

On Fri, Nov 5, 2021 at 10:34 AM Yufei Gu <fl...@gmail.com> wrote:

> It is a good practice to create a template for Github Issues and PRs
> <https://docs.github.com/en/communities/using-templates-to-encourage-useful-issues-and-pull-requests/creating-a-pull-request-template-for-your-repository>. It
> will make a PR or issue easier to read overall. We can also enforce/suggest
> properties like affected versions and fixed versions.
>
> Best,
> Yufei
>
>
> On Wed, Nov 3, 2021 at 6:02 PM Wing Yew Poon <wy...@cloudera.com.invalid>
> wrote:
>
>> I wasn't aware that we were standardizing on such a practice. I don't
>> have a strong opinion on making changes one Spark version at a time or all
>> at once. I think committers who do reviews regularly should decide. My only
>> concern with making changes one version at a time is follow-through on the
>> part of the contributor. We want to ensure that a change/fix applicable to
>> multiple versions gets into all of them. Reviewer bandwidth is incurred
>> either way. (For simple changes/fixes, perhaps all at once does save
>> reviewer bandwidth, so we may want to be flexible.)
>>
>> On Wed, Nov 3, 2021 at 5:52 PM Jack Ye <ye...@gmail.com> wrote:
>>
>>> Thanks for bringing this up Kyle!
>>>
>>> My personal view is the following:
>>> 1. For new features, it should be very clear that we always implement
>>> them against the latest version. At the same time, I suggest we create an
>>> issue to track backport, so that if anyone is interested in backport he/she
>>> can work on it separately. We can tag these issues based on title names
>>> (e.g. "Backport: xxx" as title), and these are also good issues for new
>>> contributors to work on because there is already reference implementation
>>> in a newer version.
>>> 2. For bug fixes, I understand sometimes it's just a one line fix and
>>> people will try to just fix across versions. My take is that we should try
>>> to advocate for fixing 1 version and open an issue for other versions
>>> although it does not really need to be enforced as strictly. Sometimes even
>>> a 1 line fix has serious implications for different versions and might
>>> break stuff unintentionally. it's better that people that have production
>>> dependency on the specific version carefully review and test changes before
>>> merging.
>>>
>>> About enforcement strategy, I suggest we start to create a template for
>>> Github Issues and PRs
>>> <https://docs.github.com/en/communities/using-templates-to-encourage-useful-issues-and-pull-requests/creating-a-pull-request-template-for-your-repository>,
>>> where we state the guidelines related to engine versions, as well as
>>> Iceberg's preferred code style, naming convention, title convention, etc.
>>> to make new contributors a bit easier to submit changes without too much
>>> rewrite. Currently I observe that every time there is a new contributor, we
>>> need to state all the guidelines through PR review, which causes quite a
>>> lot of time spent on rewriting the code and also reduces the motivation for
>>> people to continue work on the PR.
>>>
>>> Best,
>>> Jack Ye
>>>
>>>
>>>
>>>
>>>
>>> On Wed, Nov 3, 2021 at 4:13 PM Kyle Bendickson <ky...@tabular.io> wrote:
>>>
>>>> I submitted a PR to fix a Spark bug today, applying the same changes to
>>>> all eligible Spark versions.
>>>>
>>>> Jack mentioned that he thought the practice going forward was to fix /
>>>> apply changes on the latest Spark version in one PR, and then open a second
>>>> PR to backport the fixes (presumably to minimize review overhead).
>>>>
>>>> Do we have a standard / preference on that? Jack mentioned he wasn't
>>>> certain, so I thought I'd ask here.
>>>>
>>>> Seems like a good practice but hoping to get some clarification :)
>>>>
>>>> --
>>>> Best,
>>>> Kyle Bendickson
>>>> Github: @kbendick
>>>>
>>>

Re: Standard practices around PRs against multiple Spark versions

Posted by Yufei Gu <fl...@gmail.com>.
It is a good practice to create a template for Github Issues and PRs
<https://docs.github.com/en/communities/using-templates-to-encourage-useful-issues-and-pull-requests/creating-a-pull-request-template-for-your-repository>.
It
will make a PR or issue easier to read overall. We can also enforce/suggest
properties like affected versions and fixed versions.

Best,
Yufei


On Wed, Nov 3, 2021 at 6:02 PM Wing Yew Poon <wy...@cloudera.com.invalid>
wrote:

> I wasn't aware that we were standardizing on such a practice. I don't have
> a strong opinion on making changes one Spark version at a time or all at
> once. I think committers who do reviews regularly should decide. My only
> concern with making changes one version at a time is follow-through on the
> part of the contributor. We want to ensure that a change/fix applicable to
> multiple versions gets into all of them. Reviewer bandwidth is incurred
> either way. (For simple changes/fixes, perhaps all at once does save
> reviewer bandwidth, so we may want to be flexible.)
>
> On Wed, Nov 3, 2021 at 5:52 PM Jack Ye <ye...@gmail.com> wrote:
>
>> Thanks for bringing this up Kyle!
>>
>> My personal view is the following:
>> 1. For new features, it should be very clear that we always implement
>> them against the latest version. At the same time, I suggest we create an
>> issue to track backport, so that if anyone is interested in backport he/she
>> can work on it separately. We can tag these issues based on title names
>> (e.g. "Backport: xxx" as title), and these are also good issues for new
>> contributors to work on because there is already reference implementation
>> in a newer version.
>> 2. For bug fixes, I understand sometimes it's just a one line fix and
>> people will try to just fix across versions. My take is that we should try
>> to advocate for fixing 1 version and open an issue for other versions
>> although it does not really need to be enforced as strictly. Sometimes even
>> a 1 line fix has serious implications for different versions and might
>> break stuff unintentionally. it's better that people that have production
>> dependency on the specific version carefully review and test changes before
>> merging.
>>
>> About enforcement strategy, I suggest we start to create a template for
>> Github Issues and PRs
>> <https://docs.github.com/en/communities/using-templates-to-encourage-useful-issues-and-pull-requests/creating-a-pull-request-template-for-your-repository>,
>> where we state the guidelines related to engine versions, as well as
>> Iceberg's preferred code style, naming convention, title convention, etc.
>> to make new contributors a bit easier to submit changes without too much
>> rewrite. Currently I observe that every time there is a new contributor, we
>> need to state all the guidelines through PR review, which causes quite a
>> lot of time spent on rewriting the code and also reduces the motivation for
>> people to continue work on the PR.
>>
>> Best,
>> Jack Ye
>>
>>
>>
>>
>>
>> On Wed, Nov 3, 2021 at 4:13 PM Kyle Bendickson <ky...@tabular.io> wrote:
>>
>>> I submitted a PR to fix a Spark bug today, applying the same changes to
>>> all eligible Spark versions.
>>>
>>> Jack mentioned that he thought the practice going forward was to fix /
>>> apply changes on the latest Spark version in one PR, and then open a second
>>> PR to backport the fixes (presumably to minimize review overhead).
>>>
>>> Do we have a standard / preference on that? Jack mentioned he wasn't
>>> certain, so I thought I'd ask here.
>>>
>>> Seems like a good practice but hoping to get some clarification :)
>>>
>>> --
>>> Best,
>>> Kyle Bendickson
>>> Github: @kbendick
>>>
>>

Re: Standard practices around PRs against multiple Spark versions

Posted by Wing Yew Poon <wy...@cloudera.com.INVALID>.
I wasn't aware that we were standardizing on such a practice. I don't have
a strong opinion on making changes one Spark version at a time or all at
once. I think committers who do reviews regularly should decide. My only
concern with making changes one version at a time is follow-through on the
part of the contributor. We want to ensure that a change/fix applicable to
multiple versions gets into all of them. Reviewer bandwidth is incurred
either way. (For simple changes/fixes, perhaps all at once does save
reviewer bandwidth, so we may want to be flexible.)

On Wed, Nov 3, 2021 at 5:52 PM Jack Ye <ye...@gmail.com> wrote:

> Thanks for bringing this up Kyle!
>
> My personal view is the following:
> 1. For new features, it should be very clear that we always implement them
> against the latest version. At the same time, I suggest we create an issue
> to track backport, so that if anyone is interested in backport he/she can
> work on it separately. We can tag these issues based on title names (e.g.
> "Backport: xxx" as title), and these are also good issues for new
> contributors to work on because there is already reference implementation
> in a newer version.
> 2. For bug fixes, I understand sometimes it's just a one line fix and
> people will try to just fix across versions. My take is that we should try
> to advocate for fixing 1 version and open an issue for other versions
> although it does not really need to be enforced as strictly. Sometimes even
> a 1 line fix has serious implications for different versions and might
> break stuff unintentionally. it's better that people that have production
> dependency on the specific version carefully review and test changes before
> merging.
>
> About enforcement strategy, I suggest we start to create a template for
> Github Issues and PRs
> <https://docs.github.com/en/communities/using-templates-to-encourage-useful-issues-and-pull-requests/creating-a-pull-request-template-for-your-repository>,
> where we state the guidelines related to engine versions, as well as
> Iceberg's preferred code style, naming convention, title convention, etc.
> to make new contributors a bit easier to submit changes without too much
> rewrite. Currently I observe that every time there is a new contributor, we
> need to state all the guidelines through PR review, which causes quite a
> lot of time spent on rewriting the code and also reduces the motivation for
> people to continue work on the PR.
>
> Best,
> Jack Ye
>
>
>
>
>
> On Wed, Nov 3, 2021 at 4:13 PM Kyle Bendickson <ky...@tabular.io> wrote:
>
>> I submitted a PR to fix a Spark bug today, applying the same changes to
>> all eligible Spark versions.
>>
>> Jack mentioned that he thought the practice going forward was to fix /
>> apply changes on the latest Spark version in one PR, and then open a second
>> PR to backport the fixes (presumably to minimize review overhead).
>>
>> Do we have a standard / preference on that? Jack mentioned he wasn't
>> certain, so I thought I'd ask here.
>>
>> Seems like a good practice but hoping to get some clarification :)
>>
>> --
>> Best,
>> Kyle Bendickson
>> Github: @kbendick
>>
>

Re: Standard practices around PRs against multiple Spark versions

Posted by Jack Ye <ye...@gmail.com>.
Thanks for bringing this up Kyle!

My personal view is the following:
1. For new features, it should be very clear that we always implement them
against the latest version. At the same time, I suggest we create an issue
to track backport, so that if anyone is interested in backport he/she can
work on it separately. We can tag these issues based on title names (e.g.
"Backport: xxx" as title), and these are also good issues for new
contributors to work on because there is already reference implementation
in a newer version.
2. For bug fixes, I understand sometimes it's just a one line fix and
people will try to just fix across versions. My take is that we should try
to advocate for fixing 1 version and open an issue for other versions
although it does not really need to be enforced as strictly. Sometimes even
a 1 line fix has serious implications for different versions and might
break stuff unintentionally. it's better that people that have production
dependency on the specific version carefully review and test changes before
merging.

About enforcement strategy, I suggest we start to create a template for
Github Issues and PRs
<https://docs.github.com/en/communities/using-templates-to-encourage-useful-issues-and-pull-requests/creating-a-pull-request-template-for-your-repository>,
where we state the guidelines related to engine versions, as well as
Iceberg's preferred code style, naming convention, title convention, etc.
to make new contributors a bit easier to submit changes without too much
rewrite. Currently I observe that every time there is a new contributor, we
need to state all the guidelines through PR review, which causes quite a
lot of time spent on rewriting the code and also reduces the motivation for
people to continue work on the PR.

Best,
Jack Ye





On Wed, Nov 3, 2021 at 4:13 PM Kyle Bendickson <ky...@tabular.io> wrote:

> I submitted a PR to fix a Spark bug today, applying the same changes to
> all eligible Spark versions.
>
> Jack mentioned that he thought the practice going forward was to fix /
> apply changes on the latest Spark version in one PR, and then open a second
> PR to backport the fixes (presumably to minimize review overhead).
>
> Do we have a standard / preference on that? Jack mentioned he wasn't
> certain, so I thought I'd ask here.
>
> Seems like a good practice but hoping to get some clarification :)
>
> --
> Best,
> Kyle Bendickson
> Github: @kbendick
>