You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@hive.apache.org by Stamatis Zampetakis <za...@gmail.com> on 2021/09/24 14:39:32 UTC

Commit message guidelines

Hi all,

I think we all more or less follow some standard pattern when committing in
Hive but with some small effort we could make things more uniform and
hopefully better.
I would like to start a discussion about creating some guidelines, which we
could put to the wiki or in contributing.md, to improve the quality of our
history (git log).
I outline some suggestions below to kick off the discussion. Many things in
the list are minor (and maybe even personal preferences) but one thing
which is really missing from the project is B3 especially the *why* part.
Why is the commit necessary? Why has the change been made?.
In some cases the why part is also missing from the JIRA making the code
harder to maintain.

Subject line:
S1. Start with the Jira id capitalized and followed immediately (no space)
by double colon (:)
S2. Leave one space after the Jira id, and start the summary with a capital
letter
S3. Keep it concise (ideally less than 72 characters) and provide a useful
description of the change
S4. Do not include or end the line with period
S5. Do not include the pull request id in the summary
S6. Use imperative mood (“Add a handler …”) rather than past tense (“Added
a handler …”) or present tense (“Adds a handler …”)
S7. Avoid using "Fix"; If you are fixing a bug, it is sufficient to
describe the bug (“NullPointerException if user is unknown”) and people
will correctly surmise that the purpose of your change is to fix the bug.
S8. Do not add a contributor's name; the author tag is made exactly for
this and can be explored/parsed much more efficiently by tools/people for
stats or other purposes
S9. Do not add reviewers name; information is present in multiple places
(e.g., committer tag, PR, JIRA)

Message body: (Trivial changes may not require a body)
B1. Separate subject from body with a blank line
B2. Wrap the body at 72 characters
B3. Use the body to explain what and why vs. how
Example
"Add handler methods in HiveRelMdDistictRowCount for JdbcHiveTableScan and
Converter to avoid executing the fallback method which in many cases
returns null and can cause NPE when this value propagates up the call
stack."
vs.
"Added handler methods in HiveRelMdDistictRowCount for JdbcHiveTableScan
and Converter"
B4. If multiple authors include them using the standard GitHub marker,
"Co-authored-by:", followed by the name and email of the author (e.g.,
Co-authored-by: Marton Bod <mb...@cloudera.com>)
B5. If the reviewer is different from committer (or merge via GitHub UI)
use "Reviewed-by:" followed by the name and email of the reviewer (e.g.,
Reviewed-by: Stamatis Zampetakis <za...@gmail.com>)
B6. Use "Co-authored-by"/"Reviewed-by" on each own line and repeat as many
times as authors/reviewers.
B7. Include the PR id at the end of the message (e.g., Closes #2514);
someone can easily navigate back to the PR to check comments, reviewers,
etc.

A sample commit message following these guidelines is shown below:

commit de7781f29f82083fe01274b4d436b52920a89173
Author: Soumyakanti Das <so...@gmail.com>
Commit: Stamatis Zampetakis <za...@gmail.com>

    HIVE-25354: NPE when estimating row count in external JDBC tables

    Add handler methods in HiveRelMdDistictRowCount for JdbcHiveTableScan
    and Converter to avoid executing the fallback method which in many
    cases returns null and can cause NPE when this value propagates up the
    call stack.

    Co-authored-by: Krisztian Kasa <kk...@cloudera.com>

    Reviewed-by: Peter Vary <pv...@cloudera.com>
    Reviewed-by: Zoltan Haindrich <ki...@rxd.hu>

    Closes #2514

Let me know your thoughts.

Best,
Stamatis

Re: Commit message guidelines

Posted by Stamatis Zampetakis <za...@gmail.com>.
Hi,

Part of these guidelines can be enforced by setting up webhooks [1], GitHub
actions [2], branch protection rules [3], but the most important enforcer
IMO is the reviewer/committer.
I am mostly interested in the quality of the message rather than if it has
a period in the end (.) or if it misses a space, although I have to say I
like things to be uniform :)
The first step would be for the community to express some interest about
having some guidelines (not necessarily these ones) and put them somewhere
so we can follow them; thanks to Alessandro and Peter who did that already.

The main usage of the commit summary from the majority of people is to find
and understand a specific change. The names there are redundant (since they
appear elsewhere) add clutter and some small editing overhead when
committing.
I also want to extract information from time to time from commits but this
is not something that is needed very often. Usually, I rely on custom
format (e.g., git log --format=format:"%h %aN %cN %s") or other means (SQL
via Calcite :)) to get what I want.

The 72 character line length (excluding JIRA id) is more important on the
summary and it is not supposed to be a hard limit. It is something that
could make people think a bit more when editing the message to find better
words.

About the length of the body, I don't really mind. I would be happy if
there was one in the first place. The majority of commits in the repo do
not have a body which is kind of problematic. Certainly it is not needed
when the change is trivial but I doubt it is in 90% of the cases.
There are people who write everything in the same line and rely on the
editors' wrapping and others who always wrap based on their preferences.
Preferences can vary so again this is an attempt to make things a bit more
uniform.

Best,
Stamatis

[1]
https://docs.github.com/en/developers/webhooks-and-events/webhooks/about-webhooks
[2] https://github.com/marketplace/actions/gs-commit-message-checker
[3]
https://docs.github.com/en/repositories/configuring-branches-and-merges-in-your-repository/defining-the-mergeability-of-pull-requests/about-protected-branches#require-status-checks-before-merging


On Tue, Sep 28, 2021 at 6:29 PM Peter Vary <pv...@cloudera.com.invalid>
wrote:

> Hi Stamatis,
>
> I am generally positive for every standardization process. My experience
> thought me that a standardized code is usually better in the long run than
> any super optimized solution where we specialize every codepath for the
> locally best optimum.
>
> First question:
> - Is it possible to check/enforce these guidelines somehow?
>
> If we have some tool to enforce this, that would be super helpful.
>
> I agree with most of your points.
>
> There are a few question marks, but I can live with all of the proposed
> requirements if that is the consensus:
> - I think having contributor / reviewer / PR number in the commit message
> is helpful (at least for me). I use the `git log --pretty=online|grep`
> quite often, and one usage is when I would like to collect the PR's which
> were authored or reviewed by a contributor
> - I am not sure about the 72 character line length - the PRs are most
> often viewed by tools which do the wrapping based on the available screen
> size. Wouldn't we want to relay on that?
>
> Thanks,
> Peter
>
> > On Sep 28, 2021, at 11:25, Alessandro Solimando <
> alessandro.solimando@gmail.com> wrote:
> >
> > Hi Stamatis,
> > thanks for the suggestions, I think they are reasonable, the project
> would
> > benefit from their adoption.
> >
> > Regarding the removal of contributor/reviewers names from the commit
> > message favoring the use of git metadata, there has been a similar
> discussion
> > in Calcite ML
> > <
> https://mail-archives.apache.org/mod_mbox/calcite-dev/202109.mbox/%3cCAFQnWdZYs+gEmEEEvHrEGgJeDpMJn5Bc4bvECdusgrTcYkVvHw@mail.gmail.com%3e
> >
> > which
> > led to consensus.
> >
> > Best regards,
> > Alessandro
> >
> > On Fri, 24 Sept 2021 at 16:40, Stamatis Zampetakis <za...@gmail.com>
> > wrote:
> >
> >> Hi all,
> >>
> >> I think we all more or less follow some standard pattern when
> committing in
> >> Hive but with some small effort we could make things more uniform and
> >> hopefully better.
> >> I would like to start a discussion about creating some guidelines,
> which we
> >> could put to the wiki or in contributing.md, to improve the quality of
> our
> >> history (git log).
> >> I outline some suggestions below to kick off the discussion. Many
> things in
> >> the list are minor (and maybe even personal preferences) but one thing
> >> which is really missing from the project is B3 especially the *why*
> part.
> >> Why is the commit necessary? Why has the change been made?.
> >> In some cases the why part is also missing from the JIRA making the code
> >> harder to maintain.
> >>
> >> Subject line:
> >> S1. Start with the Jira id capitalized and followed immediately (no
> space)
> >> by double colon (:)
> >> S2. Leave one space after the Jira id, and start the summary with a
> capital
> >> letter
> >> S3. Keep it concise (ideally less than 72 characters) and provide a
> useful
> >> description of the change
> >> S4. Do not include or end the line with period
> >> S5. Do not include the pull request id in the summary
> >> S6. Use imperative mood (“Add a handler …”) rather than past tense
> (“Added
> >> a handler …”) or present tense (“Adds a handler …”)
> >> S7. Avoid using "Fix"; If you are fixing a bug, it is sufficient to
> >> describe the bug (“NullPointerException if user is unknown”) and people
> >> will correctly surmise that the purpose of your change is to fix the
> bug.
> >> S8. Do not add a contributor's name; the author tag is made exactly for
> >> this and can be explored/parsed much more efficiently by tools/people
> for
> >> stats or other purposes
> >> S9. Do not add reviewers name; information is present in multiple places
> >> (e.g., committer tag, PR, JIRA)
> >>
> >> Message body: (Trivial changes may not require a body)
> >> B1. Separate subject from body with a blank line
> >> B2. Wrap the body at 72 characters
> >> B3. Use the body to explain what and why vs. how
> >> Example
> >> "Add handler methods in HiveRelMdDistictRowCount for JdbcHiveTableScan
> and
> >> Converter to avoid executing the fallback method which in many cases
> >> returns null and can cause NPE when this value propagates up the call
> >> stack."
> >> vs.
> >> "Added handler methods in HiveRelMdDistictRowCount for JdbcHiveTableScan
> >> and Converter"
> >> B4. If multiple authors include them using the standard GitHub marker,
> >> "Co-authored-by:", followed by the name and email of the author (e.g.,
> >> Co-authored-by: Marton Bod <mb...@cloudera.com>)
> >> B5. If the reviewer is different from committer (or merge via GitHub UI)
> >> use "Reviewed-by:" followed by the name and email of the reviewer (e.g.,
> >> Reviewed-by: Stamatis Zampetakis <za...@gmail.com>)
> >> B6. Use "Co-authored-by"/"Reviewed-by" on each own line and repeat as
> many
> >> times as authors/reviewers.
> >> B7. Include the PR id at the end of the message (e.g., Closes #2514);
> >> someone can easily navigate back to the PR to check comments, reviewers,
> >> etc.
> >>
> >> A sample commit message following these guidelines is shown below:
> >>
> >> commit de7781f29f82083fe01274b4d436b52920a89173
> >> Author: Soumyakanti Das <so...@gmail.com>
> >> Commit: Stamatis Zampetakis <za...@gmail.com>
> >>
> >>    HIVE-25354: NPE when estimating row count in external JDBC tables
> >>
> >>    Add handler methods in HiveRelMdDistictRowCount for JdbcHiveTableScan
> >>    and Converter to avoid executing the fallback method which in many
> >>    cases returns null and can cause NPE when this value propagates up
> the
> >>    call stack.
> >>
> >>    Co-authored-by: Krisztian Kasa <kk...@cloudera.com>
> >>
> >>    Reviewed-by: Peter Vary <pv...@cloudera.com>
> >>    Reviewed-by: Zoltan Haindrich <ki...@rxd.hu>
> >>
> >>    Closes #2514
> >>
> >> Let me know your thoughts.
> >>
> >> Best,
> >> Stamatis
> >>
>
>

Re: Commit message guidelines

Posted by Peter Vary <pv...@cloudera.com.INVALID>.
Hi Stamatis,

I am generally positive for every standardization process. My experience thought me that a standardized code is usually better in the long run than any super optimized solution where we specialize every codepath for the locally best optimum.

First question:
- Is it possible to check/enforce these guidelines somehow?

If we have some tool to enforce this, that would be super helpful.

I agree with most of your points.

There are a few question marks, but I can live with all of the proposed requirements if that is the consensus:
- I think having contributor / reviewer / PR number in the commit message is helpful (at least for me). I use the `git log --pretty=online|grep` quite often, and one usage is when I would like to collect the PR's which were authored or reviewed by a contributor
- I am not sure about the 72 character line length - the PRs are most often viewed by tools which do the wrapping based on the available screen size. Wouldn't we want to relay on that?

Thanks,
Peter

> On Sep 28, 2021, at 11:25, Alessandro Solimando <al...@gmail.com> wrote:
> 
> Hi Stamatis,
> thanks for the suggestions, I think they are reasonable, the project would
> benefit from their adoption.
> 
> Regarding the removal of contributor/reviewers names from the commit
> message favoring the use of git metadata, there has been a similar discussion
> in Calcite ML
> <https://mail-archives.apache.org/mod_mbox/calcite-dev/202109.mbox/%3cCAFQnWdZYs+gEmEEEvHrEGgJeDpMJn5Bc4bvECdusgrTcYkVvHw@mail.gmail.com%3e>
> which
> led to consensus.
> 
> Best regards,
> Alessandro
> 
> On Fri, 24 Sept 2021 at 16:40, Stamatis Zampetakis <za...@gmail.com>
> wrote:
> 
>> Hi all,
>> 
>> I think we all more or less follow some standard pattern when committing in
>> Hive but with some small effort we could make things more uniform and
>> hopefully better.
>> I would like to start a discussion about creating some guidelines, which we
>> could put to the wiki or in contributing.md, to improve the quality of our
>> history (git log).
>> I outline some suggestions below to kick off the discussion. Many things in
>> the list are minor (and maybe even personal preferences) but one thing
>> which is really missing from the project is B3 especially the *why* part.
>> Why is the commit necessary? Why has the change been made?.
>> In some cases the why part is also missing from the JIRA making the code
>> harder to maintain.
>> 
>> Subject line:
>> S1. Start with the Jira id capitalized and followed immediately (no space)
>> by double colon (:)
>> S2. Leave one space after the Jira id, and start the summary with a capital
>> letter
>> S3. Keep it concise (ideally less than 72 characters) and provide a useful
>> description of the change
>> S4. Do not include or end the line with period
>> S5. Do not include the pull request id in the summary
>> S6. Use imperative mood (“Add a handler …”) rather than past tense (“Added
>> a handler …”) or present tense (“Adds a handler …”)
>> S7. Avoid using "Fix"; If you are fixing a bug, it is sufficient to
>> describe the bug (“NullPointerException if user is unknown”) and people
>> will correctly surmise that the purpose of your change is to fix the bug.
>> S8. Do not add a contributor's name; the author tag is made exactly for
>> this and can be explored/parsed much more efficiently by tools/people for
>> stats or other purposes
>> S9. Do not add reviewers name; information is present in multiple places
>> (e.g., committer tag, PR, JIRA)
>> 
>> Message body: (Trivial changes may not require a body)
>> B1. Separate subject from body with a blank line
>> B2. Wrap the body at 72 characters
>> B3. Use the body to explain what and why vs. how
>> Example
>> "Add handler methods in HiveRelMdDistictRowCount for JdbcHiveTableScan and
>> Converter to avoid executing the fallback method which in many cases
>> returns null and can cause NPE when this value propagates up the call
>> stack."
>> vs.
>> "Added handler methods in HiveRelMdDistictRowCount for JdbcHiveTableScan
>> and Converter"
>> B4. If multiple authors include them using the standard GitHub marker,
>> "Co-authored-by:", followed by the name and email of the author (e.g.,
>> Co-authored-by: Marton Bod <mb...@cloudera.com>)
>> B5. If the reviewer is different from committer (or merge via GitHub UI)
>> use "Reviewed-by:" followed by the name and email of the reviewer (e.g.,
>> Reviewed-by: Stamatis Zampetakis <za...@gmail.com>)
>> B6. Use "Co-authored-by"/"Reviewed-by" on each own line and repeat as many
>> times as authors/reviewers.
>> B7. Include the PR id at the end of the message (e.g., Closes #2514);
>> someone can easily navigate back to the PR to check comments, reviewers,
>> etc.
>> 
>> A sample commit message following these guidelines is shown below:
>> 
>> commit de7781f29f82083fe01274b4d436b52920a89173
>> Author: Soumyakanti Das <so...@gmail.com>
>> Commit: Stamatis Zampetakis <za...@gmail.com>
>> 
>>    HIVE-25354: NPE when estimating row count in external JDBC tables
>> 
>>    Add handler methods in HiveRelMdDistictRowCount for JdbcHiveTableScan
>>    and Converter to avoid executing the fallback method which in many
>>    cases returns null and can cause NPE when this value propagates up the
>>    call stack.
>> 
>>    Co-authored-by: Krisztian Kasa <kk...@cloudera.com>
>> 
>>    Reviewed-by: Peter Vary <pv...@cloudera.com>
>>    Reviewed-by: Zoltan Haindrich <ki...@rxd.hu>
>> 
>>    Closes #2514
>> 
>> Let me know your thoughts.
>> 
>> Best,
>> Stamatis
>> 


Re: Commit message guidelines

Posted by Alessandro Solimando <al...@gmail.com>.
Hi Stamatis,
thanks for the suggestions, I think they are reasonable, the project would
benefit from their adoption.

Regarding the removal of contributor/reviewers names from the commit
message favoring the use of git metadata, there has been a similar discussion
in Calcite ML
<https://mail-archives.apache.org/mod_mbox/calcite-dev/202109.mbox/%3cCAFQnWdZYs+gEmEEEvHrEGgJeDpMJn5Bc4bvECdusgrTcYkVvHw@mail.gmail.com%3e>
which
led to consensus.

Best regards,
Alessandro

On Fri, 24 Sept 2021 at 16:40, Stamatis Zampetakis <za...@gmail.com>
wrote:

> Hi all,
>
> I think we all more or less follow some standard pattern when committing in
> Hive but with some small effort we could make things more uniform and
> hopefully better.
> I would like to start a discussion about creating some guidelines, which we
> could put to the wiki or in contributing.md, to improve the quality of our
> history (git log).
> I outline some suggestions below to kick off the discussion. Many things in
> the list are minor (and maybe even personal preferences) but one thing
> which is really missing from the project is B3 especially the *why* part.
> Why is the commit necessary? Why has the change been made?.
> In some cases the why part is also missing from the JIRA making the code
> harder to maintain.
>
> Subject line:
> S1. Start with the Jira id capitalized and followed immediately (no space)
> by double colon (:)
> S2. Leave one space after the Jira id, and start the summary with a capital
> letter
> S3. Keep it concise (ideally less than 72 characters) and provide a useful
> description of the change
> S4. Do not include or end the line with period
> S5. Do not include the pull request id in the summary
> S6. Use imperative mood (“Add a handler …”) rather than past tense (“Added
> a handler …”) or present tense (“Adds a handler …”)
> S7. Avoid using "Fix"; If you are fixing a bug, it is sufficient to
> describe the bug (“NullPointerException if user is unknown”) and people
> will correctly surmise that the purpose of your change is to fix the bug.
> S8. Do not add a contributor's name; the author tag is made exactly for
> this and can be explored/parsed much more efficiently by tools/people for
> stats or other purposes
> S9. Do not add reviewers name; information is present in multiple places
> (e.g., committer tag, PR, JIRA)
>
> Message body: (Trivial changes may not require a body)
> B1. Separate subject from body with a blank line
> B2. Wrap the body at 72 characters
> B3. Use the body to explain what and why vs. how
> Example
> "Add handler methods in HiveRelMdDistictRowCount for JdbcHiveTableScan and
> Converter to avoid executing the fallback method which in many cases
> returns null and can cause NPE when this value propagates up the call
> stack."
> vs.
> "Added handler methods in HiveRelMdDistictRowCount for JdbcHiveTableScan
> and Converter"
> B4. If multiple authors include them using the standard GitHub marker,
> "Co-authored-by:", followed by the name and email of the author (e.g.,
> Co-authored-by: Marton Bod <mb...@cloudera.com>)
> B5. If the reviewer is different from committer (or merge via GitHub UI)
> use "Reviewed-by:" followed by the name and email of the reviewer (e.g.,
> Reviewed-by: Stamatis Zampetakis <za...@gmail.com>)
> B6. Use "Co-authored-by"/"Reviewed-by" on each own line and repeat as many
> times as authors/reviewers.
> B7. Include the PR id at the end of the message (e.g., Closes #2514);
> someone can easily navigate back to the PR to check comments, reviewers,
> etc.
>
> A sample commit message following these guidelines is shown below:
>
> commit de7781f29f82083fe01274b4d436b52920a89173
> Author: Soumyakanti Das <so...@gmail.com>
> Commit: Stamatis Zampetakis <za...@gmail.com>
>
>     HIVE-25354: NPE when estimating row count in external JDBC tables
>
>     Add handler methods in HiveRelMdDistictRowCount for JdbcHiveTableScan
>     and Converter to avoid executing the fallback method which in many
>     cases returns null and can cause NPE when this value propagates up the
>     call stack.
>
>     Co-authored-by: Krisztian Kasa <kk...@cloudera.com>
>
>     Reviewed-by: Peter Vary <pv...@cloudera.com>
>     Reviewed-by: Zoltan Haindrich <ki...@rxd.hu>
>
>     Closes #2514
>
> Let me know your thoughts.
>
> Best,
> Stamatis
>