You are viewing a plain text version of this content. The canonical link for it is here.
Posted to infrastructure-dev@apache.org by Jukka Zitting <ju...@gmail.com> on 2011/12/14 14:19:52 UTC

@apache.org commit address requirement (Was: Git hosting is go)

Hi,

The asfgit pre_receive hook currently checks that each commit being
pushed has its committer email (%ce) set to the @apache.org address of
one of committers of the respective project.

This seems too strict and makes it quite cumbersome to handle things
like pull requests where you'd want to take in incoming commits as-is
and just merge them to the main branch instead of rebasing or
rewriting the commits. We can't even rely on the %ce information being
correct since it's set on the client side.

Instead of this %ce check I'd leave the commit messages as-is and
rather augment the ref-updates.log with information about who pushed
each individual commit. That gives us a more reliable record for tying
individual changes to ICLAs than the %ce addresses in commit messages.

Alternatively, if we do want to include explicit and properly verified
ICLA references in the actual Git commit messages (instead of an
out-of-band mechanism like ref-updates.log), then I'd rather use
separate Signed-Off-By labels that the committer who pushes a set of
commits needs to add to each commit being pushed.

WDYT? I can make the changes if we can reach consensus on this.

[See below for the relevant context from callback-dev@]

BR,

Jukka Zitting

---------- Forwarded message ----------
From: Jukka Zitting <ju...@gmail.com>
Date: Wed, Dec 14, 2011 at 1:58 PM
Subject: Re: Git hosting is go
To: callback-dev@incubator.apache.org


Hi,


On Wed, Dec 14, 2011 at 12:05 PM, Jukka Zitting <ju...@gmail.com> wrote:
> I'm not sure how strict the backend is about commits pulled from
> elsewhere. Ideally it should only record the currently authenticated
> user or at most require a Signed-Off-By entry from the committer who's
> pushing the commits. I'll see what I can find out.

The backend check that the committer (not author) address of each
commit being pushed is an @apache.org address of one of the committers
authorized to push to the repository.

Checking the committer address seems way too strict as it would
require breaking history on all pull requests from outside the core
development group. I'll bring this up on infra-dev@ [1], to see how we
could best address the issue.

[1] http://mail-archives.apache.org/mod_mbox/www-infrastructure-dev/

BR,

Jukka Zitting

Re: @apache.org commit address requirement (Was: Git hosting is go)

Posted by Owen O'Malley <om...@apache.org>.
On Wed, Dec 14, 2011 at 5:19 AM, Jukka Zitting <ju...@gmail.com> wrote:
> Instead of this %ce check I'd leave the commit messages as-is and
> rather augment the ref-updates.log with information about who pushed
> each individual commit. That gives us a more reliable record for tying
> individual changes to ICLAs than the %ce addresses in commit messages.

This sounds like the right approach. Making the committer be in
*@apache.org is much too strict
on the one hand and doesn't provide the desired property of knowing
who actually did the push to the
central repo. The record of who actually did the push to apache
central repo is required I think.

-- Owen

Re: @apache.org commit address requirement (Was: Git hosting is go)

Posted by Jukka Zitting <ju...@gmail.com>.
Hi,

How about, since we don't seem to have consensus on this, we make the
%ce check configurable on a per-project basis?

That way we can collect experience of how things work both with and
without this check. Then in say 3-6 months from now we look back and
see how things have turned out. If it then looks like the %ce check
(or any variant of it) is useful or even necessary, we can make a more
informed decision about mandating it on all projects.

BR,

Jukka Zitting

Re: @apache.org commit address requirement (Was: Git hosting is go)

Posted by Paul Davis <pa...@gmail.com>.
On Wed, Dec 14, 2011 at 4:29 PM, Brett Porter <br...@apache.org> wrote:
>
> On 15/12/2011, at 6:15 AM, Jukka Zitting wrote:
>
>> I'd treat those as social problems that are best solved by social
>> means. Introducing technical barriers will just force people to use
>> awkward workarounds when things like pull requests don't work like
>> they'd expect. After all, there are plenty of cases where people have
>> accidentally committed extra files or other garbage with Subversion. I
>> don't see much difference here.
>>
>> I wouldn't object to having the %ce check available as an option for
>> projects that want it or even enabled by default, but IMHO it
>> shouldn't be mandatory for all projects.
>
> I know for myself that I've far too often forgotten a git config on a new clone and accidentally put @apache.org in a work repo or vice-versa, and seen a few <ro...@someserver-name> crop up. I think the check is useful as a form of assistance, especially since you can't really fix it once it is pushed, and not always obvious when committing.
>
> Is there a way to rebase a commit so the author is retained but the committer is adjusted to the person approving it, so that these can be handled more easily?
>
> Worth noting - while I'm not sure if it is current, it appears that Eclipse require a commit line match [1] (and even go a bit stricter [2]).
>
> The ICLA/email link would be ideal... or perhaps there is a compromise solution in the mean time - for example, reading a list of email addresses from CONTRIBUTORS or pom.xml in the top of the git repository that are allowed to commit, which you'd need to update first.
>

This is a decent short term solution that I'd be willing to
investigate. Though I would require that it be OOB from the actual
repository content.

> Cheers,
> Brett
>
> [1] http://wiki.eclipse.org/Git#Committers_new_to_Git
> [2] https://bugs.eclipse.org/bugs/show_bug.cgi?id=324768
>
> --
> Brett Porter
> brett@apache.org
> http://brettporter.wordpress.com/
> http://au.linkedin.com/in/brettporter
> http://twitter.com/brettporter
>
>
>
>
>

Re: @apache.org commit address requirement (Was: Git hosting is go)

Posted by Brett Porter <br...@apache.org>.
On 15/12/2011, at 6:15 AM, Jukka Zitting wrote:

> I'd treat those as social problems that are best solved by social
> means. Introducing technical barriers will just force people to use
> awkward workarounds when things like pull requests don't work like
> they'd expect. After all, there are plenty of cases where people have
> accidentally committed extra files or other garbage with Subversion. I
> don't see much difference here.
> 
> I wouldn't object to having the %ce check available as an option for
> projects that want it or even enabled by default, but IMHO it
> shouldn't be mandatory for all projects.

I know for myself that I've far too often forgotten a git config on a new clone and accidentally put @apache.org in a work repo or vice-versa, and seen a few <ro...@someserver-name> crop up. I think the check is useful as a form of assistance, especially since you can't really fix it once it is pushed, and not always obvious when committing.

Is there a way to rebase a commit so the author is retained but the committer is adjusted to the person approving it, so that these can be handled more easily?

Worth noting - while I'm not sure if it is current, it appears that Eclipse require a commit line match [1] (and even go a bit stricter [2]).

The ICLA/email link would be ideal... or perhaps there is a compromise solution in the mean time - for example, reading a list of email addresses from CONTRIBUTORS or pom.xml in the top of the git repository that are allowed to commit, which you'd need to update first.

Cheers,
Brett

[1] http://wiki.eclipse.org/Git#Committers_new_to_Git
[2] https://bugs.eclipse.org/bugs/show_bug.cgi?id=324768

--
Brett Porter
brett@apache.org
http://brettporter.wordpress.com/
http://au.linkedin.com/in/brettporter
http://twitter.com/brettporter






Re: @apache.org commit address requirement (Was: Git hosting is go)

Posted by Paul Davis <pa...@gmail.com>.
On Tue, Dec 20, 2011 at 4:36 AM, Simon Pepping <sa...@gmail.com> wrote:
> On Tue, Dec 20, 2011 at 00:15, Jukka Zitting <ju...@gmail.com>wrote:
>
>>
>> * So, instead of checking the %ce field, my proposal would be a check
>> that verifies that either the recorded author or at least one of the
>> (possible multiple) signers-off of an incoming git commit has a CLA on
>> file. And for this to only require rewriting of commits ("git commit
>> --amend -s") when actually needed, the check should be against all the
>> recorded CLAs (ideally with possibly multiple email addresses per CLA)
>> instead of just the committers of a project. Thus an Apache committer
>> would only need to modify an incoming commit if its author doesn't
>> already have a CLA on file.
>>
>> The author is the person contributing the code, and therefore the holder
> of the copyright on it, and therefore the person who can license it to the
> ASF. The committer is the accidental person who last added the commit to
> his repository in a way that rewrote the Committer field. His IP is none.
> Therefore the above proposal is better than applying rules to the Committer
> field.
>
> Simon

But the committer is the person who verified that the person had the
IP and hence is required to be making the gate keeper checks on if
that code is appropriate for inclusion. I'd be open to checking
either, but this goes back to Roy's point about not requiring extra
checks than we've historically had for SVN.

Re: @apache.org commit address requirement (Was: Git hosting is go)

Posted by Simon Pepping <sa...@gmail.com>.
On Tue, Dec 20, 2011 at 00:15, Jukka Zitting <ju...@gmail.com>wrote:

>
> * So, instead of checking the %ce field, my proposal would be a check
> that verifies that either the recorded author or at least one of the
> (possible multiple) signers-off of an incoming git commit has a CLA on
> file. And for this to only require rewriting of commits ("git commit
> --amend -s") when actually needed, the check should be against all the
> recorded CLAs (ideally with possibly multiple email addresses per CLA)
> instead of just the committers of a project. Thus an Apache committer
> would only need to modify an incoming commit if its author doesn't
> already have a CLA on file.
>
> The author is the person contributing the code, and therefore the holder
of the copyright on it, and therefore the person who can license it to the
ASF. The committer is the accidental person who last added the commit to
his repository in a way that rewrote the Committer field. His IP is none.
Therefore the above proposal is better than applying rules to the Committer
field.

Simon

Re: @apache.org commit address requirement (Was: Git hosting is go)

Posted by Brett Porter <br...@apache.org>.
On 21/12/2011, at 12:54 PM, Paul Davis wrote:

>> An observation I wanted to make from this thread: we seem to talk a lot about a group of people who have a CLA on file but are not committers (ASF sense of the word). It is great if we can make signing CLAs easy and making contributions with one easy. However, I believe that group of people should be small at any point in time - regular contributions should mean that person is on the path to becoming a committer.
>> 
> 
> The general sentiment as I understand it is that the number of people
> with an ICLA that aren't ASF committers will be expected to grow
> considerably. The general idea as I understand it is to make this step
> less painful so even less-than-regular contributors have an ICLA on
> file. It's also expected that this process will smooth the on-boarding
> process for new committers as well.

Less pain for those two things is certainly good, but not sure about the general sentiment. We don't want it to end up establishing an environment where that form of contribution becomes the norm. Every contribution is different, but in many cases projects need to encourage the authors to stick around and enhance or support the work, or expand their involvement.

Apologies for mixing in a slightly different subject, though - this is related to how projects use it, not what the tools need to enforce.

>> 
>> What I would find helpful is to list out the important scenarios, and how they'd play out with the different approaches, along with pros and cons. Is that something that is available from the CouchDB documentation?
>> 
> 
> There aren't any docs on this currently because we're still operating
> in compliance with historical guidelines on moving patches through
> JIRA. The current check is the very strict, "every %ce field must be
> an Apache CouchDB committer," version.
> 
> Sam has shown me a couple places to harvest email aliases and ICLA
> email addresses from so I plan on opening this up to any email address
> from those sources, with the future being an LDAP query (under the
> assumption that having an ICLA on file results in an LDAP entry).
> 

Understood, and this all sounds fine to me. 

I just meant if this thread continues, I'd find it easier to digest if we look at specific contribution/commit/push scenarios and what the pros/cons of different approaches are than the more general discussion being had now.

Cheers,
Brett

--
Brett Porter
brett@apache.org
http://brettporter.wordpress.com/
http://au.linkedin.com/in/brettporter
http://twitter.com/brettporter






Re: @apache.org commit address requirement (Was: Git hosting is go)

Posted by Paul Davis <pa...@gmail.com>.
On Tue, Dec 20, 2011 at 7:31 PM, Brett Porter <br...@apache.org> wrote:
> I have a couple of tangential points to this, had to guess what the right place to insert them into the discussion was...
>
> On 21/12/2011, at 7:04 AM, Paul Davis wrote:
>
>> The reflog is useful, but I don't think is the end of the story. For
>> instance, what happens if an ASF committer works on a branch, and then
>> a second ASF committer pushes those changes to master? The reflog
>> would tell us that the first ASF committer introduced the code into
>> the repository, but the second ASF committer is the person responsible
>> for that code eventually going into a release which is (from what I
>> understand) the more important of the two roles.
>>
>> At this point who gets blamed if something is wrong?
>
> The PMC are responsible for what goes into a release, not an individual. They establish social norms and use tools to make it easier to track the IP. I imagine who pushed a particular change to a particular branch is useful information to make available to the developers, but it seems separate to this discussion.
>
>>
>> Where as, if we instead check that the Git commiter (%ce) field is
>> person who should be cleared by a CLA then we don't have such
>> questions. And we also don't have to concern ourselves for when the
>> initial branch was developed outside of the canonical ASF repository
>> (ie, not in the reflog).
>
> An observation I wanted to make from this thread: we seem to talk a lot about a group of people who have a CLA on file but are not committers (ASF sense of the word). It is great if we can make signing CLAs easy and making contributions with one easy. However, I believe that group of people should be small at any point in time - regular contributions should mean that person is on the path to becoming a committer.
>

The general sentiment as I understand it is that the number of people
with an ICLA that aren't ASF committers will be expected to grow
considerably. The general idea as I understand it is to make this step
less painful so even less-than-regular contributors have an ICLA on
file. It's also expected that this process will smooth the on-boarding
process for new committers as well.

>>
>>> * So, instead of checking the %ce field, my proposal would be a check
>>> that verifies that either the recorded author or at least one of the
>>> (possible multiple) signers-off of an incoming git commit has a CLA on
>>> file. And for this to only require rewriting of commits ("git commit
>>> --amend -s") when actually needed, the check should be against all the
>>> recorded CLAs (ideally with possibly multiple email addresses per CLA)
>>> instead of just the committers of a project. Thus an Apache committer
>>> would only need to modify an incoming commit if its author doesn't
>>> already have a CLA on file.
>>>
>>
>> Minus the disagreement on the field we check, this is the goal I'm
>> going for. Anyone with a CLA on file can contribute without requiring
>> that the patch goes through JIRA. I'm open to revising which fields
>> are checked, but it still seems most obvious that the committer (%ce
>> field) is the person directly responsible for having cleared the code
>> and hence is what should be checked.
>
>
> I was also having trouble splitting the difference between the approaches. Signing off commits doesn't seem any easier than re-committing commits. It avoids altering the committer field, but the more this goes on it seems that field isn't useful information. On the other hand, checking it does provide a useful utility to projects and simplifies the description of roles in the project.
>
> Paul's approach still sounds reasonable to me, at least as the starting point.
>
> What I would find helpful is to list out the important scenarios, and how they'd play out with the different approaches, along with pros and cons. Is that something that is available from the CouchDB documentation?
>

There aren't any docs on this currently because we're still operating
in compliance with historical guidelines on moving patches through
JIRA. The current check is the very strict, "every %ce field must be
an Apache CouchDB committer," version.

Sam has shown me a couple places to harvest email aliases and ICLA
email addresses from so I plan on opening this up to any email address
from those sources, with the future being an LDAP query (under the
assumption that having an ICLA on file results in an LDAP entry).

> Cheers,
> Brett
>
> --
> Brett Porter
> brett@apache.org
> http://brettporter.wordpress.com/
> http://au.linkedin.com/in/brettporter
> http://twitter.com/brettporter
>
>
>
>
>

Re: @apache.org commit address requirement (Was: Git hosting is go)

Posted by Brett Porter <br...@apache.org>.
I have a couple of tangential points to this, had to guess what the right place to insert them into the discussion was...

On 21/12/2011, at 7:04 AM, Paul Davis wrote:

> The reflog is useful, but I don't think is the end of the story. For
> instance, what happens if an ASF committer works on a branch, and then
> a second ASF committer pushes those changes to master? The reflog
> would tell us that the first ASF committer introduced the code into
> the repository, but the second ASF committer is the person responsible
> for that code eventually going into a release which is (from what I
> understand) the more important of the two roles.
> 
> At this point who gets blamed if something is wrong?

The PMC are responsible for what goes into a release, not an individual. They establish social norms and use tools to make it easier to track the IP. I imagine who pushed a particular change to a particular branch is useful information to make available to the developers, but it seems separate to this discussion.

> 
> Where as, if we instead check that the Git commiter (%ce) field is
> person who should be cleared by a CLA then we don't have such
> questions. And we also don't have to concern ourselves for when the
> initial branch was developed outside of the canonical ASF repository
> (ie, not in the reflog).

An observation I wanted to make from this thread: we seem to talk a lot about a group of people who have a CLA on file but are not committers (ASF sense of the word). It is great if we can make signing CLAs easy and making contributions with one easy. However, I believe that group of people should be small at any point in time - regular contributions should mean that person is on the path to becoming a committer.

> 
>> * So, instead of checking the %ce field, my proposal would be a check
>> that verifies that either the recorded author or at least one of the
>> (possible multiple) signers-off of an incoming git commit has a CLA on
>> file. And for this to only require rewriting of commits ("git commit
>> --amend -s") when actually needed, the check should be against all the
>> recorded CLAs (ideally with possibly multiple email addresses per CLA)
>> instead of just the committers of a project. Thus an Apache committer
>> would only need to modify an incoming commit if its author doesn't
>> already have a CLA on file.
>> 
> 
> Minus the disagreement on the field we check, this is the goal I'm
> going for. Anyone with a CLA on file can contribute without requiring
> that the patch goes through JIRA. I'm open to revising which fields
> are checked, but it still seems most obvious that the committer (%ce
> field) is the person directly responsible for having cleared the code
> and hence is what should be checked.


I was also having trouble splitting the difference between the approaches. Signing off commits doesn't seem any easier than re-committing commits. It avoids altering the committer field, but the more this goes on it seems that field isn't useful information. On the other hand, checking it does provide a useful utility to projects and simplifies the description of roles in the project. 

Paul's approach still sounds reasonable to me, at least as the starting point.

What I would find helpful is to list out the important scenarios, and how they'd play out with the different approaches, along with pros and cons. Is that something that is available from the CouchDB documentation?

Cheers,
Brett

--
Brett Porter
brett@apache.org
http://brettporter.wordpress.com/
http://au.linkedin.com/in/brettporter
http://twitter.com/brettporter






Re: @apache.org commit address requirement (Was: Git hosting is go)

Posted by Brian LeRoux <b...@brian.io>.
FWIW, this is how the PhoneGap/Cordova project has always worked.
We're very diligent about monitering the IP before drawing in new
work. Only a few ppl actually perform merges and none happen w/o
validation by at least two individuals (usually more). We coordinate
via the mailing list. Seem to me this very compatible w/ the Apache
Way but of course we are still learning.


On Tue, Dec 20, 2011 at 5:08 PM, Jukka Zitting <ju...@gmail.com> wrote:
> Hi,
>
> On Wed, Dec 21, 2011 at 2:03 AM, David Jencks <da...@yahoo.com> wrote:
>> I'm really not understanding what the point of doing anything other than checking that
>> the person who pushes the work into the asf repository is an asf committer on the
>> project.  That's all we do for svn, right? We can do reports or whatever on the
>> additional git metadata but until there's a demonstrated problem I don't see why
>> we need to solve it.
>
> +1
>
> BR,
>
> Jukka Zitting

Re: @apache.org commit address requirement (Was: Git hosting is go)

Posted by Jukka Zitting <ju...@gmail.com>.
Hi,

On Wed, Dec 21, 2011 at 2:03 AM, David Jencks <da...@yahoo.com> wrote:
> I'm really not understanding what the point of doing anything other than checking that
> the person who pushes the work into the asf repository is an asf committer on the
> project.  That's all we do for svn, right? We can do reports or whatever on the
> additional git metadata but until there's a demonstrated problem I don't see why
> we need to solve it.

+1

BR,

Jukka Zitting

Re: @apache.org commit address requirement (Was: Git hosting is go)

Posted by Paul Davis <pa...@gmail.com>.
On Tue, Dec 20, 2011 at 9:08 PM, David Jencks <da...@yahoo.com> wrote:
>
> On Dec 20, 2011, at 5:46 PM, Paul Davis wrote:
>
>> On Tue, Dec 20, 2011 at 7:03 PM, David Jencks <da...@yahoo.com> wrote:
>>>
>>> On Dec 20, 2011, at 3:48 PM, Paul Davis wrote:
>>>
>>>> On Tue, Dec 20, 2011 at 2:22 PM, Jeremy Thomerson
>>>> <je...@thomersonfamily.com> wrote:
>>>>> On Tue, Dec 20, 2011 at 3:04 PM, Paul Davis <pa...@gmail.com>wrote:
>>>>>
>>>>>> On Mon, Dec 19, 2011 at 5:15 PM, Jukka Zitting <ju...@gmail.com>
>>>>>> wrote:
>>> <giant snip>
>>>>
>>>> Once again I'm going to point out that current patches must move
>>>> through JIRA. Assuming people follow this policy then the %ce field is
>>>> by definition an ASF committer on the Apache project in question. Full
>>>> stop.
>>>>
>>>
>>> What exactly do you mean by "patch moves through jira" for git?
>>>
>>
>> Ie, the patch goes to JIRA, the ASF committer the downloads the patch
>> from JIRA, applies, reviews, and then if it checks out, finally pushes
>> it to master (or where ever is appropriate for integration based on
>> that project's workflows). Cassandra has scripts [1] that automate
>> most of this.
>>
>>> I think it means that there's a jira issue in apache jira with a  pointer to the (set of) git commits and the commit message in git has the jira #.  On the projects I work with (with svn) the mention of the jira in the commit message results in jira being able to link to the changes, and we expect all committers to have a jira issue for all non-trivial changes.  I hope the same will be true for git.  I think a pointer to the git change set in some repo is equivalent to attaching a patch to a jira issue.
>>>
>>
>> I'm not sure how a pointer to some change set satisfies the policy.
>> The underlying motivation for submitting the patch to JIRA is to
>> indicate "I submit this code to be included under the ASL 2.0" which
>> doesn't seem to hold up if the code isn't actually attached to the
>> ticket with the little check box clicked.
>
> doesn't the hash identify the contribution well enough?  Are you worried about hash collisions?  Or that the external repo might disappear?  When does moving the work into apache git change the hash?
>

Not necessarily. No. Yes. Depends.

> If we're talking about other ways of applying work from outside asf to inside how do these concerns differ?  They seem totally unrelated to the concerns previously expressed in this thread.
>

I'm not sure what you're asking here so I can't comment.

> thanks
> david jencks
>
>
>>
>>> I'm really not understanding what the point of doing anything other than checking that the person who pushes the work into the asf repository is an asf committer on the project.  That's all we do for svn, right?  We can do reports or whatever on the additional git metadata but until there's a demonstrated problem I don't see why we need to solve it.
>>>
>>> thanks
>>> david jencks
>>>
>>>
>>
>> Git is not the same as SVN. This is specifically dealing with the
>> distributed nature of Git and how we can deal with enforcing
>> constraints on code uploaded to ASF canonical repos that may have come
>> from anywhere. As I've tried to point out if we're just going to
>> maintain the same policies we have for SVN then this is all moot
>> because patches would have to be applied by hand instead being pulled
>> from arbitrary remote repos. The entire motivation for having these
>> checks is to maintain the provenance for contributions without
>> requiring that every patch moves through JIRA.
>>
>> [1] https://github.com/eevans/git-jira-attacher/
>

Re: @apache.org commit address requirement (Was: Git hosting is go)

Posted by David Jencks <da...@yahoo.com>.
On Dec 20, 2011, at 5:46 PM, Paul Davis wrote:

> On Tue, Dec 20, 2011 at 7:03 PM, David Jencks <da...@yahoo.com> wrote:
>> 
>> On Dec 20, 2011, at 3:48 PM, Paul Davis wrote:
>> 
>>> On Tue, Dec 20, 2011 at 2:22 PM, Jeremy Thomerson
>>> <je...@thomersonfamily.com> wrote:
>>>> On Tue, Dec 20, 2011 at 3:04 PM, Paul Davis <pa...@gmail.com>wrote:
>>>> 
>>>>> On Mon, Dec 19, 2011 at 5:15 PM, Jukka Zitting <ju...@gmail.com>
>>>>> wrote:
>> <giant snip>
>>> 
>>> Once again I'm going to point out that current patches must move
>>> through JIRA. Assuming people follow this policy then the %ce field is
>>> by definition an ASF committer on the Apache project in question. Full
>>> stop.
>>> 
>> 
>> What exactly do you mean by "patch moves through jira" for git?
>> 
> 
> Ie, the patch goes to JIRA, the ASF committer the downloads the patch
> from JIRA, applies, reviews, and then if it checks out, finally pushes
> it to master (or where ever is appropriate for integration based on
> that project's workflows). Cassandra has scripts [1] that automate
> most of this.
> 
>> I think it means that there's a jira issue in apache jira with a  pointer to the (set of) git commits and the commit message in git has the jira #.  On the projects I work with (with svn) the mention of the jira in the commit message results in jira being able to link to the changes, and we expect all committers to have a jira issue for all non-trivial changes.  I hope the same will be true for git.  I think a pointer to the git change set in some repo is equivalent to attaching a patch to a jira issue.
>> 
> 
> I'm not sure how a pointer to some change set satisfies the policy.
> The underlying motivation for submitting the patch to JIRA is to
> indicate "I submit this code to be included under the ASL 2.0" which
> doesn't seem to hold up if the code isn't actually attached to the
> ticket with the little check box clicked.

doesn't the hash identify the contribution well enough?  Are you worried about hash collisions?  Or that the external repo might disappear?  When does moving the work into apache git change the hash?

If we're talking about other ways of applying work from outside asf to inside how do these concerns differ?  They seem totally unrelated to the concerns previously expressed in this thread.

thanks
david jencks


> 
>> I'm really not understanding what the point of doing anything other than checking that the person who pushes the work into the asf repository is an asf committer on the project.  That's all we do for svn, right?  We can do reports or whatever on the additional git metadata but until there's a demonstrated problem I don't see why we need to solve it.
>> 
>> thanks
>> david jencks
>> 
>> 
> 
> Git is not the same as SVN. This is specifically dealing with the
> distributed nature of Git and how we can deal with enforcing
> constraints on code uploaded to ASF canonical repos that may have come
> from anywhere. As I've tried to point out if we're just going to
> maintain the same policies we have for SVN then this is all moot
> because patches would have to be applied by hand instead being pulled
> from arbitrary remote repos. The entire motivation for having these
> checks is to maintain the provenance for contributions without
> requiring that every patch moves through JIRA.
> 
> [1] https://github.com/eevans/git-jira-attacher/


Re: @apache.org commit address requirement (Was: Git hosting is go)

Posted by Paul Davis <pa...@gmail.com>.
On Tue, Dec 20, 2011 at 9:57 PM, Roy T. Fielding <fi...@gbiv.com> wrote:
> On Dec 20, 2011, at 5:46 PM, Paul Davis wrote:
>
>> On Tue, Dec 20, 2011 at 7:03 PM, David Jencks <da...@yahoo.com> wrote:
>>>
>>> On Dec 20, 2011, at 3:48 PM, Paul Davis wrote:
>>>
>>>> On Tue, Dec 20, 2011 at 2:22 PM, Jeremy Thomerson
>>>> <je...@thomersonfamily.com> wrote:
>>>>> On Tue, Dec 20, 2011 at 3:04 PM, Paul Davis <pa...@gmail.com>wrote:
>>>>>
>>>>>> On Mon, Dec 19, 2011 at 5:15 PM, Jukka Zitting <ju...@gmail.com>
>>>>>> wrote:
>>> <giant snip>
>>>>
>>>> Once again I'm going to point out that current patches must move
>>>> through JIRA. Assuming people follow this policy then the %ce field is
>>>> by definition an ASF committer on the Apache project in question. Full
>>>> stop.
>>>>
>>>
>>> What exactly do you mean by "patch moves through jira" for git?
>>
>> Ie, the patch goes to JIRA, the ASF committer the downloads the patch
>> from JIRA, applies, reviews, and then if it checks out, finally pushes
>> it to master (or where ever is appropriate for integration based on
>> that project's workflows). Cassandra has scripts [1] that automate
>> most of this.
>
> Apache projects are not required to use Jira.  Contributions can
> be contributed using any of our communication forums and they are
> considered to be under the Apache License 2.0.  If the author happens
> to have a CLA on file, then the CLA overrides the normal contribution
> license automatically -- there is no need to check that.
>
> There is no reason to apply this extra level of control within
> infrastructure for checking things that any reasonably competent
> committer can be trusted to do themselves.  And there is a known
> reason not to do so, namely that the committer field in git has
> nothing to do with the provenance of the code, but may in fact
> vary for the same individual depending on whether they are
> interacting with a public repository or their work's repository,
> or maybe even their club's repository.  Github is certainly one
> example where the committer names will not match our avail names,
> and one of the goals of this effort is to enable folks to
> use Github as one of many forums for collaborating with potential
> recruits.
>
> Yes, that opinion comes from me speaking as a board member and
> author of the Apache License, and has previously been cleared
> with Apache's legal team for a long ago discussion with Incubator.
> We don't need a CLA on file to accept contributions from non-committers.
> We just need a clear intent by the author to contribute under
> our normal terms.
>

Noted.

>>> I think it means that there's a jira issue in apache jira with a  pointer to the (set of) git commits and the commit message in git has the jira #.  On the projects I work with (with svn) the mention of the jira in the commit message results in jira being able to link to the changes, and we expect all committers to have a jira issue for all non-trivial changes.  I hope the same will be true for git.  I think a pointer to the git change set in some repo is equivalent to attaching a patch to a jira issue.
>>>
>>
>> I'm not sure how a pointer to some change set satisfies the policy.
>> The underlying motivation for submitting the patch to JIRA is to
>> indicate "I submit this code to be included under the ASL 2.0" which
>> doesn't seem to hold up if the code isn't actually attached to the
>> ticket with the little check box clicked.
>
> We have archives on all of our communication channels.  We don't
> need the silly checkbox.  We never have.
>

The issue I was trying to address with this is that pulling from
GitHub doesn't go through any communication channel that indicates the
contribution was intentional.

>>> I'm really not understanding what the point of doing anything other than checking that the person who pushes the work into the asf repository is an asf committer on the project.  That's all we do for svn, right?  We can do reports or whatever on the additional git metadata but until there's a demonstrated problem I don't see why we need to solve it.
>>>
>>> thanks
>>> david jencks
>>>
>>>
>>
>> Git is not the same as SVN. This is specifically dealing with the
>> distributed nature of Git and how we can deal with enforcing
>> constraints on code uploaded to ASF canonical repos that may have come
>> from anywhere. As I've tried to point out if we're just going to
>> maintain the same policies we have for SVN then this is all moot
>> because patches would have to be applied by hand instead being pulled
>> from arbitrary remote repos.
>
> No, they wouldn't.  What makes you think that I can't implement a
> tool to apply changes found on a forked subversion instances?
> I certainly have the right to do so as an Apache committer.
> That is a common operating procedure for some of our projects
> where we have a commercially-supported fork in house.
> I am trusted to commit to the ASF repository only those changes
> that are intended for contribution to Apache.  Git just includes
> those tools for us and standardizes the process/identifiers.
>
>> The entire motivation for having these
>> checks is to maintain the provenance for contributions without
>> requiring that every patch moves through JIRA.
>
> Again, there is no such requirement for commits/pushes at Apache.
> The person responsible for moving the bits into our repository
> is responsible for verifying that they have the right to do so
> before the push is made.  The authors do not need to have a CLA
> on file even if the contribution is massive -- CLAs are only
> required for the people who want an account at Apache and thus
> are allowed to make the decision to push those bits into our
> repository.
>

This contradicts basically every policy I've ever heard on accepting
contributions as a committer. It's quite possible that I've completely
misunderstood everything I've been told. Although everyone I've talked
to (small sample) has expressed the same confused reaction to how this
doesn't jive with policy as taught by the incubator.

> The CLA has requirements on submitting third-party contributions
> that must be adhered to when pushing stuff to Apache.  We expect
> that those requirements will be satisfied by the commit log.
> As it turns out, that is best accomplished by ensuring that the
> original committer and author identifiers remain those of the
> original author and not that of the pusher, even if it is the
> same person (with different IDs on different repos).  If not,
> the pusher needs to change the log in order to add a
> "Submitted-by: ..." note and whatever else needs to be said in
> accordance with the CLA.  This is independent of how the
> contribution is originally submitted to Apache, and it is the
> PMC's responsibility to ensure all its committers do so when
> appropriate.
>
> ....Roy

Just a note, the author fields are about the only thing you can make
reasonable guarantees won't change. And even then that's if committers
take care to make sure it doesn't change in some specific
circumstances that anyone not familiar with Git will be prone to no
knowing.

Also, specifically adding a "Submitted-by:" field (signed-off-by?)
will change everything that's not the author field. The entire point
of Signed-off-by was to record this information because the sha and
committer information will change as patches are emailed around. I'd
also strongly caution about forming any policy that resolves around
commit sha's and committer information not changing.

That said, you've made it clear that the board's position is that
these checks are unnecessary so I'll remove them.

Re: @apache.org commit address requirement (Was: Git hosting is go)

Posted by Paul Davis <pa...@gmail.com>.
On Wed, Dec 21, 2011 at 2:14 AM, Roy T. Fielding <fi...@gbiv.com> wrote:
> On Dec 20, 2011, at 11:05 PM, Brett Porter wrote:
>
>> [snip a whole lot of very useful information about contributions, CLAs and the license that I agree with Roy about]
>>
>> On 21/12/2011, at 2:57 PM, Roy T. Fielding wrote:
>>
>>> The CLA has requirements on submitting third-party contributions
>>> that must be adhered to when pushing stuff to Apache.  We expect
>>> that those requirements will be satisfied by the commit log.
>>> As it turns out, that is best accomplished by ensuring that the
>>> original committer and author identifiers remain those of the
>>> original author and not that of the pusher, even if it is the
>>> same person (with different IDs on different repos).  If not,
>>> the pusher needs to change the log in order to add a
>>> "Submitted-by: ..." note and whatever else needs to be said in
>>> accordance with the CLA.  This is independent of how the
>>> contribution is originally submitted to Apache, and it is the
>>> PMC's responsibility to ensure all its committers do so when
>>> appropriate.
>>
>>
>> TL;DR summary: given that, I think the check is still useful (summary & question at the end if you don't wish to educate me on git)
>>
>> I am a relative newbie to git, so I may just need someone to explain something to me more.
>>
>> In some situations Roy has suggested changing the commit log would be required. My understanding is that requires something like:
>>
>> $ git rebase -i
>>
>> This will also change the "Committer" field, leaving "Author" intact.
>>
>> Jukka has suggested "Signed-off-by:", which I understand can be done with:
>>
>> $ git --amend -s
>>
>> This will also change the "Committer" field, leaving "Author" intact, making the change fairly pointless for our purposes. It can't be done easily to commits > 1 deep in the history.
>>
>> Unless I'm missing something, any change to the commit logs means Committer is made to be your own address. Is that right?
>
> I believe so, yes, since the log is included in the changeset hash.
>
>> If so, we are only talking about examples where the commit log remains 100% intact.
>>
>> Working through an example:
>>
>> A pull request referred to by Jukka: https://github.com/callback/callback-ios/pull/36
>> Contains a commit: https://github.com/forcedotcom/callback-ios/commit/1c3a5b3d70ede5853f1cae35a6159d48192685cb
>>
>> -----
>> commit 1c3a5b3d70ede5853f1cae35a6159d48192685cb
>> Author:     Todd Stellanova <ts...@salesforce.com>
>> AuthorDate: Fri Nov 11 14:20:52 2011 -0800
>> Commit:     Todd Stellanova <ts...@salesforce.com>
>> CommitDate: Fri Nov 11 14:20:52 2011 -0800
>>
>>    refractor PhoneGapDelegate to allow teardown and reinit of web view
>> ------
>>
>> My understanding is that doing this is normal:
>> $ git fetch forcedotcom
>> $ git merge forcedotcom/master
>> $ git push origin master
>>
>> That would leave the commit message above intact.
>>
>> As a project member, that information seems unhelpful, as we need to go to another source (apparently from the git server ref-update.log) to find out which ASF committer actually put it into the tree. The mail notification may indicate that (I haven't checked what we do there), but it's still detaching some important information from the source control.
>
> I think that is far less harmful than causing an unnecessary change
> to the hash.
>
>> Adjusting the committer ID is easy:
>> $ git rebase --no-ff
>> (or if just a few need to be handled, git rebase -i)
>>
>> This does mean all the commit IDs get changed (same as if you have to edit a commit log above, or you alter the commit stream it in some other way). But the downstream user can still work with that, either with:
>>
>> $ git merge origin/trunk  (some extra commits in the local repository)
>> $ git rebase origin/trunk (catch up with origin again)
>>
>> There is the case of someone different from the author committing to the other repository: we would overwrite some information in this case. I can't, however, think of a use case where we care about that information. The author is important, and the ASF committer is important - in a trade off, I would take the ASF committer information over the some-other-repo committer.
>>
>> Summary - I see the committer check with the following pros:
>> - ensure commits contain information about which ASF committer was involved (not for provenance, but for project members to know what is going on with a commit)
>> - prevent accidentally committing rubbish information (default server email address, work email address you didn't mean to publish on the web)
>> And cons:
>> - inconvenience of rebasing
>> - losing track of committers to other repositories that are not the author.
>>
>> I would presume "committer check" means allowing any name & email that they elect to use, as Paul has described. I think it also means not requiring rebasing of commits from other ASF committers.
>>
>> On the weight of that, I would find the check worthwhile. I also see no advantage in replacing it by a signed-off-by check.
>>
>> Do others disagree with this conclusion but agree on the data points, or have I missed some data, pros or cons?
>
> I agree with everything except "ensure commits contain information about
> which ASF committer was involved (not for provenance, but for project members
> to know what is going on with a commit)".  I think that could be accomplished
> simply by using the pusher's Apache ID in the email notification that is sent
> to the commits list.
>
> I wish that Git allowed appends to the commit log, but it doesn't, and there
> is significant value in not changing the patch identifiers (the hash). In
> fact, I'd argue that changing the commit hash is the primary con, because
> it is the one thing that truly ties the pushed content to the contributor's
> intent to contribute.
>
> IOW, if I receive a pull request or changeset on a public list from a
> contributor that does not have a CLA on file, I want to preserve that
> changeset exactly as contributed so that the commit log will automatically
> refer to that third-party and the hash can be used to search for their
> contribution request.  I would only want to modify the changeset
> and provide a new log entry for those few cases where the existing
> information is not sufficient to satisfy my own CLA. [And, for that
> case, we should make it a good practice to include the original changeset
> identifier in the new log entry.]
>
> Regardless, the committer information is rarely shown by git -- you have
> to do something like
>
>  git log --pretty=fuller
>
> just to see it.  The commit hash is far more valuable information.
>
> ....Roy
>

I'd like to reiterate that the commit hash is not a unique identifier
for a commit. Its an identifier for a commit in relation to all
commits that precede it. It should in no way be considered a stable
identifier until it's on a stable branch which is by definition after
the contribution has been accepted.

I'd strongly recommend reconsidering any sort of policy or
recommendation that treats these hashes as special.

Re: @apache.org commit address requirement (Was: Git hosting is go)

Posted by Paul Davis <pa...@gmail.com>.
On Wed, Dec 21, 2011 at 12:19 PM, Jeremy Thomerson
<je...@thomersonfamily.com> wrote:
>
>
> On Wed, Dec 21, 2011 at 1:13 PM, Paul Davis <pa...@gmail.com>
> wrote:
>>
>> On Wed, Dec 21, 2011 at 7:35 AM, Jeremy Thomerson
>> <je...@thomersonfamily.com> wrote:
>> > On Wed, Dec 21, 2011 at 5:12 AM, Brett Porter <br...@apache.org> wrote:
>> >
>> >> > IOW, if I receive a pull request or changeset on a public list from a
>> >> > contributor that does not have a CLA on file, I want to preserve that
>> >> > changeset exactly as contributed so that the commit log will
>> >> automatically
>> >> > refer to that third-party and the hash can be used to search for
>> >> > their
>> >> > contribution request.  I would only want to modify the changeset
>> >> > and provide a new log entry for those few cases where the existing
>> >> > information is not sufficient to satisfy my own CLA. [And, for that
>> >> > case, we should make it a good practice to include the original
>> >> > changeset
>> >> > identifier in the new log entry.]
>> >>
>> >> Thanks - that's a much clearer justification than keeping the committer
>> >> intact, which I hadn't gleaned from the earlier conversation.
>> >>
>> >> My limited experience with external contributions has been mixed
>> >> between
>> >> pull requests being merged and rebased, and I got the impression
>> >> rebasing
>> >> was used more often than perhaps it is. I also don't tend to maintain a
>> >> repo that diverges from the "central" one, so I'm not too concerned
>> >> about
>> >> the commit IDs on mine.
>> >
>> >
>> > Roy's description above is the exact reason that I am in favor of
>> > removing
>> > the "push hook" (or making it optional per-project).  I would like to
>> > see a
>> > future workflow where:
>> >
>> > 1 - someone emails our dev list saying "please merge commit ABC123 from
>> > my
>> > repo http://example.org/...
>> > - or -
>> > 1 - someone creates a pull request on GH, and this gets emailed to our
>> > dev
>> > list
>> >
>> > 2 - we merge their commits in unchanged
>> >
>> > By doing so, we have clear "intent to contribute" with the hash of their
>> > commit(s).
>> >
>> > Additionally, there's a huge benefit to keeping the commit IDs the same.
>> >  I
>> > just tested this with my own playground repo.  Someone created a GH pull
>> > request [1].  I did *not* use the GH interface to close it.  Instead I
>> > manually added their repo and merged their commits (by commit ID).  Then
>> > I
>> > pushed to my repo.  The pull request automatically closed.  This means
>> > that
>> > we can fully integrate ASF projects with GH pull requests without having
>> > to
>> > use the GH UI or use it as canonical, etc.  And, we have very clear
>> > "intent
>> > to contribute" because if they create a pull request on GH, they are
>> > explicitly asking us to contribute them.  We'll want those archived on
>> > our
>> > hardware (mailing list message would be sufficient), but it's a very
>> > easy
>> > flow that involves little in the way of integration.  I have some ideas
>> > to
>> > make this a very simple process for our committers - but I'll leave that
>> > for another thread.
>> >
>> > So, can we please remove the current restriction or make it a
>> > configuration
>> > option?
>> >
>> > [1] https://github.com/jthomerson/github-playground/pull/1
>> >
>> > Jeremy
>>
>> Yup. Removed:
>>
>>
>> https://git-wip-us.apache.org/repos/infra?p=asfgit-admin.git;a=commitdiff;h=85c6f83f832fc4a13315b1225d27038215db63d5
>
>
> Thanks Paul!  When you commit a change there does it automatically get
> pushed to the actual git hosting?  Or do you have to do that manually?
>
> I'd still like to have a better community consensus around the understanding
> of CLA requirements, but I don't think that needs to (continue) happen(ing)
> on the infra-dev list.
>
> Jeremy Thomerson

Its self hosting. Its live as soon as the push finishes.

Re: @apache.org commit address requirement (Was: Git hosting is go)

Posted by Jeremy Thomerson <je...@thomersonfamily.com>.
On Wed, Dec 21, 2011 at 1:13 PM, Paul Davis <pa...@gmail.com>wrote:

> On Wed, Dec 21, 2011 at 7:35 AM, Jeremy Thomerson
> <je...@thomersonfamily.com> wrote:
> > On Wed, Dec 21, 2011 at 5:12 AM, Brett Porter <br...@apache.org> wrote:
> >
> >> > IOW, if I receive a pull request or changeset on a public list from a
> >> > contributor that does not have a CLA on file, I want to preserve that
> >> > changeset exactly as contributed so that the commit log will
> >> automatically
> >> > refer to that third-party and the hash can be used to search for their
> >> > contribution request.  I would only want to modify the changeset
> >> > and provide a new log entry for those few cases where the existing
> >> > information is not sufficient to satisfy my own CLA. [And, for that
> >> > case, we should make it a good practice to include the original
> changeset
> >> > identifier in the new log entry.]
> >>
> >> Thanks - that's a much clearer justification than keeping the committer
> >> intact, which I hadn't gleaned from the earlier conversation.
> >>
> >> My limited experience with external contributions has been mixed between
> >> pull requests being merged and rebased, and I got the impression
> rebasing
> >> was used more often than perhaps it is. I also don't tend to maintain a
> >> repo that diverges from the "central" one, so I'm not too concerned
> about
> >> the commit IDs on mine.
> >
> >
> > Roy's description above is the exact reason that I am in favor of
> removing
> > the "push hook" (or making it optional per-project).  I would like to
> see a
> > future workflow where:
> >
> > 1 - someone emails our dev list saying "please merge commit ABC123 from
> my
> > repo http://example.org/...
> > - or -
> > 1 - someone creates a pull request on GH, and this gets emailed to our
> dev
> > list
> >
> > 2 - we merge their commits in unchanged
> >
> > By doing so, we have clear "intent to contribute" with the hash of their
> > commit(s).
> >
> > Additionally, there's a huge benefit to keeping the commit IDs the same.
>  I
> > just tested this with my own playground repo.  Someone created a GH pull
> > request [1].  I did *not* use the GH interface to close it.  Instead I
> > manually added their repo and merged their commits (by commit ID).  Then
> I
> > pushed to my repo.  The pull request automatically closed.  This means
> that
> > we can fully integrate ASF projects with GH pull requests without having
> to
> > use the GH UI or use it as canonical, etc.  And, we have very clear
> "intent
> > to contribute" because if they create a pull request on GH, they are
> > explicitly asking us to contribute them.  We'll want those archived on
> our
> > hardware (mailing list message would be sufficient), but it's a very easy
> > flow that involves little in the way of integration.  I have some ideas
> to
> > make this a very simple process for our committers - but I'll leave that
> > for another thread.
> >
> > So, can we please remove the current restriction or make it a
> configuration
> > option?
> >
> > [1] https://github.com/jthomerson/github-playground/pull/1
> >
> > Jeremy
>
> Yup. Removed:
>
>
> https://git-wip-us.apache.org/repos/infra?p=asfgit-admin.git;a=commitdiff;h=85c6f83f832fc4a13315b1225d27038215db63d5
>

Thanks Paul!  When you commit a change there does it automatically get
pushed to the actual git hosting?  Or do you have to do that manually?

I'd still like to have a better community consensus around the
understanding of CLA requirements, but I don't think that needs to
(continue) happen(ing) on the infra-dev list.

Jeremy Thomerson

Re: @apache.org commit address requirement (Was: Git hosting is go)

Posted by Paul Davis <pa...@gmail.com>.
On Wed, Dec 21, 2011 at 7:35 AM, Jeremy Thomerson
<je...@thomersonfamily.com> wrote:
> On Wed, Dec 21, 2011 at 5:12 AM, Brett Porter <br...@apache.org> wrote:
>
>> > IOW, if I receive a pull request or changeset on a public list from a
>> > contributor that does not have a CLA on file, I want to preserve that
>> > changeset exactly as contributed so that the commit log will
>> automatically
>> > refer to that third-party and the hash can be used to search for their
>> > contribution request.  I would only want to modify the changeset
>> > and provide a new log entry for those few cases where the existing
>> > information is not sufficient to satisfy my own CLA. [And, for that
>> > case, we should make it a good practice to include the original changeset
>> > identifier in the new log entry.]
>>
>> Thanks - that's a much clearer justification than keeping the committer
>> intact, which I hadn't gleaned from the earlier conversation.
>>
>> My limited experience with external contributions has been mixed between
>> pull requests being merged and rebased, and I got the impression rebasing
>> was used more often than perhaps it is. I also don't tend to maintain a
>> repo that diverges from the "central" one, so I'm not too concerned about
>> the commit IDs on mine.
>
>
> Roy's description above is the exact reason that I am in favor of removing
> the "push hook" (or making it optional per-project).  I would like to see a
> future workflow where:
>
> 1 - someone emails our dev list saying "please merge commit ABC123 from my
> repo http://example.org/...
> - or -
> 1 - someone creates a pull request on GH, and this gets emailed to our dev
> list
>
> 2 - we merge their commits in unchanged
>
> By doing so, we have clear "intent to contribute" with the hash of their
> commit(s).
>
> Additionally, there's a huge benefit to keeping the commit IDs the same.  I
> just tested this with my own playground repo.  Someone created a GH pull
> request [1].  I did *not* use the GH interface to close it.  Instead I
> manually added their repo and merged their commits (by commit ID).  Then I
> pushed to my repo.  The pull request automatically closed.  This means that
> we can fully integrate ASF projects with GH pull requests without having to
> use the GH UI or use it as canonical, etc.  And, we have very clear "intent
> to contribute" because if they create a pull request on GH, they are
> explicitly asking us to contribute them.  We'll want those archived on our
> hardware (mailing list message would be sufficient), but it's a very easy
> flow that involves little in the way of integration.  I have some ideas to
> make this a very simple process for our committers - but I'll leave that
> for another thread.
>
> So, can we please remove the current restriction or make it a configuration
> option?
>
> [1] https://github.com/jthomerson/github-playground/pull/1
>
> Jeremy

Yup. Removed:

https://git-wip-us.apache.org/repos/infra?p=asfgit-admin.git;a=commitdiff;h=85c6f83f832fc4a13315b1225d27038215db63d5

Re: @apache.org commit address requirement (Was: Git hosting is go)

Posted by Jeremy Thomerson <je...@thomersonfamily.com>.
On Wed, Dec 21, 2011 at 5:12 AM, Brett Porter <br...@apache.org> wrote:

> > IOW, if I receive a pull request or changeset on a public list from a
> > contributor that does not have a CLA on file, I want to preserve that
> > changeset exactly as contributed so that the commit log will
> automatically
> > refer to that third-party and the hash can be used to search for their
> > contribution request.  I would only want to modify the changeset
> > and provide a new log entry for those few cases where the existing
> > information is not sufficient to satisfy my own CLA. [And, for that
> > case, we should make it a good practice to include the original changeset
> > identifier in the new log entry.]
>
> Thanks - that's a much clearer justification than keeping the committer
> intact, which I hadn't gleaned from the earlier conversation.
>
> My limited experience with external contributions has been mixed between
> pull requests being merged and rebased, and I got the impression rebasing
> was used more often than perhaps it is. I also don't tend to maintain a
> repo that diverges from the "central" one, so I'm not too concerned about
> the commit IDs on mine.


Roy's description above is the exact reason that I am in favor of removing
the "push hook" (or making it optional per-project).  I would like to see a
future workflow where:

1 - someone emails our dev list saying "please merge commit ABC123 from my
repo http://example.org/...
- or -
1 - someone creates a pull request on GH, and this gets emailed to our dev
list

2 - we merge their commits in unchanged

By doing so, we have clear "intent to contribute" with the hash of their
commit(s).

Additionally, there's a huge benefit to keeping the commit IDs the same.  I
just tested this with my own playground repo.  Someone created a GH pull
request [1].  I did *not* use the GH interface to close it.  Instead I
manually added their repo and merged their commits (by commit ID).  Then I
pushed to my repo.  The pull request automatically closed.  This means that
we can fully integrate ASF projects with GH pull requests without having to
use the GH UI or use it as canonical, etc.  And, we have very clear "intent
to contribute" because if they create a pull request on GH, they are
explicitly asking us to contribute them.  We'll want those archived on our
hardware (mailing list message would be sufficient), but it's a very easy
flow that involves little in the way of integration.  I have some ideas to
make this a very simple process for our committers - but I'll leave that
for another thread.

So, can we please remove the current restriction or make it a configuration
option?

[1] https://github.com/jthomerson/github-playground/pull/1

Jeremy

Re: @apache.org commit address requirement (Was: Git hosting is go)

Posted by Brett Porter <br...@apache.org>.
On 21/12/2011, at 7:14 PM, Roy T. Fielding wrote:

> 
> I agree with everything except "ensure commits contain information about
> which ASF committer was involved (not for provenance, but for project members
> to know what is going on with a commit)".  I think that could be accomplished
> simply by using the pusher's Apache ID in the email notification that is sent
> to the commits list.
> 
> I wish that Git allowed appends to the commit log, but it doesn't, and there
> is significant value in not changing the patch identifiers (the hash). In
> fact, I'd argue that changing the commit hash is the primary con, because
> it is the one thing that truly ties the pushed content to the contributor's
> intent to contribute.
> 
> IOW, if I receive a pull request or changeset on a public list from a
> contributor that does not have a CLA on file, I want to preserve that
> changeset exactly as contributed so that the commit log will automatically
> refer to that third-party and the hash can be used to search for their
> contribution request.  I would only want to modify the changeset
> and provide a new log entry for those few cases where the existing
> information is not sufficient to satisfy my own CLA. [And, for that
> case, we should make it a good practice to include the original changeset
> identifier in the new log entry.]

Thanks - that's a much clearer justification than keeping the committer intact, which I hadn't gleaned from the earlier conversation.

My limited experience with external contributions has been mixed between pull requests being merged and rebased, and I got the impression rebasing was used more often than perhaps it is. I also don't tend to maintain a repo that diverges from the "central" one, so I'm not too concerned about the commit IDs on mine.

> 
> Regardless, the committer information is rarely shown by git -- you have
> to do something like
> 
>  git log --pretty=fuller
> 
> just to see it.  The commit hash is far more valuable information.

Right, but other tools are showing it. I was hoping for something like these:
https://github.com/brettporter/centrepoint/commit/6df5bf435e3a90ec4e036c19ad6cdfee2f3906fa
https://git-wip-us.apache.org/repos/asf?p=couchdb.git;a=commit;h=93b7f34bca748284c65eb2d471b33628a966aa0f

If it can be found by mail or some other web tool provided by infra by searching for the hash, then I agree that's sufficient even if not ideal.

The only other mechanism I can think of to attach it to the history would be to force a merge commit, but I don't like that option any better and it doesn't seem any more easily navigable.

Is there still some value in validating the committer emails against a project managed list (the CONTRIBUTORS/pom.xml solution I mentioned earlier - which can automatically include anyone with an ICLA and perhaps be optional per project), as a means of preventing accidents? The nature of these being a little hidden in some git processes has bitten myself and others on a number of occasions.

Cheers,
Brett

--
Brett Porter
brett@apache.org
http://brettporter.wordpress.com/
http://au.linkedin.com/in/brettporter
http://twitter.com/brettporter






Re: @apache.org commit address requirement (Was: Git hosting is go)

Posted by "Roy T. Fielding" <fi...@gbiv.com>.
On Dec 20, 2011, at 11:05 PM, Brett Porter wrote:

> [snip a whole lot of very useful information about contributions, CLAs and the license that I agree with Roy about]
> 
> On 21/12/2011, at 2:57 PM, Roy T. Fielding wrote:
> 
>> The CLA has requirements on submitting third-party contributions
>> that must be adhered to when pushing stuff to Apache.  We expect
>> that those requirements will be satisfied by the commit log.
>> As it turns out, that is best accomplished by ensuring that the
>> original committer and author identifiers remain those of the
>> original author and not that of the pusher, even if it is the
>> same person (with different IDs on different repos).  If not,
>> the pusher needs to change the log in order to add a
>> "Submitted-by: ..." note and whatever else needs to be said in
>> accordance with the CLA.  This is independent of how the
>> contribution is originally submitted to Apache, and it is the
>> PMC's responsibility to ensure all its committers do so when
>> appropriate.
> 
> 
> TL;DR summary: given that, I think the check is still useful (summary & question at the end if you don't wish to educate me on git)
> 
> I am a relative newbie to git, so I may just need someone to explain something to me more.
> 
> In some situations Roy has suggested changing the commit log would be required. My understanding is that requires something like:
> 
> $ git rebase -i
> 
> This will also change the "Committer" field, leaving "Author" intact.
> 
> Jukka has suggested "Signed-off-by:", which I understand can be done with:
> 
> $ git --amend -s
> 
> This will also change the "Committer" field, leaving "Author" intact, making the change fairly pointless for our purposes. It can't be done easily to commits > 1 deep in the history.
> 
> Unless I'm missing something, any change to the commit logs means Committer is made to be your own address. Is that right?

I believe so, yes, since the log is included in the changeset hash.

> If so, we are only talking about examples where the commit log remains 100% intact.
> 
> Working through an example:
> 
> A pull request referred to by Jukka: https://github.com/callback/callback-ios/pull/36
> Contains a commit: https://github.com/forcedotcom/callback-ios/commit/1c3a5b3d70ede5853f1cae35a6159d48192685cb
> 
> -----
> commit 1c3a5b3d70ede5853f1cae35a6159d48192685cb
> Author:     Todd Stellanova <ts...@salesforce.com>
> AuthorDate: Fri Nov 11 14:20:52 2011 -0800
> Commit:     Todd Stellanova <ts...@salesforce.com>
> CommitDate: Fri Nov 11 14:20:52 2011 -0800
> 
>    refractor PhoneGapDelegate to allow teardown and reinit of web view
> ------
> 
> My understanding is that doing this is normal:
> $ git fetch forcedotcom
> $ git merge forcedotcom/master
> $ git push origin master
> 
> That would leave the commit message above intact.
> 
> As a project member, that information seems unhelpful, as we need to go to another source (apparently from the git server ref-update.log) to find out which ASF committer actually put it into the tree. The mail notification may indicate that (I haven't checked what we do there), but it's still detaching some important information from the source control.

I think that is far less harmful than causing an unnecessary change
to the hash.

> Adjusting the committer ID is easy:
> $ git rebase --no-ff
> (or if just a few need to be handled, git rebase -i)
> 
> This does mean all the commit IDs get changed (same as if you have to edit a commit log above, or you alter the commit stream it in some other way). But the downstream user can still work with that, either with:
> 
> $ git merge origin/trunk  (some extra commits in the local repository)
> $ git rebase origin/trunk (catch up with origin again)
> 
> There is the case of someone different from the author committing to the other repository: we would overwrite some information in this case. I can't, however, think of a use case where we care about that information. The author is important, and the ASF committer is important - in a trade off, I would take the ASF committer information over the some-other-repo committer.
> 
> Summary - I see the committer check with the following pros:
> - ensure commits contain information about which ASF committer was involved (not for provenance, but for project members to know what is going on with a commit)
> - prevent accidentally committing rubbish information (default server email address, work email address you didn't mean to publish on the web)
> And cons:
> - inconvenience of rebasing
> - losing track of committers to other repositories that are not the author.
> 
> I would presume "committer check" means allowing any name & email that they elect to use, as Paul has described. I think it also means not requiring rebasing of commits from other ASF committers.
> 
> On the weight of that, I would find the check worthwhile. I also see no advantage in replacing it by a signed-off-by check.
> 
> Do others disagree with this conclusion but agree on the data points, or have I missed some data, pros or cons?

I agree with everything except "ensure commits contain information about
which ASF committer was involved (not for provenance, but for project members
to know what is going on with a commit)".  I think that could be accomplished
simply by using the pusher's Apache ID in the email notification that is sent
to the commits list.

I wish that Git allowed appends to the commit log, but it doesn't, and there
is significant value in not changing the patch identifiers (the hash). In
fact, I'd argue that changing the commit hash is the primary con, because
it is the one thing that truly ties the pushed content to the contributor's
intent to contribute.

IOW, if I receive a pull request or changeset on a public list from a
contributor that does not have a CLA on file, I want to preserve that
changeset exactly as contributed so that the commit log will automatically
refer to that third-party and the hash can be used to search for their
contribution request.  I would only want to modify the changeset
and provide a new log entry for those few cases where the existing
information is not sufficient to satisfy my own CLA. [And, for that
case, we should make it a good practice to include the original changeset
identifier in the new log entry.]

Regardless, the committer information is rarely shown by git -- you have
to do something like

  git log --pretty=fuller

just to see it.  The commit hash is far more valuable information.

....Roy


Re: @apache.org commit address requirement (Was: Git hosting is go)

Posted by Brett Porter <br...@apache.org>.
[snip a whole lot of very useful information about contributions, CLAs and the license that I agree with Roy about]

On 21/12/2011, at 2:57 PM, Roy T. Fielding wrote:

> The CLA has requirements on submitting third-party contributions
> that must be adhered to when pushing stuff to Apache.  We expect
> that those requirements will be satisfied by the commit log.
> As it turns out, that is best accomplished by ensuring that the
> original committer and author identifiers remain those of the
> original author and not that of the pusher, even if it is the
> same person (with different IDs on different repos).  If not,
> the pusher needs to change the log in order to add a
> "Submitted-by: ..." note and whatever else needs to be said in
> accordance with the CLA.  This is independent of how the
> contribution is originally submitted to Apache, and it is the
> PMC's responsibility to ensure all its committers do so when
> appropriate.


TL;DR summary: given that, I think the check is still useful (summary & question at the end if you don't wish to educate me on git)

I am a relative newbie to git, so I may just need someone to explain something to me more.

In some situations Roy has suggested changing the commit log would be required. My understanding is that requires something like:

$ git rebase -i

This will also change the "Committer" field, leaving "Author" intact.

Jukka has suggested "Signed-off-by:", which I understand can be done with:

$ git --amend -s

This will also change the "Committer" field, leaving "Author" intact, making the change fairly pointless for our purposes. It can't be done easily to commits > 1 deep in the history.

Unless I'm missing something, any change to the commit logs means Committer is made to be your own address. Is that right?


If so, we are only talking about examples where the commit log remains 100% intact.

Working through an example:

A pull request referred to by Jukka: https://github.com/callback/callback-ios/pull/36
Contains a commit: https://github.com/forcedotcom/callback-ios/commit/1c3a5b3d70ede5853f1cae35a6159d48192685cb

-----
commit 1c3a5b3d70ede5853f1cae35a6159d48192685cb
Author:     Todd Stellanova <ts...@salesforce.com>
AuthorDate: Fri Nov 11 14:20:52 2011 -0800
Commit:     Todd Stellanova <ts...@salesforce.com>
CommitDate: Fri Nov 11 14:20:52 2011 -0800

    refractor PhoneGapDelegate to allow teardown and reinit of web view
------

My understanding is that doing this is normal:
$ git fetch forcedotcom
$ git merge forcedotcom/master
$ git push origin master

That would leave the commit message above intact.

As a project member, that information seems unhelpful, as we need to go to another source (apparently from the git server ref-update.log) to find out which ASF committer actually put it into the tree. The mail notification may indicate that (I haven't checked what we do there), but it's still detaching some important information from the source control.

Adjusting the committer ID is easy:
$ git rebase --no-ff
(or if just a few need to be handled, git rebase -i)

This does mean all the commit IDs get changed (same as if you have to edit a commit log above, or you alter the commit stream it in some other way). But the downstream user can still work with that, either with:

$ git merge origin/trunk  (some extra commits in the local repository)
$ git rebase origin/trunk (catch up with origin again)

There is the case of someone different from the author committing to the other repository: we would overwrite some information in this case. I can't, however, think of a use case where we care about that information. The author is important, and the ASF committer is important - in a trade off, I would take the ASF committer information over the some-other-repo committer.

Summary - I see the committer check with the following pros:
- ensure commits contain information about which ASF committer was involved (not for provenance, but for project members to know what is going on with a commit)
- prevent accidentally committing rubbish information (default server email address, work email address you didn't mean to publish on the web)
And cons:
- inconvenience of rebasing
- losing track of committers to other repositories that are not the author.

I would presume "committer check" means allowing any name & email that they elect to use, as Paul has described. I think it also means not requiring rebasing of commits from other ASF committers.

On the weight of that, I would find the check worthwhile. I also see no advantage in replacing it by a signed-off-by check.

Do others disagree with this conclusion but agree on the data points, or have I missed some data, pros or cons?

Thanks,
Brett

--
Brett Porter
brett@apache.org
http://brettporter.wordpress.com/
http://au.linkedin.com/in/brettporter
http://twitter.com/brettporter






Re: @apache.org commit address requirement (Was: Git hosting is go)

Posted by "Roy T. Fielding" <fi...@gbiv.com>.
On Dec 20, 2011, at 5:46 PM, Paul Davis wrote:

> On Tue, Dec 20, 2011 at 7:03 PM, David Jencks <da...@yahoo.com> wrote:
>> 
>> On Dec 20, 2011, at 3:48 PM, Paul Davis wrote:
>> 
>>> On Tue, Dec 20, 2011 at 2:22 PM, Jeremy Thomerson
>>> <je...@thomersonfamily.com> wrote:
>>>> On Tue, Dec 20, 2011 at 3:04 PM, Paul Davis <pa...@gmail.com>wrote:
>>>> 
>>>>> On Mon, Dec 19, 2011 at 5:15 PM, Jukka Zitting <ju...@gmail.com>
>>>>> wrote:
>> <giant snip>
>>> 
>>> Once again I'm going to point out that current patches must move
>>> through JIRA. Assuming people follow this policy then the %ce field is
>>> by definition an ASF committer on the Apache project in question. Full
>>> stop.
>>> 
>> 
>> What exactly do you mean by "patch moves through jira" for git?
> 
> Ie, the patch goes to JIRA, the ASF committer the downloads the patch
> from JIRA, applies, reviews, and then if it checks out, finally pushes
> it to master (or where ever is appropriate for integration based on
> that project's workflows). Cassandra has scripts [1] that automate
> most of this.

Apache projects are not required to use Jira.  Contributions can
be contributed using any of our communication forums and they are
considered to be under the Apache License 2.0.  If the author happens
to have a CLA on file, then the CLA overrides the normal contribution
license automatically -- there is no need to check that.

There is no reason to apply this extra level of control within
infrastructure for checking things that any reasonably competent
committer can be trusted to do themselves.  And there is a known
reason not to do so, namely that the committer field in git has
nothing to do with the provenance of the code, but may in fact
vary for the same individual depending on whether they are
interacting with a public repository or their work's repository,
or maybe even their club's repository.  Github is certainly one
example where the committer names will not match our avail names,
and one of the goals of this effort is to enable folks to
use Github as one of many forums for collaborating with potential
recruits.

Yes, that opinion comes from me speaking as a board member and
author of the Apache License, and has previously been cleared
with Apache's legal team for a long ago discussion with Incubator.
We don't need a CLA on file to accept contributions from non-committers.
We just need a clear intent by the author to contribute under
our normal terms.

>> I think it means that there's a jira issue in apache jira with a  pointer to the (set of) git commits and the commit message in git has the jira #.  On the projects I work with (with svn) the mention of the jira in the commit message results in jira being able to link to the changes, and we expect all committers to have a jira issue for all non-trivial changes.  I hope the same will be true for git.  I think a pointer to the git change set in some repo is equivalent to attaching a patch to a jira issue.
>> 
> 
> I'm not sure how a pointer to some change set satisfies the policy.
> The underlying motivation for submitting the patch to JIRA is to
> indicate "I submit this code to be included under the ASL 2.0" which
> doesn't seem to hold up if the code isn't actually attached to the
> ticket with the little check box clicked.

We have archives on all of our communication channels.  We don't
need the silly checkbox.  We never have.

>> I'm really not understanding what the point of doing anything other than checking that the person who pushes the work into the asf repository is an asf committer on the project.  That's all we do for svn, right?  We can do reports or whatever on the additional git metadata but until there's a demonstrated problem I don't see why we need to solve it.
>> 
>> thanks
>> david jencks
>> 
>> 
> 
> Git is not the same as SVN. This is specifically dealing with the
> distributed nature of Git and how we can deal with enforcing
> constraints on code uploaded to ASF canonical repos that may have come
> from anywhere. As I've tried to point out if we're just going to
> maintain the same policies we have for SVN then this is all moot
> because patches would have to be applied by hand instead being pulled
> from arbitrary remote repos.

No, they wouldn't.  What makes you think that I can't implement a
tool to apply changes found on a forked subversion instances?
I certainly have the right to do so as an Apache committer.
That is a common operating procedure for some of our projects
where we have a commercially-supported fork in house.
I am trusted to commit to the ASF repository only those changes
that are intended for contribution to Apache.  Git just includes
those tools for us and standardizes the process/identifiers.

> The entire motivation for having these
> checks is to maintain the provenance for contributions without
> requiring that every patch moves through JIRA.

Again, there is no such requirement for commits/pushes at Apache.
The person responsible for moving the bits into our repository
is responsible for verifying that they have the right to do so
before the push is made.  The authors do not need to have a CLA
on file even if the contribution is massive -- CLAs are only
required for the people who want an account at Apache and thus
are allowed to make the decision to push those bits into our
repository.

The CLA has requirements on submitting third-party contributions
that must be adhered to when pushing stuff to Apache.  We expect
that those requirements will be satisfied by the commit log.
As it turns out, that is best accomplished by ensuring that the
original committer and author identifiers remain those of the
original author and not that of the pusher, even if it is the
same person (with different IDs on different repos).  If not,
the pusher needs to change the log in order to add a
"Submitted-by: ..." note and whatever else needs to be said in
accordance with the CLA.  This is independent of how the
contribution is originally submitted to Apache, and it is the
PMC's responsibility to ensure all its committers do so when
appropriate.

....Roy

Re: @apache.org commit address requirement (Was: Git hosting is go)

Posted by Paul Davis <pa...@gmail.com>.
On Tue, Dec 20, 2011 at 7:03 PM, David Jencks <da...@yahoo.com> wrote:
>
> On Dec 20, 2011, at 3:48 PM, Paul Davis wrote:
>
>> On Tue, Dec 20, 2011 at 2:22 PM, Jeremy Thomerson
>> <je...@thomersonfamily.com> wrote:
>>> On Tue, Dec 20, 2011 at 3:04 PM, Paul Davis <pa...@gmail.com>wrote:
>>>
>>>> On Mon, Dec 19, 2011 at 5:15 PM, Jukka Zitting <ju...@gmail.com>
>>>> wrote:
> <giant snip>
>>
>> Once again I'm going to point out that current patches must move
>> through JIRA. Assuming people follow this policy then the %ce field is
>> by definition an ASF committer on the Apache project in question. Full
>> stop.
>>
>
> What exactly do you mean by "patch moves through jira" for git?
>

Ie, the patch goes to JIRA, the ASF committer the downloads the patch
from JIRA, applies, reviews, and then if it checks out, finally pushes
it to master (or where ever is appropriate for integration based on
that project's workflows). Cassandra has scripts [1] that automate
most of this.

> I think it means that there's a jira issue in apache jira with a  pointer to the (set of) git commits and the commit message in git has the jira #.  On the projects I work with (with svn) the mention of the jira in the commit message results in jira being able to link to the changes, and we expect all committers to have a jira issue for all non-trivial changes.  I hope the same will be true for git.  I think a pointer to the git change set in some repo is equivalent to attaching a patch to a jira issue.
>

I'm not sure how a pointer to some change set satisfies the policy.
The underlying motivation for submitting the patch to JIRA is to
indicate "I submit this code to be included under the ASL 2.0" which
doesn't seem to hold up if the code isn't actually attached to the
ticket with the little check box clicked.

> I'm really not understanding what the point of doing anything other than checking that the person who pushes the work into the asf repository is an asf committer on the project.  That's all we do for svn, right?  We can do reports or whatever on the additional git metadata but until there's a demonstrated problem I don't see why we need to solve it.
>
> thanks
> david jencks
>
>

Git is not the same as SVN. This is specifically dealing with the
distributed nature of Git and how we can deal with enforcing
constraints on code uploaded to ASF canonical repos that may have come
from anywhere. As I've tried to point out if we're just going to
maintain the same policies we have for SVN then this is all moot
because patches would have to be applied by hand instead being pulled
from arbitrary remote repos. The entire motivation for having these
checks is to maintain the provenance for contributions without
requiring that every patch moves through JIRA.

[1] https://github.com/eevans/git-jira-attacher/

Re: @apache.org commit address requirement (Was: Git hosting is go)

Posted by David Jencks <da...@yahoo.com>.
On Dec 20, 2011, at 3:48 PM, Paul Davis wrote:

> On Tue, Dec 20, 2011 at 2:22 PM, Jeremy Thomerson
> <je...@thomersonfamily.com> wrote:
>> On Tue, Dec 20, 2011 at 3:04 PM, Paul Davis <pa...@gmail.com>wrote:
>> 
>>> On Mon, Dec 19, 2011 at 5:15 PM, Jukka Zitting <ju...@gmail.com>
>>> wrote:
<giant snip>
> 
> Once again I'm going to point out that current patches must move
> through JIRA. Assuming people follow this policy then the %ce field is
> by definition an ASF committer on the Apache project in question. Full
> stop.
> 

What exactly do you mean by "patch moves through jira" for git?

I think it means that there's a jira issue in apache jira with a  pointer to the (set of) git commits and the commit message in git has the jira #.  On the projects I work with (with svn) the mention of the jira in the commit message results in jira being able to link to the changes, and we expect all committers to have a jira issue for all non-trivial changes.  I hope the same will be true for git.  I think a pointer to the git change set in some repo is equivalent to attaching a patch to a jira issue.

I'm really not understanding what the point of doing anything other than checking that the person who pushes the work into the asf repository is an asf committer on the project.  That's all we do for svn, right?  We can do reports or whatever on the additional git metadata but until there's a demonstrated problem I don't see why we need to solve it.

thanks
david jencks



Re: @apache.org commit address requirement (Was: Git hosting is go)

Posted by Paul Davis <pa...@gmail.com>.
On Tue, Dec 20, 2011 at 2:22 PM, Jeremy Thomerson
<je...@thomersonfamily.com> wrote:
> On Tue, Dec 20, 2011 at 3:04 PM, Paul Davis <pa...@gmail.com>wrote:
>
>> On Mon, Dec 19, 2011 at 5:15 PM, Jukka Zitting <ju...@gmail.com>
>> wrote:
>> > Hi,
>> >
>> > Perhaps we indeed should try discussing this on a higher level.
>> >
>> > I guess we all agree that in the push log we already have a mechanism
>> > that can be used to trace each line of code in a repository back to a
>> > CLA (or a software grant for the initial upload). This was the key
>> > issue that needed to be solved to bring Git up to par with Subversion
>> > in terms of code provenance.
>> >
>>
>> The reflog is useful, but I don't think is the end of the story. For
>> instance, what happens if an ASF committer works on a branch, and then
>> a second ASF committer pushes those changes to master? The reflog
>> would tell us that the first ASF committer introduced the code into
>> the repository, but the second ASF committer is the person responsible
>> for that code eventually going into a release which is (from what I
>> understand) the more important of the two roles.
>
>
>> At this point who gets blamed if something is wrong?
>>
>
> The guy who pushed it into the branch that is released, which is in this
> case the "pusher", which I think corresponds to your "first ASF
> committer".  This is the same as if I am doing development in a branch in
> SVN and some random Joe grabs my work and merges it to trunk.  When someone
> cuts a release it's his fault that it's in the release - not mine as the
> guy who wrote it (somewhere where it wouldn't be released).
>

No, that'd be the second ASF committer that pulled commits into
master. The fact that there's confusion on who's who only serves to
underscore the main point that the whole thing is confusing, so why
not validate CLA's at the commit level?

> Where as, if we instead check that the Git commiter (%ce) field is
>> person who should be cleared by a CLA then we don't have such
>> questions. And we also don't have to concern ourselves for when the
>> initial branch was developed outside of the canonical ASF repository
>> (ie, not in the reflog).
>
>
> But it still brings us back to some guy not being able to actually use the
> distributed feature of the version control system because if he doesn't
> have an ICLA he can not commit (into his own repo) and then ask us to pull
> from that repo.
>

I have no idea what you mean here. Any external contributor can do
whatever they want to their git repo and ask people to pull from
whatever repo and do whatever their hearts desire. What this *does* do
is check that when an ASF committer pushes to the ASF canonical repo
that the commits are tied to a CLA. The current system bases this off
the %ce field. This means that to "sign off" on a commit, someone
cherry-pick'ed or otherwise updated the committer field to reflect
that that ASF committer has cleared the IP as per ASF requirements.

> IMHO, anyone should be able to commit to any repo they want and push to any
> repo they have access to (whether it's on their server, Github or similar
> alternative sites), and then ask us to pull it.  If it's trivial, we can
> pull without ICLA - the same as taking a trivial patch.  If it's
> non-trivial, they need to either have an ICLA or submit it to JIRA and
> check the box.  If it's major, ICLA and possibly more paperwork needs to be
> in order.  But in all those cases it is up to the ASF project committer to
> actually check this before pushing someone else's commits into ASF repo.
>

You keep talking about pulling and it confuses me. These checks are
not based on pulling commits to a local repo. They only come into
effect when the commit is pushed to the ASF canonical repo.

>
>>  > The topic of this debate then turns more around extra rules for
>> > enforcing that even non-committers who contribute a "large enough"
>> > chunk of code should have a CLA on file. The questions then are a)
>> > should we do this, and if yes, b) what's the best way to do it? Here's
>> > my thinking:
>> >
>> > a) So far I haven't seen a compelling reason of why we should have
>> > such rules, just vague references on how we can't trust people to do
>> > the right thing. You asked for a discussion on the "resulting issues
>> > of not having these checks". What issues are those? Why don't we have
>> > them already with Subversion? The idea of such checks sounds useful,
>> > but I'm not convinced that we need them in practice. That said, I'm
>> > open to being convinced otherwise. Until that, -0.
>> >
>>
>> If we don't have such checks and we don't change the existing
>> policies, then we have to ensure that all committers understand that
>> patches go through JIRA and must be applied with git am or similar.
>>
>> There's also the issue that Git patch sets aren't necessarily the same
>> as SVN patches. Git best practices instruct people to write smaller
>> patches in larger numbers so we'd also need to provide new rules of
>> thumb in when a patch set is significantly non-trivial while
>> reflecting on this issue.
>>
>> This isn't about not trusting people so much as our current policies
>> don't allow people to do things they might otherwise assume are
>> reasonable. For instance, your original motivation in this thread had
>> to do with Pull Requests on GitHub. With our current policies, these
>> are forbidden for all but the most trivial of patches (without a good
>> definition for trivial in Git parlance). The entire motivation for
>> having them was to enable participation with contributors using
>> GitHub's model.
>>
>> > b) Assuming there is consensus on having these checks, I think using
>> > %ce is reasonable starting point but not an ideal solution. At the
>> > risk of bikeshedding, here's my alternative proposal:
>> >
>> > * I agree with your idea that a good way to approach this would be to
>> > require extra scrutiny on all changes that come from people who don't
>> > have a CLA on file. Asking a committer to explicitly "approve" such
>> > changes (beyond the simple act committing/pushing them) seems like a
>> > reasonable requirement. I also agree that such approval would be best
>> > recorded in a commit message (even when doing so may break Git
>> > history; that's a good incentive on getting the CLA filed).
>> >
>> > * I disagree that the %ce field is a good place for such information.
>> > First, using it for this purpose overwrites existing information
>> > recorded for a different purpose. Second, the %ce field always gets
>> > set in a git commit/rebase/cherry-pick/etc. operation, regardless of
>> > whether such "approval" of the contribution is indeed intended. Third,
>> > %ce is a pretty low-level field that few people even know about (it
>> > doesn't show up by default in "git log" and isn't visible on Github).
>> >
>> > * IMHO a better place for this information would be a Signed-off-by
>> > line added to the commit message. The Signed-off-by convention was
>> > explicitly created for this purpose [1] and is well supported by many
>> > tools (starting with "git commit -s"). It's also a well-known and
>> > -documented convention whose purpose practically everyone understands
>> > (SCO...). Finally, such Signed-off-by lines are by default visible in
>> > "git log" and on other tools like Github.
>> >
>>
>> I'm not a big fan of managing this with the signed-off-by approach.
>> While it does exist as a "standard" in some projects, it seems like
>> its just a hack because they didn't assign the same legal weight to
>> the committer field before it was too late.
>>
>> I agree that everyone understands it practically, not very many people
>> use it in my experience. Some of the big projects like Linux and Git
>> use it due to SCO issues. I think some tools like Gerrit insert it as
>> well. Other than that I don't see it as a huge standard being picked
>> up by more normal sized projects.
>>
>> Technical implementation wise, it's also a bit of a burden as its not
>> an actual field in the commit message and has to be parsed out by hand
>> AFAIK. Not insurmountable but its mildly annoying.
>>
>> I do see your point about accidentally updating commits with various
>> commands that might not be the intended action. Though I think once
>> people are going to push things towards the canonical repository then
>> their intention should be pretty clear that they've checked and intend
>> these commits for inclusion and have done their duty as required by
>> the CLA.
>>
>
> Doesn't that mean that we're saying "yes, an ASF committer will do his due
> diligence before pushing to ASF canonical repo"?  If so, why are we
> imposing the technical barrier?
>

Yes and I don't know where the second question comes from. Words are
getting a bit overloaded for me to follow your intention.

Yes, ASF committers must do their due diligence. But unless someone
has changed policy about patches going through JIRA then this isn't a
technical barrier at all because by definition the commit will have
the ASF committers email address as the %ce field and all this does is
tell people when they're breaking policy.

>
>>  > * So, instead of checking the %ce field, my proposal would be a check
>> > that verifies that either the recorded author or at least one of the
>> > (possible multiple) signers-off of an incoming git commit has a CLA on
>> > file. And for this to only require rewriting of commits ("git commit
>> > --amend -s") when actually needed, the check should be against all the
>> > recorded CLAs (ideally with possibly multiple email addresses per CLA)
>> > instead of just the committers of a project. Thus an Apache committer
>> > would only need to modify an incoming commit if its author doesn't
>> > already have a CLA on file.
>> >
>>
>> Minus the disagreement on the field we check, this is the goal I'm
>> going for. Anyone with a CLA on file can contribute without requiring
>> that the patch goes through JIRA. I'm open to revising which fields
>> are checked, but it still seems most obvious that the committer (%ce
>> field) is the person directly responsible for having cleared the code
>> and hence is what should be checked.
>>
>
> If we are going to enforce a technical barrier for git pushes, I agree with
> the above that it should be:
>  * any email address on file for a CLA (not just @apache.org)
>  * anyone in the chain of author/committer/signed off by
>
> That being said, I still think that it's fine to just leave it as:
>
> The guy who is pushing is authenticated as an ASF committer for this
> project, and it's his responsibility to ensure that what he is pushing is
> okay to be pushed - the same as it was his responsibility to make sure that
> any code he committed to SVN was okay to commit.  I haven't seen any reason
> for making this a technical restriction when it has been a social
> restriction for years (social in that it's up to our society of ASF
> committers to do the due diligence before committing code, not some commit
> hook that checks it for them).  Sure, git gives us more metadata like
> author/committer/etc, but IMO just because we have that doesn't mean that
> we need to make new rules based on it.
>
> Jeremy

Once again I'm going to point out that current patches must move
through JIRA. Assuming people follow this policy then the %ce field is
by definition an ASF committer on the Apache project in question. Full
stop.

As it currently stands, if people attempt to push something that
doesn't meet this check then they have either misconfigured their
client or they're breaking policy (or they're pushing trivial patches
which, as the name suggests, are trivial to fix). Given this, why
would we *not* want a check here.

Now obviously, this makes Git a bit limiting to people that are used
to using it extensively as a platform for trading patches. In order to
allow such use it has been planned for quite some time to ease the
registration of an ICLA (hopefully even digitally) and an email
address that we can use to clear commits automatically.

Re: @apache.org commit address requirement (Was: Git hosting is go)

Posted by Jeremy Thomerson <je...@thomersonfamily.com>.
On Tue, Dec 20, 2011 at 3:04 PM, Paul Davis <pa...@gmail.com>wrote:

> On Mon, Dec 19, 2011 at 5:15 PM, Jukka Zitting <ju...@gmail.com>
> wrote:
> > Hi,
> >
> > Perhaps we indeed should try discussing this on a higher level.
> >
> > I guess we all agree that in the push log we already have a mechanism
> > that can be used to trace each line of code in a repository back to a
> > CLA (or a software grant for the initial upload). This was the key
> > issue that needed to be solved to bring Git up to par with Subversion
> > in terms of code provenance.
> >
>
> The reflog is useful, but I don't think is the end of the story. For
> instance, what happens if an ASF committer works on a branch, and then
> a second ASF committer pushes those changes to master? The reflog
> would tell us that the first ASF committer introduced the code into
> the repository, but the second ASF committer is the person responsible
> for that code eventually going into a release which is (from what I
> understand) the more important of the two roles.


> At this point who gets blamed if something is wrong?
>

The guy who pushed it into the branch that is released, which is in this
case the "pusher", which I think corresponds to your "first ASF
committer".  This is the same as if I am doing development in a branch in
SVN and some random Joe grabs my work and merges it to trunk.  When someone
cuts a release it's his fault that it's in the release - not mine as the
guy who wrote it (somewhere where it wouldn't be released).

Where as, if we instead check that the Git commiter (%ce) field is
> person who should be cleared by a CLA then we don't have such
> questions. And we also don't have to concern ourselves for when the
> initial branch was developed outside of the canonical ASF repository
> (ie, not in the reflog).


But it still brings us back to some guy not being able to actually use the
distributed feature of the version control system because if he doesn't
have an ICLA he can not commit (into his own repo) and then ask us to pull
from that repo.

IMHO, anyone should be able to commit to any repo they want and push to any
repo they have access to (whether it's on their server, Github or similar
alternative sites), and then ask us to pull it.  If it's trivial, we can
pull without ICLA - the same as taking a trivial patch.  If it's
non-trivial, they need to either have an ICLA or submit it to JIRA and
check the box.  If it's major, ICLA and possibly more paperwork needs to be
in order.  But in all those cases it is up to the ASF project committer to
actually check this before pushing someone else's commits into ASF repo.


>  > The topic of this debate then turns more around extra rules for
> > enforcing that even non-committers who contribute a "large enough"
> > chunk of code should have a CLA on file. The questions then are a)
> > should we do this, and if yes, b) what's the best way to do it? Here's
> > my thinking:
> >
> > a) So far I haven't seen a compelling reason of why we should have
> > such rules, just vague references on how we can't trust people to do
> > the right thing. You asked for a discussion on the "resulting issues
> > of not having these checks". What issues are those? Why don't we have
> > them already with Subversion? The idea of such checks sounds useful,
> > but I'm not convinced that we need them in practice. That said, I'm
> > open to being convinced otherwise. Until that, -0.
> >
>
> If we don't have such checks and we don't change the existing
> policies, then we have to ensure that all committers understand that
> patches go through JIRA and must be applied with git am or similar.
>
> There's also the issue that Git patch sets aren't necessarily the same
> as SVN patches. Git best practices instruct people to write smaller
> patches in larger numbers so we'd also need to provide new rules of
> thumb in when a patch set is significantly non-trivial while
> reflecting on this issue.
>
> This isn't about not trusting people so much as our current policies
> don't allow people to do things they might otherwise assume are
> reasonable. For instance, your original motivation in this thread had
> to do with Pull Requests on GitHub. With our current policies, these
> are forbidden for all but the most trivial of patches (without a good
> definition for trivial in Git parlance). The entire motivation for
> having them was to enable participation with contributors using
> GitHub's model.
>
> > b) Assuming there is consensus on having these checks, I think using
> > %ce is reasonable starting point but not an ideal solution. At the
> > risk of bikeshedding, here's my alternative proposal:
> >
> > * I agree with your idea that a good way to approach this would be to
> > require extra scrutiny on all changes that come from people who don't
> > have a CLA on file. Asking a committer to explicitly "approve" such
> > changes (beyond the simple act committing/pushing them) seems like a
> > reasonable requirement. I also agree that such approval would be best
> > recorded in a commit message (even when doing so may break Git
> > history; that's a good incentive on getting the CLA filed).
> >
> > * I disagree that the %ce field is a good place for such information.
> > First, using it for this purpose overwrites existing information
> > recorded for a different purpose. Second, the %ce field always gets
> > set in a git commit/rebase/cherry-pick/etc. operation, regardless of
> > whether such "approval" of the contribution is indeed intended. Third,
> > %ce is a pretty low-level field that few people even know about (it
> > doesn't show up by default in "git log" and isn't visible on Github).
> >
> > * IMHO a better place for this information would be a Signed-off-by
> > line added to the commit message. The Signed-off-by convention was
> > explicitly created for this purpose [1] and is well supported by many
> > tools (starting with "git commit -s"). It's also a well-known and
> > -documented convention whose purpose practically everyone understands
> > (SCO...). Finally, such Signed-off-by lines are by default visible in
> > "git log" and on other tools like Github.
> >
>
> I'm not a big fan of managing this with the signed-off-by approach.
> While it does exist as a "standard" in some projects, it seems like
> its just a hack because they didn't assign the same legal weight to
> the committer field before it was too late.
>
> I agree that everyone understands it practically, not very many people
> use it in my experience. Some of the big projects like Linux and Git
> use it due to SCO issues. I think some tools like Gerrit insert it as
> well. Other than that I don't see it as a huge standard being picked
> up by more normal sized projects.
>
> Technical implementation wise, it's also a bit of a burden as its not
> an actual field in the commit message and has to be parsed out by hand
> AFAIK. Not insurmountable but its mildly annoying.
>
> I do see your point about accidentally updating commits with various
> commands that might not be the intended action. Though I think once
> people are going to push things towards the canonical repository then
> their intention should be pretty clear that they've checked and intend
> these commits for inclusion and have done their duty as required by
> the CLA.
>

Doesn't that mean that we're saying "yes, an ASF committer will do his due
diligence before pushing to ASF canonical repo"?  If so, why are we
imposing the technical barrier?


>  > * So, instead of checking the %ce field, my proposal would be a check
> > that verifies that either the recorded author or at least one of the
> > (possible multiple) signers-off of an incoming git commit has a CLA on
> > file. And for this to only require rewriting of commits ("git commit
> > --amend -s") when actually needed, the check should be against all the
> > recorded CLAs (ideally with possibly multiple email addresses per CLA)
> > instead of just the committers of a project. Thus an Apache committer
> > would only need to modify an incoming commit if its author doesn't
> > already have a CLA on file.
> >
>
> Minus the disagreement on the field we check, this is the goal I'm
> going for. Anyone with a CLA on file can contribute without requiring
> that the patch goes through JIRA. I'm open to revising which fields
> are checked, but it still seems most obvious that the committer (%ce
> field) is the person directly responsible for having cleared the code
> and hence is what should be checked.
>

If we are going to enforce a technical barrier for git pushes, I agree with
the above that it should be:
 * any email address on file for a CLA (not just @apache.org)
 * anyone in the chain of author/committer/signed off by

That being said, I still think that it's fine to just leave it as:

The guy who is pushing is authenticated as an ASF committer for this
project, and it's his responsibility to ensure that what he is pushing is
okay to be pushed - the same as it was his responsibility to make sure that
any code he committed to SVN was okay to commit.  I haven't seen any reason
for making this a technical restriction when it has been a social
restriction for years (social in that it's up to our society of ASF
committers to do the due diligence before committing code, not some commit
hook that checks it for them).  Sure, git gives us more metadata like
author/committer/etc, but IMO just because we have that doesn't mean that
we need to make new rules based on it.

Jeremy

Re: @apache.org commit address requirement (Was: Git hosting is go)

Posted by Paul Davis <pa...@gmail.com>.
On Mon, Dec 19, 2011 at 5:15 PM, Jukka Zitting <ju...@gmail.com> wrote:
> Hi,
>
> Perhaps we indeed should try discussing this on a higher level.
>
> I guess we all agree that in the push log we already have a mechanism
> that can be used to trace each line of code in a repository back to a
> CLA (or a software grant for the initial upload). This was the key
> issue that needed to be solved to bring Git up to par with Subversion
> in terms of code provenance.
>

The reflog is useful, but I don't think is the end of the story. For
instance, what happens if an ASF committer works on a branch, and then
a second ASF committer pushes those changes to master? The reflog
would tell us that the first ASF committer introduced the code into
the repository, but the second ASF committer is the person responsible
for that code eventually going into a release which is (from what I
understand) the more important of the two roles.

At this point who gets blamed if something is wrong?

Where as, if we instead check that the Git commiter (%ce) field is
person who should be cleared by a CLA then we don't have such
questions. And we also don't have to concern ourselves for when the
initial branch was developed outside of the canonical ASF repository
(ie, not in the reflog).

> The topic of this debate then turns more around extra rules for
> enforcing that even non-committers who contribute a "large enough"
> chunk of code should have a CLA on file. The questions then are a)
> should we do this, and if yes, b) what's the best way to do it? Here's
> my thinking:
>
> a) So far I haven't seen a compelling reason of why we should have
> such rules, just vague references on how we can't trust people to do
> the right thing. You asked for a discussion on the "resulting issues
> of not having these checks". What issues are those? Why don't we have
> them already with Subversion? The idea of such checks sounds useful,
> but I'm not convinced that we need them in practice. That said, I'm
> open to being convinced otherwise. Until that, -0.
>

If we don't have such checks and we don't change the existing
policies, then we have to ensure that all committers understand that
patches go through JIRA and must be applied with git am or similar.

There's also the issue that Git patch sets aren't necessarily the same
as SVN patches. Git best practices instruct people to write smaller
patches in larger numbers so we'd also need to provide new rules of
thumb in when a patch set is significantly non-trivial while
reflecting on this issue.

This isn't about not trusting people so much as our current policies
don't allow people to do things they might otherwise assume are
reasonable. For instance, your original motivation in this thread had
to do with Pull Requests on GitHub. With our current policies, these
are forbidden for all but the most trivial of patches (without a good
definition for trivial in Git parlance). The entire motivation for
having them was to enable participation with contributors using
GitHub's model.

> b) Assuming there is consensus on having these checks, I think using
> %ce is reasonable starting point but not an ideal solution. At the
> risk of bikeshedding, here's my alternative proposal:
>
> * I agree with your idea that a good way to approach this would be to
> require extra scrutiny on all changes that come from people who don't
> have a CLA on file. Asking a committer to explicitly "approve" such
> changes (beyond the simple act committing/pushing them) seems like a
> reasonable requirement. I also agree that such approval would be best
> recorded in a commit message (even when doing so may break Git
> history; that's a good incentive on getting the CLA filed).
>
> * I disagree that the %ce field is a good place for such information.
> First, using it for this purpose overwrites existing information
> recorded for a different purpose. Second, the %ce field always gets
> set in a git commit/rebase/cherry-pick/etc. operation, regardless of
> whether such "approval" of the contribution is indeed intended. Third,
> %ce is a pretty low-level field that few people even know about (it
> doesn't show up by default in "git log" and isn't visible on Github).
>
> * IMHO a better place for this information would be a Signed-off-by
> line added to the commit message. The Signed-off-by convention was
> explicitly created for this purpose [1] and is well supported by many
> tools (starting with "git commit -s"). It's also a well-known and
> -documented convention whose purpose practically everyone understands
> (SCO...). Finally, such Signed-off-by lines are by default visible in
> "git log" and on other tools like Github.
>

I'm not a big fan of managing this with the signed-off-by approach.
While it does exist as a "standard" in some projects, it seems like
its just a hack because they didn't assign the same legal weight to
the committer field before it was too late.

I agree that everyone understands it practically, not very many people
use it in my experience. Some of the big projects like Linux and Git
use it due to SCO issues. I think some tools like Gerrit insert it as
well. Other than that I don't see it as a huge standard being picked
up by more normal sized projects.

Technical implementation wise, it's also a bit of a burden as its not
an actual field in the commit message and has to be parsed out by hand
AFAIK. Not insurmountable but its mildly annoying.

I do see your point about accidentally updating commits with various
commands that might not be the intended action. Though I think once
people are going to push things towards the canonical repository then
their intention should be pretty clear that they've checked and intend
these commits for inclusion and have done their duty as required by
the CLA.

> * So, instead of checking the %ce field, my proposal would be a check
> that verifies that either the recorded author or at least one of the
> (possible multiple) signers-off of an incoming git commit has a CLA on
> file. And for this to only require rewriting of commits ("git commit
> --amend -s") when actually needed, the check should be against all the
> recorded CLAs (ideally with possibly multiple email addresses per CLA)
> instead of just the committers of a project. Thus an Apache committer
> would only need to modify an incoming commit if its author doesn't
> already have a CLA on file.
>

Minus the disagreement on the field we check, this is the goal I'm
going for. Anyone with a CLA on file can contribute without requiring
that the patch goes through JIRA. I'm open to revising which fields
are checked, but it still seems most obvious that the committer (%ce
field) is the person directly responsible for having cleared the code
and hence is what should be checked.

> [1] http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=blob;f=Documentation/SubmittingPatches;hb=HEAD#l297
>
> BR,
>
> Jukka Zitting

Re: @apache.org commit address requirement (Was: Git hosting is go)

Posted by Jukka Zitting <ju...@gmail.com>.
Hi,

Perhaps we indeed should try discussing this on a higher level.

I guess we all agree that in the push log we already have a mechanism
that can be used to trace each line of code in a repository back to a
CLA (or a software grant for the initial upload). This was the key
issue that needed to be solved to bring Git up to par with Subversion
in terms of code provenance.

The topic of this debate then turns more around extra rules for
enforcing that even non-committers who contribute a "large enough"
chunk of code should have a CLA on file. The questions then are a)
should we do this, and if yes, b) what's the best way to do it? Here's
my thinking:

a) So far I haven't seen a compelling reason of why we should have
such rules, just vague references on how we can't trust people to do
the right thing. You asked for a discussion on the "resulting issues
of not having these checks". What issues are those? Why don't we have
them already with Subversion? The idea of such checks sounds useful,
but I'm not convinced that we need them in practice. That said, I'm
open to being convinced otherwise. Until that, -0.

b) Assuming there is consensus on having these checks, I think using
%ce is reasonable starting point but not an ideal solution. At the
risk of bikeshedding, here's my alternative proposal:

* I agree with your idea that a good way to approach this would be to
require extra scrutiny on all changes that come from people who don't
have a CLA on file. Asking a committer to explicitly "approve" such
changes (beyond the simple act committing/pushing them) seems like a
reasonable requirement. I also agree that such approval would be best
recorded in a commit message (even when doing so may break Git
history; that's a good incentive on getting the CLA filed).

* I disagree that the %ce field is a good place for such information.
First, using it for this purpose overwrites existing information
recorded for a different purpose. Second, the %ce field always gets
set in a git commit/rebase/cherry-pick/etc. operation, regardless of
whether such "approval" of the contribution is indeed intended. Third,
%ce is a pretty low-level field that few people even know about (it
doesn't show up by default in "git log" and isn't visible on Github).

* IMHO a better place for this information would be a Signed-off-by
line added to the commit message. The Signed-off-by convention was
explicitly created for this purpose [1] and is well supported by many
tools (starting with "git commit -s"). It's also a well-known and
-documented convention whose purpose practically everyone understands
(SCO...). Finally, such Signed-off-by lines are by default visible in
"git log" and on other tools like Github.

* So, instead of checking the %ce field, my proposal would be a check
that verifies that either the recorded author or at least one of the
(possible multiple) signers-off of an incoming git commit has a CLA on
file. And for this to only require rewriting of commits ("git commit
--amend -s") when actually needed, the check should be against all the
recorded CLAs (ideally with possibly multiple email addresses per CLA)
instead of just the committers of a project. Thus an Apache committer
would only need to modify an incoming commit if its author doesn't
already have a CLA on file.

[1] http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=blob;f=Documentation/SubmittingPatches;hb=HEAD#l297

BR,

Jukka Zitting

Re: @apache.org commit address requirement (Was: Git hosting is go)

Posted by Paul Davis <pa...@gmail.com>.
On Mon, Dec 19, 2011 at 3:41 AM, Jukka Zitting <ju...@gmail.com> wrote:
> Hi,
>
> On Mon, Dec 19, 2011 at 5:28 AM, Paul Davis <pa...@gmail.com> wrote:
>> The committer role is intended to be anyone cleared by an ICLA.
>
> Are we talking about the Git, Subversion or ASF definition of a
> "committer" here? Because that's definitely not what "committer" means
> in Git. The word is heavily overloaded:
>
> * "committer" in Git = someone with a local clone of a Git repository
> * "committer" in Subversion = someone with write access to a repository
> * "committer" in ASF = someone with an ICLA on file and write access
> to a repository
>
> We're confusing the "committer" role in Git with that we use at the
> ASF, even though they are completely separate. Trying to force them to
> be the same is IMHO just creating problems.
>
> A "committer" in Git is more like "contributor" in traditional ASF
> vocabulary. Thus to me the %ce field in a Git commit message reads
> more like "contributor email" than "committer email" in the ASF sense.
>

I think this overloading is the root of the difference in our
interpretation of what's going on here.

I tried to point out that it's the author %ae field that maps to your
idea of "contributor". This is the field that will be correctly
transferred through mediums like email and JIRA correctly. Committer
(%ce) does not which I think invalidates from being used as "the
person who contributed this code".

At this point the %ae field is basically tooling support for the
"Thanks to so-and-so for the patch!" in our commit messages.

This leaves us with the question, "What is the %ce field?" As I
understand you're argument, you're not making a distinction between
%ae and %ce thus it has no special handling.

OTOH, I see it as a field to be used (in a way new to the ASF because
its never existed at the ASF) to ensure that larger contributions are
coming from people with an ICLA. This is *specifically* to allow
people to use external tooling in collaboration with people that are
not "ASF committers" without apprehension on when a patch set becomes
large enough that it requires an ICLA. Or slightly differently, if we
*don't* use this field, then we have to go through and define and
communicate a number of new policies on when a patch set is too large
a contribution without an ICLA.

>> This is more broad than the current ASF committer definition in that it's
>> anyone registered with the foundation as being covered by a CLA of any
>> sort. A large part of the implementation for the infrastructure around
>> this role has not yet been written. In the future there is a general
>> vision of accepting digital ICLA's like Google or other projects.
>
> "more broad than the current", "not yet been written", "in the future"
>
> Sounds like we're writing new ASF policy here.
>

Exactly my point! This separation of roles is brand new and we need to
address the issue. The current implementation was based off the prior
art at the Eclipse Foundation as well as an attempt at matching roles
back to historical policies at the ASF.

>> The third role, pusher, is what is most directly analogous to our
>> current notion of committer at the ASF. This role for the most part
>> would be unchanged. To push code to the ASF repository for Apache Foo
>> you must be an "ASF committer" to that project.
>
> Exactly.
>
>> The author and committer roles are managed and tracked by Git. The
>> role of pusher is OOB from Git and is enforced by our LDAP
>> authentication via HTTPd when the push request is initiated. Each push
>> request is also logged and is the final gate keeper on who is
>> responsible for checking the legal aspects of code contributions.
>
> Exactly.
>
> So why do we need server-side checks on the %ce field in commit messages?
>
> BR,
>
> Jukka Zitting

What we need is a well defined set of policies and supporting
infrastructure for managing our IP and provenance guarantees that
we've historically had with SVN. The current rough scheme is that
trivial contributions aren't an issue, non trivial contributions
require being posted to JIRA, and large contributions require an ICLA.

Unless we change those policies, this entire conversation is moot. If
we start from the point of view of those policies, then the notion of
having people register an ICLA so that we can pull their commits
directly from GitHub without requiring that they go through JIRA makes
a bit more sense.

On the other hand, if we remove all of these checks, and encourage
people to interact with code that was provided as a Pull Request on
GitHub, then someone needs to sit down and write out guidelines for
when we need to have an ICLA on file vs not. What constitutes a
trivial patch (considering in Git that patches tend to be smaller with
a larger number of commits). To date, no one has done this.

Removing these safe guards and just trusting committers to figure it
out during incubation worries me because I don't even know the answers
to these questions. Where as if we have the simple check that the %ce
field has an ICLA on file then these issues disappear. Our current
policies remain valid and people can go do whatever they want on
GitHub.

Perhaps I'm being overly pessimistic on what we can reasonably assume
from letting people just figure this stuff out on their own. All I can
say is that this general framework for  using Git at the ASF that I've
been operating under with the infra group. If it turns out that this
is overly draconian and the consensus is to remove these safeguards
then that's fine. But I would reiterate my desire to have a discussion
on the resulting issues of not having these checks and how those are
going to be addressed.

Re: @apache.org commit address requirement (Was: Git hosting is go)

Posted by Jukka Zitting <ju...@gmail.com>.
Hi,

On Mon, Dec 19, 2011 at 5:28 AM, Paul Davis <pa...@gmail.com> wrote:
> The committer role is intended to be anyone cleared by an ICLA.

Are we talking about the Git, Subversion or ASF definition of a
"committer" here? Because that's definitely not what "committer" means
in Git. The word is heavily overloaded:

* "committer" in Git = someone with a local clone of a Git repository
* "committer" in Subversion = someone with write access to a repository
* "committer" in ASF = someone with an ICLA on file and write access
to a repository

We're confusing the "committer" role in Git with that we use at the
ASF, even though they are completely separate. Trying to force them to
be the same is IMHO just creating problems.

A "committer" in Git is more like "contributor" in traditional ASF
vocabulary. Thus to me the %ce field in a Git commit message reads
more like "contributor email" than "committer email" in the ASF sense.

> This is more broad than the current ASF committer definition in that it's
> anyone registered with the foundation as being covered by a CLA of any
> sort. A large part of the implementation for the infrastructure around
> this role has not yet been written. In the future there is a general
> vision of accepting digital ICLA's like Google or other projects.

"more broad than the current", "not yet been written", "in the future"

Sounds like we're writing new ASF policy here.

> The third role, pusher, is what is most directly analogous to our
> current notion of committer at the ASF. This role for the most part
> would be unchanged. To push code to the ASF repository for Apache Foo
> you must be an "ASF committer" to that project.

Exactly.

> The author and committer roles are managed and tracked by Git. The
> role of pusher is OOB from Git and is enforced by our LDAP
> authentication via HTTPd when the push request is initiated. Each push
> request is also logged and is the final gate keeper on who is
> responsible for checking the legal aspects of code contributions.

Exactly.

So why do we need server-side checks on the %ce field in commit messages?

BR,

Jukka Zitting

Re: @apache.org commit address requirement (Was: Git hosting is go)

Posted by Paul Davis <pa...@gmail.com>.
On Sun, Dec 18, 2011 at 9:30 PM, Paul Davis <pa...@gmail.com> wrote:
> On Sun, Dec 18, 2011 at 1:40 AM, Roy T. Fielding <fi...@gbiv.com> wrote:
>> On Dec 15, 2011, at 8:05 AM, Paul Davis wrote:
>>
>>> On Thu, Dec 15, 2011 at 5:39 AM, Jukka Zitting <ju...@gmail.com> wrote:
>>>> Hi,
>>>>
>>>> On Thu, Dec 15, 2011 at 1:53 AM, Paul Davis <pa...@gmail.com> wrote:
>>>>> On the other hand, I do worry about people that aren't completely
>>>>> familiar with Git will end up shooting themselves in the foot. And
>>>>> while we do have policies in place to handle such mess ups, why not
>>>>> spend a bit of time developing a system that automatically prevents a
>>>>> large class of accidents?
>>>>
>>>> The problem here is that such an automatic prevention system is
>>>> interfering with valid workflows (i.e. pull requests) of projects like
>>>> Cordova/Callback that are already familiar with Git.
>>>>
>>>
>>> The issue here is that I'm uncomfortable *not* interfering with these
>>> workflows. Especially for new projects that  might not have
>>> internalized the various requirements for pushing code. Especially for
>>> projects that have spent a good deal of time operating in mode that is
>>> quite specifically not at the same rigorous standards of what it takes
>>> to push code into an Apache project. There absolutely should be a step
>>> where committers are forced to think, "Is this code acceptable for
>>> inclusion?"
>>>
>>> I'm more than willing to change this step so that we use ldap or even
>>> just a text file listing the email address of anyone that is covered
>>> by an ICLA or CCLA. But just removing it and hoping that new
>>> committers grok what it means to vet provenance and maintain the legal
>>> standing of their code base doesn't strike me as a workable solution.
>>> This way new projects have zero questions on what can be included when
>>> they go to commit.
>>
>> Umm, except that check is wrong.  We don't require an iCLA from everyone
>> who contributes to an Apache project.  We only require iCLAs for people
>> who submit major changes (like whole new feature sets) or if they want
>> an account on apache.org (pusher).  The folks who push the code changes
>> to Apache are responsible for ensuring that we have the rights when
>> they push third-party code.
>>
>> I agree with Jukka -- the infrastructure for git should not add
>> restrictions that we have never needed for svn.
>>
>> ....Roy
>>
>
> This doesn't require every commit to be authored by someone with an
> ICLA, it merely requires all commits are committed by someone with an
> ICLA. And the general hope is that this would be extended to be anyone
> that has an ICLA as opposed to a committer on the current project
> which is less restrictive than SVN.
>
> Its also important to note that we're not the only ones with this
> check. The Eclipse project has the same requirement [1].
>
> To say that we're adding a restriction to Git that SVN never needed is
> like saying that cars don't need steering wheels because trains never
> needed one. The distinction between author, committer, and pusher is
> an entirely new concept in Git that doesn't exist in SVN.
>
> Alternatively, the current standard operating procedure is for non
> trivial patches to pass through JIRA to indicate the contribution is
> covered by the ASL 2.0. If we assume that this requirement is still in
> place then the entire check is reduced to a relatively mundane
> assurance that clients are configured properly.
>
> I'm not against looking for alternatives to this check or even
> deleting it entirely, but it worries me a bit that the justification
> for change is allowing GitHub Pull Requests or that its not something
> we had to consider for SVN.
>
> [1] http://wiki.eclipse.org/Git#IP_process_implications_of_DVCS

After chatting with Joe and Gavin on IRC they suggested I should write
down a clearer explanation on the roles and their expected uses.

The basic issue here is that Git introduces tracking for three
distinct roles in adding code to a shared repository: author,
committer, and pusher. It is important to note that the committer role
here is not equivalent to our traditional definition with SVN. The
analogy between the SVN role (which combines all three roles) is most
like the pusher role in Git.

In terms of responsibility for tracking provenance, the author role is
relatively unimportant. For most people it will be a "My name in
print!" type of importance. We'll know who wrote it, but not who to
blame if something is included accidentally.

The committer role is intended to be anyone cleared by an ICLA. This
is more broad than the current ASF committer definition in that it's
anyone registered with the foundation as being covered by a CLA of any
sort. A large part of the implementation for the infrastructure around
this role has not yet been written. In the future there is a general
vision of accepting digital ICLA's like Google or other projects. For
the time being the current implementation limits this field to
committers to Apache Foo because it was the immediately available
option. Talking with Sam Ruby there are a couple places to gather more
email addresses to open this up to a much more generally field but
would still be limited to a finite number of known email addresses.

The third role, pusher, is what is most directly analogous to our
current notion of committer at the ASF. This role for the most part
would be unchanged. To push code to the ASF repository for Apache Foo
you must be an "ASF committer" to that project.

The author and committer roles are managed and tracked by Git. The
role of pusher is OOB from Git and is enforced by our LDAP
authentication via HTTPd when the push request is initiated. Each push
request is also logged and is the final gate keeper on who is
responsible for checking the legal aspects of code contributions.

There are a couple cases here that we can consider in terms of this.

Consider a user Gavin not covered by a CLA that makes a trivial
contribution. Getting that contribution to the central ASF repo
involves a few basic steps: I retrieve it, apply it locally, then push
to the central repo. Depending on the sequence of commands I use, this
may or may not be rejected given the proposed scheme. Basically
depending on my "apply locally" command, it may or may not rewrite the
committer field to be me instead of him. If I use git cherry-pick all
is well. There are a few ways to not rewrite the committer field, but
seeing as a simple command exists to make things work I don't see this
as an issue. Although, my personal rule of thumb is "If I'm not
willing to just typ it while looking at the patch in a web browser, it
requires being posted to JIRA."

The second case that falls into the scope of "different from SVN" is
when a patch is non trivial yet doesn't meet the requirements of an
ICLA. In this case the SOP is still to pass the patch through JIRA.
Which then requires me to pull that patch and apply locally. If I use
git commands to apply the patch it will maintain the author role while
listing me as the committer. Thus this is met by our current checks.

If a contribution is large enough that it requires coverage of a CLA
then people will have been registered and these contributions can be
managed by tools external to the ASF infrastructure (ie, GitHub) and
we can be reasonably sure that accidents won't happen very often
because we have the infrastructure in place to prevent such accidents.

Hopefully, that more concretely describes the issues at hand and the
general ideas that were based on reading guidelines from other groups
like Eclipse that support Git with non-technical regulations like we
have at the ASF.

Re: @apache.org commit address requirement (Was: Git hosting is go)

Posted by Paul Davis <pa...@gmail.com>.
On Sun, Dec 18, 2011 at 1:40 AM, Roy T. Fielding <fi...@gbiv.com> wrote:
> On Dec 15, 2011, at 8:05 AM, Paul Davis wrote:
>
>> On Thu, Dec 15, 2011 at 5:39 AM, Jukka Zitting <ju...@gmail.com> wrote:
>>> Hi,
>>>
>>> On Thu, Dec 15, 2011 at 1:53 AM, Paul Davis <pa...@gmail.com> wrote:
>>>> On the other hand, I do worry about people that aren't completely
>>>> familiar with Git will end up shooting themselves in the foot. And
>>>> while we do have policies in place to handle such mess ups, why not
>>>> spend a bit of time developing a system that automatically prevents a
>>>> large class of accidents?
>>>
>>> The problem here is that such an automatic prevention system is
>>> interfering with valid workflows (i.e. pull requests) of projects like
>>> Cordova/Callback that are already familiar with Git.
>>>
>>
>> The issue here is that I'm uncomfortable *not* interfering with these
>> workflows. Especially for new projects that  might not have
>> internalized the various requirements for pushing code. Especially for
>> projects that have spent a good deal of time operating in mode that is
>> quite specifically not at the same rigorous standards of what it takes
>> to push code into an Apache project. There absolutely should be a step
>> where committers are forced to think, "Is this code acceptable for
>> inclusion?"
>>
>> I'm more than willing to change this step so that we use ldap or even
>> just a text file listing the email address of anyone that is covered
>> by an ICLA or CCLA. But just removing it and hoping that new
>> committers grok what it means to vet provenance and maintain the legal
>> standing of their code base doesn't strike me as a workable solution.
>> This way new projects have zero questions on what can be included when
>> they go to commit.
>
> Umm, except that check is wrong.  We don't require an iCLA from everyone
> who contributes to an Apache project.  We only require iCLAs for people
> who submit major changes (like whole new feature sets) or if they want
> an account on apache.org (pusher).  The folks who push the code changes
> to Apache are responsible for ensuring that we have the rights when
> they push third-party code.
>
> I agree with Jukka -- the infrastructure for git should not add
> restrictions that we have never needed for svn.
>
> ....Roy
>

This doesn't require every commit to be authored by someone with an
ICLA, it merely requires all commits are committed by someone with an
ICLA. And the general hope is that this would be extended to be anyone
that has an ICLA as opposed to a committer on the current project
which is less restrictive than SVN.

Its also important to note that we're not the only ones with this
check. The Eclipse project has the same requirement [1].

To say that we're adding a restriction to Git that SVN never needed is
like saying that cars don't need steering wheels because trains never
needed one. The distinction between author, committer, and pusher is
an entirely new concept in Git that doesn't exist in SVN.

Alternatively, the current standard operating procedure is for non
trivial patches to pass through JIRA to indicate the contribution is
covered by the ASL 2.0. If we assume that this requirement is still in
place then the entire check is reduced to a relatively mundane
assurance that clients are configured properly.

I'm not against looking for alternatives to this check or even
deleting it entirely, but it worries me a bit that the justification
for change is allowing GitHub Pull Requests or that its not something
we had to consider for SVN.

[1] http://wiki.eclipse.org/Git#IP_process_implications_of_DVCS

Re: @apache.org commit address requirement (Was: Git hosting is go)

Posted by "Roy T. Fielding" <fi...@gbiv.com>.
On Dec 15, 2011, at 8:05 AM, Paul Davis wrote:

> On Thu, Dec 15, 2011 at 5:39 AM, Jukka Zitting <ju...@gmail.com> wrote:
>> Hi,
>> 
>> On Thu, Dec 15, 2011 at 1:53 AM, Paul Davis <pa...@gmail.com> wrote:
>>> On the other hand, I do worry about people that aren't completely
>>> familiar with Git will end up shooting themselves in the foot. And
>>> while we do have policies in place to handle such mess ups, why not
>>> spend a bit of time developing a system that automatically prevents a
>>> large class of accidents?
>> 
>> The problem here is that such an automatic prevention system is
>> interfering with valid workflows (i.e. pull requests) of projects like
>> Cordova/Callback that are already familiar with Git.
>> 
> 
> The issue here is that I'm uncomfortable *not* interfering with these
> workflows. Especially for new projects that  might not have
> internalized the various requirements for pushing code. Especially for
> projects that have spent a good deal of time operating in mode that is
> quite specifically not at the same rigorous standards of what it takes
> to push code into an Apache project. There absolutely should be a step
> where committers are forced to think, "Is this code acceptable for
> inclusion?"
> 
> I'm more than willing to change this step so that we use ldap or even
> just a text file listing the email address of anyone that is covered
> by an ICLA or CCLA. But just removing it and hoping that new
> committers grok what it means to vet provenance and maintain the legal
> standing of their code base doesn't strike me as a workable solution.
> This way new projects have zero questions on what can be included when
> they go to commit.

Umm, except that check is wrong.  We don't require an iCLA from everyone
who contributes to an Apache project.  We only require iCLAs for people
who submit major changes (like whole new feature sets) or if they want
an account on apache.org (pusher).  The folks who push the code changes
to Apache are responsible for ensuring that we have the rights when
they push third-party code.

I agree with Jukka -- the infrastructure for git should not add
restrictions that we have never needed for svn.

....Roy


Re: @apache.org commit address requirement (Was: Git hosting is go)

Posted by Jukka Zitting <ju...@gmail.com>.
Hi,

On Thu, Dec 15, 2011 at 5:05 PM, Paul Davis <pa...@gmail.com> wrote:
> The issue here is that I'm uncomfortable *not* interfering with these
> workflows. Especially for new projects that  might not have
> internalized the various requirements for pushing code. Especially for
> projects that have spent a good deal of time operating in mode that is
> quite specifically not at the same rigorous standards of what it takes
> to push code into an Apache project. There absolutely should be a step
> where committers are forced to think, "Is this code acceptable for
> inclusion?"

I see the point, but isn't that exactly what the Incubator is here for.

In Cordova (Callback) the committers are already figuring out these
rules even when their code is still on Github with no such
restrictions; see for example
http://markmail.org/message/ro7mkcdr6j7awhm2.

> I'm more than willing to change this step so that we use ldap or even
> just a text file listing the email address of anyone that is covered
> by an ICLA or CCLA. But just removing it and hoping that new
> committers grok what it means to vet provenance and maintain the legal
> standing of their code base doesn't strike me as a workable solution.
> This way new projects have zero questions on what can be included when
> they go to commit.

We don't have such a mechanism for svn. Why would we need one for git?

BR,

Jukka Zitting

Re: @apache.org commit address requirement (Was: Git hosting is go)

Posted by Paul Davis <pa...@gmail.com>.
On Thu, Dec 15, 2011 at 5:39 AM, Jukka Zitting <ju...@gmail.com> wrote:
> Hi,
>
> On Thu, Dec 15, 2011 at 1:53 AM, Paul Davis <pa...@gmail.com> wrote:
>> On the other hand, I do worry about people that aren't completely
>> familiar with Git will end up shooting themselves in the foot. And
>> while we do have policies in place to handle such mess ups, why not
>> spend a bit of time developing a system that automatically prevents a
>> large class of accidents?
>
> The problem here is that such an automatic prevention system is
> interfering with valid workflows (i.e. pull requests) of projects like
> Cordova/Callback that are already familiar with Git.
>

The issue here is that I'm uncomfortable *not* interfering with these
workflows. Especially for new projects that  might not have
internalized the various requirements for pushing code. Especially for
projects that have spent a good deal of time operating in mode that is
quite specifically not at the same rigorous standards of what it takes
to push code into an Apache project. There absolutely should be a step
where committers are forced to think, "Is this code acceptable for
inclusion?"

I'm more than willing to change this step so that we use ldap or even
just a text file listing the email address of anyone that is covered
by an ICLA or CCLA. But just removing it and hoping that new
committers grok what it means to vet provenance and maintain the legal
standing of their code base doesn't strike me as a workable solution.
This way new projects have zero questions on what can be included when
they go to commit.

> That's why I'd rather have this as an optional feature that could be
> turned off by projects that would rather use social means to prevent
> (and recover from) accidents.
>
> BR,
>
> Jukka Zitting

Re: @apache.org commit address requirement (Was: Git hosting is go)

Posted by Jukka Zitting <ju...@gmail.com>.
Hi,

On Thu, Dec 15, 2011 at 1:53 AM, Paul Davis <pa...@gmail.com> wrote:
> On the other hand, I do worry about people that aren't completely
> familiar with Git will end up shooting themselves in the foot. And
> while we do have policies in place to handle such mess ups, why not
> spend a bit of time developing a system that automatically prevents a
> large class of accidents?

The problem here is that such an automatic prevention system is
interfering with valid workflows (i.e. pull requests) of projects like
Cordova/Callback that are already familiar with Git.

That's why I'd rather have this as an optional feature that could be
turned off by projects that would rather use social means to prevent
(and recover from) accidents.

BR,

Jukka Zitting

Re: @apache.org commit address requirement (Was: Git hosting is go)

Posted by Paul Davis <pa...@gmail.com>.
> Right. With Git the one who made the commit could well be someone
> who's not an official committer. (Yes, the traditional terminology is
> breaking down here...)

Heh, yeah. It gets a bit weird when there are three or more roles
involved in what used to be one. Though the issue here is that I've
been running under the assumption that we should be very specific in
checking who made the commit and verified that the committer is in
fact ok to be committing.

Granted, this is a bit at odds with the role of the person pushing the
commit to the official repository. In terms of SVN, the person sending
the push is most analogous to the SVN committer. In which case, for
provenance we'd be just as (historically speaking) thorough in
tracking only this.

On the other hand, the fact that we have a fairly easy method to track
that every commit was committed and/or authored by someone that has an
ICLA/CCLA on file, we could be even more precise in our tracking of
commits.

As always, the first and foremost barrier to managing our repositories
lies in the hands of each person with authorization to push code to
the canonical repository (traditionally referred to as a committer).
On the other hand, I do worry about people that aren't completely
familiar with Git will end up shooting themselves in the foot. And
while we do have policies in place to handle such mess ups, why not
spend a bit of time developing a system that automatically prevents a
large class of accidents?

Re: @apache.org commit address requirement (Was: Git hosting is go)

Posted by Jukka Zitting <ju...@gmail.com>.
Hi,

On Thu, Dec 15, 2011 at 10:23 AM, Simon Pepping <sa...@gmail.com> wrote:
> With a workflow in which one pushes commits unmodified to
> the a.o repo, what is the meaning of the Committer entry?
> It says that someone committed it to some repo and nobody
> rewrote it since then.

Exactly.

> Either it should be rewritten when a commit is committed to
> the a.o repo, or it should be considered as meaningless.

Why? In the ref-updates.log we already have a separate (more reliable)
record of who pushed each commit to the canonical a.o repository.

Consider for example a case of someone submitting a patch to Jira. If
then some other contributor (not a committer of the project) takes
that patch, git commits it, and comments on Jira: "I have tested and
committed the above patch. Please pull the change as commit X from my
repository at Y".

In such a case the %ce field would contain valid information that's
distinct from the %ae field and that would need to be overwritten with
the current pre_receive checks. And should someone then take that
commit X and submit some further improvements based on it, they'd then
need to explicitly rebase their changes if the original commit X gets
rewritten.

So the way I see it, the requirement to overwrite the %ce field of
incoming commits has both social and technical drawbacks. As said in
the other message, it's fine if a project thinks these drawbacks are
acceptable for the benefits mentioned by Paul, but I wouldn't mandate
them on all projects.

BR,

Jukka Zitting

Re: @apache.org commit address requirement (Was: Git hosting is go)

Posted by Simon Pepping <sa...@gmail.com>.
With a workflow in which one pushes commits unmodified to the a.o repo,
what is the meaning of the Committer entry? It says that someone committed
it to some repo and nobody rewrote it since then. Either it should be
rewritten when a commit is committed to the a.o repo, or it should be
considered as meaningless. In the latter case, we introduce the role of
Pusher, which has no place in git's records. This discussion tries to find
such a place; is that useful?

Simon

On Thu, Dec 15, 2011 at 01:23, Jukka Zitting <ju...@gmail.com>wrote:


> The problem I see with this is that it prevents commits made by
> non-committers (anyone can commit with Git) to be pushed without these
> commits being rebased or otherwise mangled.
>
> IMHO we shouldn't be putting restrictions on who can commit (in the
> "git commit" sense) to a repository. Instead the restriction (and
> related audit trails) should be on who can make those commits a part
> of the canonical repository on git-wip-us (i.e. "git push"). That's
> more in line with the distributed nature of Git.
>
> > I'm not sure I follow. The %ae field should always be the person that
> > wrote the commit. The %ce is who made the commit.
>
> Right. With Git the one who made the commit could well be someone
> who's not an official committer. (Yes, the traditional terminology is
> breaking down here...)
>
>

Re: @apache.org commit address requirement (Was: Git hosting is go)

Posted by Jukka Zitting <ju...@gmail.com>.
Hi,

On Wed, Dec 14, 2011 at 11:24 PM, Paul Davis
<pa...@gmail.com> wrote:
> If I read your case right, then the confusion would be when 8de5e722ea
> already exists in the repository.

Right. The current setup doesn't really give us an easy and
straightforward record of who pushed each commit to the canonical
repository. Anyway, as said, this information can be deduced from the
current records, so it's not really a fundamental issue. It just
seemed to me like a possible reason for the restrictions on the %ce
field.

> I don't think this is the same thing. The ref-updates.log contains who
> updated what refs during a push. This is separate from who made the
> commit. Ie, the point of checking %ce in the commit is to check that a
> commit was entered by a project committer but this importantly not the
> same as who's making the push. Ie, this allows a committer to push
> updates that include commits by other committers.

The problem I see with this is that it prevents commits made by
non-committers (anyone can commit with Git) to be pushed without these
commits being rebased or otherwise mangled.

IMHO we shouldn't be putting restrictions on who can commit (in the
"git commit" sense) to a repository. Instead the restriction (and
related audit trails) should be on who can make those commits a part
of the canonical repository on git-wip-us (i.e. "git push"). That's
more in line with the distributed nature of Git.

> I'm not sure I follow. The %ae field should always be the person that
> wrote the commit. The %ce is who made the commit.

Right. With Git the one who made the commit could well be someone
who's not an official committer. (Yes, the traditional terminology is
breaking down here...)

BR,

Jukka Zitting

Re: @apache.org commit address requirement (Was: Git hosting is go)

Posted by Paul Davis <pa...@gmail.com>.
On Wed, Dec 14, 2011 at 4:24 PM, Paul Davis <pa...@gmail.com> wrote:
> On Wed, Dec 14, 2011 at 1:15 PM, Jukka Zitting <ju...@gmail.com> wrote:
>> Hi,
>>
>> On Wed, Dec 14, 2011 at 6:04 PM, Paul Davis <pa...@gmail.com> wrote:
>>> The ref updates are already logged with the change, the ldap
>>> authorized id, time and ip of who pushed so that should be sufficient
>>> there.
>>
>> That information is enough to determine who pushed each commit, but it
>> could be rather tricky. For example, consider a case where a branch
>> starts at commit X and then someone pushes it to X -> Y -> Z. This
>> would be recorded as a change from X to Z for that branch. Now if
>> someone else creates a new branch at Y and makes a push, from the
>> ref-updates.log it'll look like that's when the commit entered the
>> repository.
>>
>
> To be more concrete here, an actual ref-update.log entry looks like this:
>
> [Wed Dec 14 05:55:16 2011] refs/heads/master 0658d982e4 -> 2d90a1249d
> randall@http.98.210.155.124
>
> And a branch creation looks like this:
>
> [Sun Nov 27 01:08:05 2011] refs/heads/COUCHDB-431 0000000000 ->
> 8de5e722ea randall@http.98.210.155.124
>
> If I read your case right, then the confusion would be when 8de5e722ea
> already exists in the repository.
>
> To say that this makes it appear that 8de5e722ea didn't exist in the
> repository I think is juts confusion on what we're logging. These are
> just ref updates which contain all the required information to rebuild
> when commits entered the repository.
>
> I do agree though that mapping back to when things were uploaded would
> take more effort on the analysis side but the trade off is that the
> server hooks get a bit more complicated. In this particular case I'd
> lean towards keeping the server side as simple and robust as possible
> and then if anyone needs to do the analysis of what commits were
> pushed at any given point they can do the (somewhat less trivial) log
> analysis.
>
>> Cases like that could be resolved by looking at the full commit graph
>> and the timestamps in ref-updates.log, but it would be much easier if
>> each push simply recorded the hash of each new commit that got
>> uploaded by that user.
>>
>
> I'm not sure I agree with the characterization of much here. It'd be a
> relatively simple script to read lines from this and check for commits
> that didn't already exist in the repository. On the other hand, moving
> to the log entry per commit means that we have to play around with how
> we log things a bit. The obvious way is one line per commit sha, but
> then we're losing the relation of the update. Or we could log the
> commits and then a line for the update, but then the log has multiple
> times of records.
>

It occurs to me that if we did do the id.apache.org integration,
having a second log that mapped Apache id's to the commits would be
helpful to combat email turn over and such things.

>> Such a log of pushed commits would also contain the information that I
>> believe the current intention is to have in the %ce entries.
>>
>
> I don't think this is the same thing. The ref-updates.log contains who
> updated what refs during a push. This is separate from who made the
> commit. Ie, the point of checking %ce in the commit is to check that a
> commit was entered by a project committer but this importantly not the
> same as who's making the push. Ie, this allows a committer to push
> updates that include commits by other committers.
>
>>> As to the committer email checks, I was a bit uncertain when
>>> implementing this part. One the one hand I wanted to be very specific
>>> that every commit was committed by someone acceptable for the project
>>> but this has the already pointed out draw back that it requires
>>> committers to rebase commits after reviewing them which doesn't work
>>> so hot with some peoples' work flow.
>>
>> Right. That's why I'd like to drop that check, or at least make it
>> configurable per repository.
>>
>> There's also the problem that overriding the %ce field in a rebase or
>> an explicit filter-branch operation could in some cases destroy useful
>> information (distinction between who wrote the change and who made the
>> original git commit), which wouldn't be a problem for the separate
>> commit log or the Signed-Off-By entries I mentioned.
>>
>
> I'm not sure I follow. The %ae field should always be the person that
> wrote the commit. The %ce is who made the commit. As to Signed-Off-By,
> I'm purposefully avoiding any use of that in the hooks and leaving
> it's use open to individual projects to use as they see fit.
>
>>> In a perfect world what I'd really like to see here is the ldap
>>> self-service have email fields and an ICLA/CCLA check boolean so that
>>> we could just check that the email address is connected to a CLA of
>>> some sort. I'm quite hesitant to remove all checks from this part
>>> because it's quite easy for those fields to be garbage when someone
>>> forgets to set their name or if someone accidentally pulls unreviewed
>>> code in a fat finger branch update.
>>
>> I'd treat those as social problems that are best solved by social
>> means. Introducing technical barriers will just force people to use
>> awkward workarounds when things like pull requests don't work like
>> they'd expect. After all, there are plenty of cases where people have
>> accidentally committed extra files or other garbage with Subversion. I
>> don't see much difference here.
>>
>> I wouldn't object to having the %ce check available as an option for
>> projects that want it or even enabled by default, but IMHO it
>> shouldn't be mandatory for all projects.
>>
>> BR,
>>
>> Jukka Zitting
>
> I would argue quite the opposite. Specifically because of things like
> pull requests we should be going the extra mile here to try and
> prevent random garbage from entering the repository. I do understand
> that the %ce check is a bit clunky, but I can already say that it's
> prevented commits into CouchDB that would be at best an eyesore and at
> worst a provenance problem.
>
> I don't think the way to make this better is to throw away the check,
> but to make the check better by integrating with id.apache.org and
> setting it up so that people can submit ICLA/CCLA's online.

Re: @apache.org commit address requirement (Was: Git hosting is go)

Posted by Paul Davis <pa...@gmail.com>.
On Wed, Dec 14, 2011 at 1:15 PM, Jukka Zitting <ju...@gmail.com> wrote:
> Hi,
>
> On Wed, Dec 14, 2011 at 6:04 PM, Paul Davis <pa...@gmail.com> wrote:
>> The ref updates are already logged with the change, the ldap
>> authorized id, time and ip of who pushed so that should be sufficient
>> there.
>
> That information is enough to determine who pushed each commit, but it
> could be rather tricky. For example, consider a case where a branch
> starts at commit X and then someone pushes it to X -> Y -> Z. This
> would be recorded as a change from X to Z for that branch. Now if
> someone else creates a new branch at Y and makes a push, from the
> ref-updates.log it'll look like that's when the commit entered the
> repository.
>

To be more concrete here, an actual ref-update.log entry looks like this:

[Wed Dec 14 05:55:16 2011] refs/heads/master 0658d982e4 -> 2d90a1249d
randall@http.98.210.155.124

And a branch creation looks like this:

[Sun Nov 27 01:08:05 2011] refs/heads/COUCHDB-431 0000000000 ->
8de5e722ea randall@http.98.210.155.124

If I read your case right, then the confusion would be when 8de5e722ea
already exists in the repository.

To say that this makes it appear that 8de5e722ea didn't exist in the
repository I think is juts confusion on what we're logging. These are
just ref updates which contain all the required information to rebuild
when commits entered the repository.

I do agree though that mapping back to when things were uploaded would
take more effort on the analysis side but the trade off is that the
server hooks get a bit more complicated. In this particular case I'd
lean towards keeping the server side as simple and robust as possible
and then if anyone needs to do the analysis of what commits were
pushed at any given point they can do the (somewhat less trivial) log
analysis.

> Cases like that could be resolved by looking at the full commit graph
> and the timestamps in ref-updates.log, but it would be much easier if
> each push simply recorded the hash of each new commit that got
> uploaded by that user.
>

I'm not sure I agree with the characterization of much here. It'd be a
relatively simple script to read lines from this and check for commits
that didn't already exist in the repository. On the other hand, moving
to the log entry per commit means that we have to play around with how
we log things a bit. The obvious way is one line per commit sha, but
then we're losing the relation of the update. Or we could log the
commits and then a line for the update, but then the log has multiple
times of records.

> Such a log of pushed commits would also contain the information that I
> believe the current intention is to have in the %ce entries.
>

I don't think this is the same thing. The ref-updates.log contains who
updated what refs during a push. This is separate from who made the
commit. Ie, the point of checking %ce in the commit is to check that a
commit was entered by a project committer but this importantly not the
same as who's making the push. Ie, this allows a committer to push
updates that include commits by other committers.

>> As to the committer email checks, I was a bit uncertain when
>> implementing this part. One the one hand I wanted to be very specific
>> that every commit was committed by someone acceptable for the project
>> but this has the already pointed out draw back that it requires
>> committers to rebase commits after reviewing them which doesn't work
>> so hot with some peoples' work flow.
>
> Right. That's why I'd like to drop that check, or at least make it
> configurable per repository.
>
> There's also the problem that overriding the %ce field in a rebase or
> an explicit filter-branch operation could in some cases destroy useful
> information (distinction between who wrote the change and who made the
> original git commit), which wouldn't be a problem for the separate
> commit log or the Signed-Off-By entries I mentioned.
>

I'm not sure I follow. The %ae field should always be the person that
wrote the commit. The %ce is who made the commit. As to Signed-Off-By,
I'm purposefully avoiding any use of that in the hooks and leaving
it's use open to individual projects to use as they see fit.

>> In a perfect world what I'd really like to see here is the ldap
>> self-service have email fields and an ICLA/CCLA check boolean so that
>> we could just check that the email address is connected to a CLA of
>> some sort. I'm quite hesitant to remove all checks from this part
>> because it's quite easy for those fields to be garbage when someone
>> forgets to set their name or if someone accidentally pulls unreviewed
>> code in a fat finger branch update.
>
> I'd treat those as social problems that are best solved by social
> means. Introducing technical barriers will just force people to use
> awkward workarounds when things like pull requests don't work like
> they'd expect. After all, there are plenty of cases where people have
> accidentally committed extra files or other garbage with Subversion. I
> don't see much difference here.
>
> I wouldn't object to having the %ce check available as an option for
> projects that want it or even enabled by default, but IMHO it
> shouldn't be mandatory for all projects.
>
> BR,
>
> Jukka Zitting

I would argue quite the opposite. Specifically because of things like
pull requests we should be going the extra mile here to try and
prevent random garbage from entering the repository. I do understand
that the %ce check is a bit clunky, but I can already say that it's
prevented commits into CouchDB that would be at best an eyesore and at
worst a provenance problem.

I don't think the way to make this better is to throw away the check,
but to make the check better by integrating with id.apache.org and
setting it up so that people can submit ICLA/CCLA's online.

Re: @apache.org commit address requirement (Was: Git hosting is go)

Posted by Jukka Zitting <ju...@gmail.com>.
Hi,

On Wed, Dec 14, 2011 at 6:04 PM, Paul Davis <pa...@gmail.com> wrote:
> The ref updates are already logged with the change, the ldap
> authorized id, time and ip of who pushed so that should be sufficient
> there.

That information is enough to determine who pushed each commit, but it
could be rather tricky. For example, consider a case where a branch
starts at commit X and then someone pushes it to X -> Y -> Z. This
would be recorded as a change from X to Z for that branch. Now if
someone else creates a new branch at Y and makes a push, from the
ref-updates.log it'll look like that's when the commit entered the
repository.

Cases like that could be resolved by looking at the full commit graph
and the timestamps in ref-updates.log, but it would be much easier if
each push simply recorded the hash of each new commit that got
uploaded by that user.

Such a log of pushed commits would also contain the information that I
believe the current intention is to have in the %ce entries.

> As to the committer email checks, I was a bit uncertain when
> implementing this part. One the one hand I wanted to be very specific
> that every commit was committed by someone acceptable for the project
> but this has the already pointed out draw back that it requires
> committers to rebase commits after reviewing them which doesn't work
> so hot with some peoples' work flow.

Right. That's why I'd like to drop that check, or at least make it
configurable per repository.

There's also the problem that overriding the %ce field in a rebase or
an explicit filter-branch operation could in some cases destroy useful
information (distinction between who wrote the change and who made the
original git commit), which wouldn't be a problem for the separate
commit log or the Signed-Off-By entries I mentioned.

> In a perfect world what I'd really like to see here is the ldap
> self-service have email fields and an ICLA/CCLA check boolean so that
> we could just check that the email address is connected to a CLA of
> some sort. I'm quite hesitant to remove all checks from this part
> because it's quite easy for those fields to be garbage when someone
> forgets to set their name or if someone accidentally pulls unreviewed
> code in a fat finger branch update.

I'd treat those as social problems that are best solved by social
means. Introducing technical barriers will just force people to use
awkward workarounds when things like pull requests don't work like
they'd expect. After all, there are plenty of cases where people have
accidentally committed extra files or other garbage with Subversion. I
don't see much difference here.

I wouldn't object to having the %ce check available as an option for
projects that want it or even enabled by default, but IMHO it
shouldn't be mandatory for all projects.

BR,

Jukka Zitting

Re: @apache.org commit address requirement (Was: Git hosting is go)

Posted by Paul Davis <pa...@gmail.com>.
On Wed, Dec 14, 2011 at 7:19 AM, Jukka Zitting <ju...@gmail.com> wrote:
> Hi,
>
> The asfgit pre_receive hook currently checks that each commit being
> pushed has its committer email (%ce) set to the @apache.org address of
> one of committers of the respective project.
>
> This seems too strict and makes it quite cumbersome to handle things
> like pull requests where you'd want to take in incoming commits as-is
> and just merge them to the main branch instead of rebasing or
> rewriting the commits. We can't even rely on the %ce information being
> correct since it's set on the client side.
>
> Instead of this %ce check I'd leave the commit messages as-is and
> rather augment the ref-updates.log with information about who pushed
> each individual commit. That gives us a more reliable record for tying
> individual changes to ICLAs than the %ce addresses in commit messages.
>
> Alternatively, if we do want to include explicit and properly verified
> ICLA references in the actual Git commit messages (instead of an
> out-of-band mechanism like ref-updates.log), then I'd rather use
> separate Signed-Off-By labels that the committer who pushes a set of
> commits needs to add to each commit being pushed.
>
> WDYT? I can make the changes if we can reach consensus on this.
>
> [See below for the relevant context from callback-dev@]
>
> BR,
>
> Jukka Zitting
>
> ---------- Forwarded message ----------
> From: Jukka Zitting <ju...@gmail.com>
> Date: Wed, Dec 14, 2011 at 1:58 PM
> Subject: Re: Git hosting is go
> To: callback-dev@incubator.apache.org
>
>
> Hi,
>
>
> On Wed, Dec 14, 2011 at 12:05 PM, Jukka Zitting <ju...@gmail.com> wrote:
>> I'm not sure how strict the backend is about commits pulled from
>> elsewhere. Ideally it should only record the currently authenticated
>> user or at most require a Signed-Off-By entry from the committer who's
>> pushing the commits. I'll see what I can find out.
>
> The backend check that the committer (not author) address of each
> commit being pushed is an @apache.org address of one of the committers
> authorized to push to the repository.
>
> Checking the committer address seems way too strict as it would
> require breaking history on all pull requests from outside the core
> development group. I'll bring this up on infra-dev@ [1], to see how we
> could best address the issue.
>
> [1] http://mail-archives.apache.org/mod_mbox/www-infrastructure-dev/
>
> BR,
>
> Jukka Zitting

The ref updates are already logged with the change, the ldap
authorized id, time and ip of who pushed so that should be sufficient
there.

As to the committer email checks, I was a bit uncertain when
implementing this part. One the one hand I wanted to be very specific
that every commit was committed by someone acceptable for the project
but this has the already pointed out draw back that it requires
committers to rebase commits after reviewing them which doesn't work
so hot with some peoples' work flow.

In a perfect world what I'd really like to see here is the ldap
self-service have email fields and an ICLA/CCLA check boolean so that
we could just check that the email address is connected to a CLA of
some sort. I'm quite hesitant to remove all checks from this part
because it's quite easy for those fields to be garbage when someone
forgets to set their name or if someone accidentally pulls unreviewed
code in a fat finger branch update.

Also I do realize that this isn't a security thing by any means.
Anyone could just fake these fields but at that point they'd have to
go out of their way to lie about who wrote it and the onus is still on
the person that triggered the push command. But on the other hand
checking that the committer is valid seems beneficial enough to not
drop it entirely.