Posted to legal-discuss@apache.org by Greg Stein <gs...@gmail.com> on 2017/08/07 23:51:04 UTC

Git and Provenance

Hey all,

Just wanted to confirm my thoughts about provenance, as recorded (or
deleted!) by the git tool.

The short answer is that git, unlike svn, allows a PMC to remove certain
types of development history. The master/develop branch cannot be modified
(generally), but any development that occurs on a branch can be lost.

At least a couple forms of loss that I can think of:

1) a series of commits to a branch are "squashed" into a single commit,
then merged to master. then, the branch is deleted. we no longer have the
individual commits.

2) a branch is used to construct a release, and is later deleted.

There are likely other scenarios, but having even one is enough for my
query/discussion.
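
To make scenario 1 concrete, here is a toy model of a commit graph. It is only a sketch: the commit names (m1, f1, s1, ...) and the branch name "feature" are hypothetical, and a real repository also keeps unreferenced objects around until garbage collection. It shows that once master points at the squashed commit and the branch ref is deleted, nothing reachable from a ref leads back to the individual commits.

```python
# Toy commit graph: each commit maps to the list of its parents.
# All commit and branch names here are made up for illustration.

def reachable(refs, parents):
    """Return every commit reachable from any ref by following parent links."""
    seen = set()
    stack = list(refs.values())
    while stack:
        commit = stack.pop()
        if commit in seen:
            continue
        seen.add(commit)
        stack.extend(parents[commit])
    return seen

parents = {
    "m1": [],        # commit on master where the branch forked
    "f1": ["m1"],    # three commits made on the feature branch
    "f2": ["f1"],
    "f3": ["f2"],
    "s1": ["m1"],    # squashed commit: same content as f3, single parent m1
}
refs = {"master": "m1", "feature": "f3"}
before = reachable(refs, parents)

# Squash-merge: master moves to s1, then the feature branch is deleted.
refs["master"] = "s1"
del refs["feature"]
after = reachable(refs, parents)

lost = sorted(before - after)
print(lost)  # ['f1', 'f2', 'f3']
```

The squashed commit s1 has no parent link to f1..f3, which is why the individual commits drop out of the reachable history rather than merely being hidden.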

The ASF will capture email diffs and push logs of all changes made, to all
branches. These are stored in our email archives and in a push log
database. So provenance might not be stored entirely in the repository, but
we still have all the data (caveat: bugs in our recording).

I believe this is sufficient for the Foundation's needs.

Given the various bits above, the Infrastructure Team doesn't have any
plans to change things, but I felt it best to confirm/ask.

Thanks,
Greg Stein
Infrastructure Administrator, ASF


ps. IMO, we have very little need for provenance/history; in actuality,
this seems more like data for downstream users, in the event of (legal)
trouble

Re: Git and Provenance

Posted by Daniel Shahaf <d....@daniel.shahaf.name>.
Stian Soiland-Reyes wrote on Fri, 11 Aug 2017 09:51 +0100:
> A diff email will contain the commit IDs which are content-addressable, and
> the git server should (?) not accept a blob with the wrong checksum

IIRC that requires setting transfer.fsckObjects=true on the server.  (that knob defaults to false)

---------------------------------------------------------------------
To unsubscribe, e-mail: legal-discuss-unsubscribe@apache.org
For additional commands, e-mail: legal-discuss-help@apache.org


Re: Git and Provenance

Posted by Stian Soiland-Reyes <st...@apache.org>.
Agree that this is sufficient from a legal perspective. If someone
maliciously wanted to take a contribution there would be many other attack
vectors. We also still assume ASF committers are good citizens, e.g. don't
share passwords, and that PMCs provide oversight on commit activity in
their repositories.

(We could in theory be stricter and require gpg signed git merge commits,
like Eclipse, which could clarify better which committer merged an external
contribution)

A diff email will contain the commit IDs which are content-addressable, and
the git server should (?) not accept a blob with the wrong checksum - so
the email record is also a receipt for the git server accepting the commit,
even if that commit is later deleted from a transient branch.
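
As a rough illustration of what "content-addressable" means here, the sketch below computes a blob ID the way git does for SHA-1 repositories: the hash covers a short header plus the raw bytes. The function name git_blob_id is mine, not part of any git API, and this assumes the SHA-1 object format. Whether the server actually re-verifies what it receives is a separate configuration question.

```python
import hashlib

def git_blob_id(content: bytes) -> str:
    """Object ID git assigns to a blob with these bytes (SHA-1 object format)."""
    # git hashes "blob <size>\0" followed by the content itself
    header = b"blob " + str(len(content)).encode("ascii") + b"\0"
    return hashlib.sha1(header + content).hexdigest()

original = b"hello world\n"
tampered = b"hello w0rld\n"   # one byte changed

print(git_blob_id(original))
print(git_blob_id(original) == git_blob_id(tampered))  # False
```

Commit IDs are built the same way over the commit's text, which names the tree and parent IDs, so an ID quoted in a diff email pins down the entire content it was computed from.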


On 8 Aug 2017 1:51 am, "Greg Stein" <gs...@gmail.com> wrote:

> Hey all,
>
> Just wanted to confirm my thoughts about provenance, as recorded (or
> deleted!) by the git tool.
>
> The short answer is that git, unlike svn, allows a PMC to remove certain
> types of development history. The master/develop branch cannot be modified
> (generally), but any development that occurs on a branch can be lost.
>
> At least a couple forms of loss that I can think of:
>
> 1) a series of commits to a branch are "squashed" into a single commit,
> then merged to master. then, the branch is deleted. we no longer have the
> individual commits.
>
> 2) a branch is used to construct a release, and is later deleted.
>
> There are likely other scenarios, but having even one is enough for my
> query/discussion.
>
> The ASF will capture email diffs and push logs of all changes made, to all
> branches. These are stored in our email archives and in a push log
> database. So provenance might not be stored entirely in the repository, but
> we still have all the data (caveat: bugs in our recording).
>
> I believe this is sufficient for the Foundation's needs.
>
> Given the various bits above, the Infrastructure Team doesn't have any
> plans to change things, but I felt it best to confirm/ask.
>
> Thanks,
> Greg Stein
> Infrastructure Administrator, ASF
>
>
> ps. IMO, we have very little need for provenance/history; in actuality,
> this seems more like data for downstream users, in the event of (legal)
> trouble
>
>

Re: Git and Provenance

Posted by Jim Jagielski <ji...@jaguNET.com>.
+1 from my PoV.

> On Aug 7, 2017, at 7:51 PM, Greg Stein <gs...@gmail.com> wrote:
> 
> Hey all,
> 
> Just wanted to confirm my thoughts about provenance, as recorded (or deleted!) by the git tool.
> 
> The short answer is that git, unlike svn, allows a PMC to remove certain types of development history. The master/develop branch cannot be modified (generally), but any development that occurs on a branch can be lost.
> 
> At least a couple forms of loss that I can think of:
> 
> 1) a series of commits to a branch are "squashed" into a single commit, then merged to master. then, the branch is deleted. we no longer have the individual commits.
> 
> 2) a branch is used to construct a release, and is later deleted.
> 
> There are likely other scenarios, but having even one is enough for my query/discussion.
> 
> The ASF will capture email diffs and push logs of all changes made, to all branches. These are stored in our email archives and in a push log database. So provenance might not be stored entirely in the repository, but we still have all the data (caveat: bugs in our recording).
> 
> I believe this is sufficient for the Foundation's needs.
> 
> Given the various bits above, the Infrastructure Team doesn't have any plans to change things, but I felt it best to confirm/ask.
> 
> Thanks,
> Greg Stein
> Infrastructure Administrator, ASF
> 
> 
> ps. IMO, we have very little need for provenance/history; in actuality, this seems more like data for downstream users, in the event of (legal) trouble
> 




Re: Git and Provenance

Posted by Greg Stein <gs...@gmail.com>.
On Aug 9, 2017 12:41, "Daniel Shahaf" <d....@daniel.shahaf.name> wrote:
>...

> > This is where (IMO) we decide the business risk is so freakishly low, that
> > we do not require any operational changes to mitigate that risk. (*)
>
> Yes, but the cost of
> archiving every commit object ever seen on *git*.apache.org is _also_
> freakishly low.  (The storage cost of commit objects is bounded by some
> constant times the typing speed of whoever authored the commit...)


We were seeing problems keeping branches, when the upstream repo had
deleted them. Our sync code now uses --prune to toss such work, to keep the
repos synced.
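
For readers unfamiliar with pruning, here is a simplified model of the effect. It is not the actual sync code; the function sync_refs and the branch and commit names are hypothetical. Without pruning, a mirror keeps refs for branches that were deleted upstream; with pruning, those refs are dropped so the mirror matches upstream.

```python
def sync_refs(mirror, upstream, prune):
    """Return the mirror's refs after fetching from upstream (toy model)."""
    synced = dict(mirror)
    synced.update(upstream)            # new and moved branches are copied over
    if prune:
        for name in list(synced):
            if name not in upstream:
                del synced[name]       # branch deleted upstream: drop it here too
    return synced

upstream = {"master": "c9"}                      # release branch was deleted upstream
mirror = {"master": "c7", "release-1.x": "c5"}   # mirror still has the old branch

kept = sync_refs(mirror, upstream, prune=False)
pruned = sync_refs(mirror, upstream, prune=True)
print(sorted(kept))    # ['master', 'release-1.x']
print(sorted(pruned))  # ['master']
```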

Cheers,
-g

Re: Git and Provenance

Posted by Daniel Shahaf <d....@daniel.shahaf.name>.
Daniel Shahaf wrote on Wed, 09 Aug 2017 17:41 +0000:
> Greg Stein wrote on Wed, 09 Aug 2017 06:37 -0500:
> > This is where (IMO) we decide the business risk is so freakishly low, that
> > we do not require any operational changes to mitigate that risk. (*)
> 
> Yes, but the cost of for

s/for//

> archiving every commit object ever seen on *git*.apache.org is _also_
> freakishly low.  (The storage cost of commit objects is bounded by some
> constant times the typing speed of whoever authored the commit...)




Re: Git and Provenance

Posted by Daniel Shahaf <d....@daniel.shahaf.name>.
Greg Stein wrote on Wed, 09 Aug 2017 06:37 -0500:
> On Tue, Aug 8, 2017 at 10:08 AM, Daniel Shahaf <d....@daniel.shahaf.name>
> wrote:
> 
> > Greg Stein wrote on Mon, 07 Aug 2017 18:51 -0500:
> > > The ASF will capture email diffs and push logs of all changes made, to
> > all
> > > branches. These are stored in our email archives and in a push log
> > > database. So provenance might not be stored entirely in the repository,
> > but
> > > we still have all the data (caveat: bugs in our recording).
> >
> > Isn't there a size limit on commit diffs, any commit larger than
> > which isn't fully recorded in the mail archives?
> >
> > Also, trying to _use_ commit emails as a history-digging subject isn't
> > going
> > to be very friendly.  (You can't run log/blame on a mailing list archive)
> >
> 
> Agreed. Friendly isn't a requirement in this case, however.
> 
> Recall that we're talking edge cases here, about proving provenance. And an
> even further edge case is the ASF *defending* provenance. ... With that in
> mind, my court-untested belief is that having half-a-change throws the ball
> to the other side of the court. "Looks like my ICLA-covered peep committed
> this change, and the ICLA says it was Proper. ... Your ball: prove
> otherwise."
> 

IANAL.

> Second, the "only in an email archive" would imply it never got merged to
> master/trunk/develop. Or that it got merged to a deleted release branch.
> (or whatever other scenarios, I don't have in mind) ... Those are even
> further edge cases.
> 

You can make it a lot more of a center case by assuming a project that
supports two major release lines in parallel.  Only one of the two
major lines can be master/develop/trunk, the other is then fair game
for deletes.

> This is where (IMO) we decide the business risk is so freakishly low, that
> we do not require any operational changes to mitigate that risk. (*)

Yes, but the cost of
archiving every commit object ever seen on *git*.apache.org is _also_
freakishly low.  (The storage cost of commit objects is bounded by some
constant times the typing speed of whoever authored the commit...)

> (*) and even then, we contend with other factors like intent, fraud by ICLA
> signers, downstream mispackaging, or a million other things.

What we do about provenance of honest committers has little effect on
these "other factors".

Cheers,

Daniel



Re: Git and Provenance

Posted by Greg Stein <gs...@gmail.com>.
On Tue, Aug 8, 2017 at 10:08 AM, Daniel Shahaf <d....@daniel.shahaf.name>
wrote:

> Greg Stein wrote on Mon, 07 Aug 2017 18:51 -0500:
> > The ASF will capture email diffs and push logs of all changes made, to
> all
> > branches. These are stored in our email archives and in a push log
> > database. So provenance might not be stored entirely in the repository,
> but
> > we still have all the data (caveat: bugs in our recording).
>
> Isn't there a size limit on commit diffs, any commit larger than
> which isn't fully recorded in the mail archives?
>
> Also, trying to _use_ commit emails as a history-digging subject isn't
> going
> to be very friendly.  (You can't run log/blame on a mailing list archive)
>

Agreed. Friendly isn't a requirement in this case, however.

Recall that we're talking edge cases here, about proving provenance. And an
even further edge case is the ASF *defending* provenance. ... With that in
mind, my court-untested belief is that having half-a-change throws the ball
to the other side of the court. "Looks like my ICLA-covered peep committed
this change, and the ICLA says it was Proper. ... Your ball: prove
otherwise."

Second, the "only in an email archive" would imply it never got merged to
master/trunk/develop. Or that it got merged to a deleted release branch.
(or whatever other scenarios, I don't have in mind) ... Those are even
further edge cases.

This is where (IMO) we decide the business risk is so freakishly low, that
we do not require any operational changes to mitigate that risk. (*)

Cheers,
-g

(*) and even then, we contend with other factors like intent, fraud by ICLA
signers, downstream mispackaging, or a million other things.

Re: Git and Provenance

Posted by Daniel Shahaf <d....@daniel.shahaf.name>.
Greg Stein wrote on Mon, 07 Aug 2017 18:51 -0500:
> The ASF will capture email diffs and push logs of all changes made, to all
> branches. These are stored in our email archives and in a push log
> database. So provenance might not be stored entirely in the repository, but
> we still have all the data (caveat: bugs in our recording).

Isn't there a size limit on commit diffs, any commit larger than
which isn't fully recorded in the mail archives?

Also, trying to _use_ commit emails as a history-digging subject isn't going
to be very friendly.  (You can't run log/blame on a mailing list archive)



Re: Git and Provenance

Posted by Roman Shaposhnik <ro...@shaposhnik.org>.
On Mon, Aug 7, 2017 at 4:51 PM, Greg Stein <gs...@gmail.com> wrote:
> Hey all,
>
> Just wanted to confirm my thoughts about provenance, as recorded (or
> deleted!) by the git tool.
>
> The short answer is that git, unlike svn, allows a PMC to remove certain
> types of development history. The master/develop branch cannot be modified
> (generally), but any development that occurs on a branch can be lost.
>
> At least a couple forms of loss that I can think of:
>
> 1) a series of commits to a branch are "squashed" into a single commit, then
> merged to master. then, the branch is deleted. we no longer have the
> individual commits.
>
> 2) a branch is used to construct a release, and is later deleted.
>
> There are likely other scenarios, but having even one is enough for my
> query/discussion.
>
> The ASF will capture email diffs and push logs of all changes made, to all
> branches. These are stored in our email archives and in a push log database.
> So provenance might not be stored entirely in the repository, but we still
> have all the data (caveat: bugs in our recording).
>
> I believe this is sufficient for the Foundation's needs.
>
> Given the various bits above, the Infrastructure Team doesn't have any plans
> to change things, but I felt it best to confirm/ask.
>
> Thanks,
> Greg Stein
> Infrastructure Administrator, ASF
>
>
> ps. IMO, we have very little need for provenance/history; in actuality, this
> seems more like data for downstream users, in the event of (legal) trouble

+1 to the above -- I agree with your thinking and analysis. This also matches
the understanding of THE thread on Git that led to a lot of policy we
have today:
    https://lists.apache.org/thread.html/1e25b62f2bee2ea7ad33b5cbb79b1ae95f933f57eda1a7ee724cfacb@1448058847@%3Cboard.apache.org%3E

Thanks,
Roman.



Re: Git and Provenance

Posted by Chris Mattmann <ma...@apache.org>.
Greg,

The below is also my understanding and I believe it is correct. In short, keep doing what you are
doing and I have no concerns from Legal Committee’s perspective.

Cheers,
Chris

From: Greg Stein <gs...@gmail.com>
Reply-To: "legal-discuss@apache.org" <le...@apache.org>
Date: Monday, August 7, 2017 at 4:51 PM
To: "legal-discuss@apache.org" <le...@apache.org>
Subject: Git and Provenance

 

Hey all,

Just wanted to confirm my thoughts about provenance, as recorded (or deleted!) by the git tool.

The short answer is that git, unlike svn, allows a PMC to remove certain types of development history. The master/develop branch cannot be modified (generally), but any development that occurs on a branch can be lost.

At least a couple forms of loss that I can think of:

1) a series of commits to a branch are "squashed" into a single commit, then merged to master. then, the branch is deleted. we no longer have the individual commits.

2) a branch is used to construct a release, and is later deleted.

There are likely other scenarios, but having even one is enough for my query/discussion.

The ASF will capture email diffs and push logs of all changes made, to all branches. These are stored in our email archives and in a push log database. So provenance might not be stored entirely in the repository, but we still have all the data (caveat: bugs in our recording).

I believe this is sufficient for the Foundation's needs.

Given the various bits above, the Infrastructure Team doesn't have any plans to change things, but I felt it best to confirm/ask.

Thanks,
Greg Stein
Infrastructure Administrator, ASF

ps. IMO, we have very little need for provenance/history; in actuality, this seems more like data for downstream users, in the event of (legal) trouble