Posted to issues@mesos.apache.org by "Michael Park (JIRA)" <ji...@apache.org> on 2017/11/02 10:49:00 UTC

[jira] [Commented] (MESOS-8162) Binary data causes bloat in the git repository

    [ https://issues.apache.org/jira/browse/MESOS-8162?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16235521#comment-16235521 ] 

Michael Park commented on MESOS-8162:
-------------------------------------

I experimented with cleaning these old binary files out of the repository with [BFG|https://rtyley.github.io/bfg-repo-cleaner/].
Below are the steps I took and the results I found:

[mpark/mesos|https://github.com/mpark/mesos] is a fork of [apache/mesos|https://github.com/apache/mesos], and cloning the {{apache/mesos}} repository
via {{git clone git@github.com:apache/mesos.git}} results in a *401M* directory.

I ran the following commands:
{code}
git clone --mirror git@github.com:mpark/mesos.git mesos-strip
cd mesos-strip
bfg -b 5M -p master,1.0.0,1.0.1,1.0.2,1.0.3,1.0.4,1.1.0,1.1.1,1.1.2,1.2.0,1.2.1,1.2.2,1.2.x,1.3.0,1.3.1,1.3.x,1.4.0,1.4.x
git push
{code}
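
Before rewriting anything, it can help to preview which blobs in the history exceed a size threshold. The following is a minimal sketch that uses stock git plumbing rather than BFG; the function name {{list_large_blobs}} and its byte-threshold argument are illustrative and were not part of the steps above.

```shell
# Hypothetical helper: list blobs in the whole history that are at least
# <min_bytes> large, biggest first, as "<size><TAB><path>".
list_large_blobs() {
  min_bytes="$1"
  git rev-list --objects --all |
    git cat-file --batch-check='%(objecttype) %(objectname) %(objectsize) %(rest)' |
    awk -v min="$min_bytes" '
      $1 == "blob" && $3 >= min {
        size = $3
        # Drop the type, id, and size fields; what is left is the path.
        sub(/^[^ ]+ [^ ]+ [^ ]+ /, "")
        print size "\t" $0
      }' |
    sort -rn
}
```

For example, {{list_large_blobs $((5 * 1024 * 1024))}} run inside a clone lists every blob of 5 MiB or more. Unlike the {{bfg}} run, this listing ignores which refs are protected, so it may show files that BFG would leave alone.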

I then cloned the {{mpark/mesos}} repository again via {{git clone git@github.com:mpark/mesos.git}},
and this results in a *243M* directory.
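
To put the two clone sizes in proportion, here is a small sketch of the arithmetic. The function name {{reduction_pct}} is made up for illustration; the inputs are the directory sizes reported above.

```shell
# Hypothetical helper: percentage by which a directory shrank,
# given its size before and after (same unit for both).
reduction_pct() {
  awk -v before="$1" -v after="$2" \
    'BEGIN { printf "%.1f\n", (before - after) * 100 / before }'
}

reduction_pct 401 243   # prints 39.4
```

So the rewritten repository is 158M smaller, a reduction of roughly 39%.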

I think the biggest risk is that, since we're rewriting history, virtually all of the commits get a new commit id.
I'm not exactly sure what problems we would run into, but it just feels disruptive. On the other hand,
each new commit message contains the old commit id, so it may not be all that much of a problem.
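
As a sketch of how the old ids could still be used after the rewrite: this assumes BFG's default behaviour of appending a {{Former-commit-id:}} footer to each rewritten commit message, and the function name {{find_rewritten_commit}} is hypothetical.

```shell
# Hypothetical helper: print the new commit id(s) whose message records
# the given old commit id. Assumes the rewrite left a footer of the form
# "Former-commit-id: <old id>" in each commit message.
find_rewritten_commit() {
  old_id="$1"
  git log --all --format='%H' --fixed-strings \
    --grep="Former-commit-id: ${old_id}"
}
```

Running {{find_rewritten_commit <old-id>}} in the rewritten repository would then map an id found in an old review or ticket to its new commit. An abbreviated old id also works, because the search matches the beginning of the full id in the footer.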

Among other things, the {{bfg}} command above reports:

{noformat}
Deleted files
-------------

	Filename                   Git id
	-----------------------------------------------------------------
	boost-1.51.0.tar.gz      | e461b8a4 (6.9 MB)
	grpc-1.4.2.tar.gz        | f4dfe636 (6.1 MB)
	hadoop-0.20.205.0.tar.gz | bc605a36 (93.5 MB)
	protobuf-3.2.0.tar.gz    | 3a212180 (6.5 MB), 6e9bfbfa (6.5 MB)
	protobuf-3.3.0.tar.gz    | 98fbec86 (6.7 MB)
	uming.ttc                | 2042560c (20.1 MB), 72dca440 (20.1 MB)
	zookeeper-3.3.1.tar.gz   | c67deed3 (9.5 MB)
	zookeeper-3.3.4.tar.gz   | 09d49240 (12.9 MB)
	zookeeper-3.3.6.tar.gz   | 5588107a (11.3 MB)
	zookeeper-3.4.5.tar.gz   | 1a547fe1 (15.6 MB)
	zookeeper-3.4.8.tar.gz   | a23d68be (21.2 MB)

In total, 34339 object ids were changed.
{noformat}
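
To total the sizes in a report like this, a rough sketch follows. The function name {{sum_deleted_mb}} is illustrative, and it assumes sizes appear as {{(N MB)}} exactly as in the output above.

```shell
# Hypothetical helper: read a BFG "Deleted files" report on stdin and
# print the sum of every "(<number> MB)" entry, to one decimal place.
sum_deleted_mb() {
  grep -oE '\([0-9.]+ MB\)' |
    tr -d '()' |
    awk '{ total += $1 } END { printf "%.1f\n", total }'
}
```

Fed the report above, this adds up to 236.9 MB across the 13 listed blobs. These are the blob sizes as BFG reports them, so the total need not match the on-disk difference between the two clones.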

> Binary data causes bloat in the git repository
> ----------------------------------------------
>
>                 Key: MESOS-8162
>                 URL: https://issues.apache.org/jira/browse/MESOS-8162
>             Project: Mesos
>          Issue Type: Bug
>            Reporter: Michael Park
>
> Since Git doesn't handle binary files all that well, the way in which
> the {{3rdparty}} directory is managed continues to bloat the size of our repository.
> There is a ~100M hadoop tarball from a long time ago that's still stored, a few
> older versions of Zookeeper at ~20M each, etc.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)