Posted to dev@netbeans.apache.org by Emilian Bold <em...@apache.org> on 2016/10/14 11:06:55 UTC

Shallow git clones as an option to a split repository

Hello,

I've recently learned that git allows 'shallow' clones, which may contain no
history whatsoever.

See the git clone manual <https://git-scm.com/docs/git-clone>, specifically
the --depth parameter.

Obviously this will be a huge bandwidth, time and disk saver for some
people.

And it seems that git even supports push / pull from shallow repositories.
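As a concrete illustration (using a throwaway local repository, since the exact clone URL is beside the point here), a shallow clone transfers only the most recent commits, and the history can be deepened or fully restored later:

```shell
# Build a throwaway repository with two commits
tmp=$(mktemp -d)
git init -q "$tmp/origin"
git -C "$tmp/origin" -c user.email=a@example.org -c user.name=demo \
    commit -q --allow-empty -m "first"
git -C "$tmp/origin" -c user.email=a@example.org -c user.name=demo \
    commit -q --allow-empty -m "second"

# A depth-1 clone transfers only the latest commit...
git clone -q --depth 1 "file://$tmp/origin" "$tmp/shallow"
git -C "$tmp/shallow" rev-list --count HEAD   # prints 1

# ...and the full history can be fetched later if needed
git -C "$tmp/shallow" fetch -q --unshallow
git -C "$tmp/shallow" rev-list --count HEAD   # prints 2
```

(`git fetch --deepen=<n>` also exists for extending the history incrementally instead of all at once.)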

I believe this would permit us to still use a single unaltered repository
while allowing users (or GitHub mirrors) to be shallow.

PS: Philosophically speaking, I see all this discussion about repository
size and history stripping as a failure of DVCSs
<https://en.wikipedia.org/wiki/Distributed_version_control> and/or of the
Internet infrastructure. Removing history is the equivalent of removing
comments to save disk space.

Re: Shallow git clones as an option to a split repository

Posted by Emilian Bold <em...@gmail.com>.
@Bruno see the thread "[PROPOSAL] Split the main NetBeans repo", which
discusses splitting the repo per cluster. It's a pretty good idea and the
history loss is minimal. Probably the way we are going to go.


--emi


Re: Shallow git clones as an option to a split repository

Posted by Bruno Souza <br...@javaman.com.br>.
I think the issue is not whether to keep the history or not... I think
clearly the BEST thing is to keep the history. Even if you never touch it,
you still need it for attribution at least!

The issue is: things are too big. They are not big because Git can't handle
big things, but perhaps because the codebase was centralized and run as a
single thing in a single company. Maybe it makes sense to start separating
things a bit, to make it easier for others to join in!

NetBeans is a HUGE codebase, but it is also very modular! I'm sure we
could separate things into their own repositories, and that alone would make
it easier for others to contribute, and even to reuse pieces in other Apache
projects.

So, instead of discussing the size or the download, wouldn't it be a more
valid discussion to see if there is a reasonable way to split NetBeans into
a (small) set of meaningful repositories?

Git has the concept of "submodules" and also "subtrees". There is even a
"subrepo" command[1] that improves on both ideas. Any of those would allow
us to include "sub-repositories" inside a main repository. So, let's say we
divided NetBeans at the Java "package" level: we could still have a
"NetBeans" repository that would reference the whole codebase as a single
"thing", but most of the project would actually be handled in the
sub-projects.
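To make that concrete, here is a minimal sketch of the submodule variant. The "java cluster" repository and the layout are hypothetical, purely for illustration (a throwaway local repo stands in for a real cluster repo):

```shell
# Build a throwaway "cluster" repository to stand in for e.g. a Java cluster
tmp=$(mktemp -d)
git init -q "$tmp/java-cluster"
git -C "$tmp/java-cluster" -c user.email=a@example.org -c user.name=demo \
    commit -q --allow-empty -m "cluster content"

# The top-level "netbeans" repository references it as a submodule
git init -q "$tmp/netbeans"
cd "$tmp/netbeans"
git -c protocol.file.allow=always \
    submodule add -q "file://$tmp/java-cluster" java
git -c user.email=a@example.org -c user.name=demo \
    commit -q -m "Reference the java cluster as a submodule"

# .gitmodules now records the mapping; contributors clone the top-level
# repo and run "git submodule update --init <path>" only for what they need
git config -f .gitmodules --get submodule.java.path
```

(The `protocol.file.allow=always` setting is only needed because this demo uses a file:// URL; real hosted URLs don't need it. With subtrees or git-subrepo the layout looks similar, but the sub-project history is stored in the main repository itself rather than referenced by commit.)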

Could this be a doable option?

Cheers!
Bruno.

[1] https://github.com/ingydotnet/git-subrepo#readme

______________________________________________________________________
Bruno Peres Ferreira de Souza                         Brazil's JavaMan
http://www.javaman.com.br                      bruno at javaman.com.br
     if I fail, if I succeed, at least I live as I believe



Re: Shallow git clones as an option to a split repository

Posted by Wade Chandler <co...@wadechandler.com>.
> On Oct 14, 2016, at 07:06, Emilian Bold <em...@apache.org> wrote:
> 
> Hello,
> 
> I've recently learned git allows 'shallow' clones that may contain no
> history whatsoever.
> 
> See the git clone manual <https://git-scm.com/docs/git-clone>, specifically
> the --depth parameter.
> 
> Obviously this will be a huge bandwidth, time and disk saver for some
> people.
> 

I agree shallow git clones are great. I think I would use them even with smaller repos until I needed to know more.

> And it seems that git even supports push / pull from shallow repositories.
> 
> I believe this would permit us to still use a single unaltered repository
> while allowing users (or GitHub mirrors) to be shallow.
> 

Yes, but then the whole is still much larger. The repository is 1GB just for the sources. If I’m working on Groovy, Java, and Core, then I don’t need PHP, C/C++, or others, and frankly they are out of context in that case. I think perhaps as a start we look at how to get moved over, since we have to be able to put it in the infra regardless of thoughts on this, and then figure something out. I.e., it isn’t scalable, IMO, for everyone working on every technology to have to contribute and merge up with everyone else working on other technologies unless they are actually changing some central thing.

> PS: Philosophically speaking, I see all this discussion about repository
> size and history stripping as a failure of DVCS
> <https://en.wikipedia.org/wiki/Distributed_version_control>s and/or of the
> Internet infrastructure. Removing history is the equivalent of removing
> comments to save disk space.

I don’t think that last statement is necessarily accurate. I mean, if a file has so many changes that those old depths are irrelevant and useless, then what meaning do they have? It is hard to make a case that they are useful after some time. To me it is like keeping too much stuff in the house because we are afraid to get rid of it. If you will never touch it, does it have any meaning? You might keep something, and some time down the road you go “Man, if I had that I could have made 10,000 bucks!”, but if you had sold off old stuff and saved the money as you went through life, you probably would have had more money instantly available. And the times you had that 10,000 dollar item lying around were probably so rare you can’t remember them, or they never happened. Maybe a bad analogy, but I think there is still a point when history is just stale, and even if slightly useful, not very, due to the complication of its relevance to “now” at any point in time; the bigger the depth of a file’s history, the bigger the complexity between depth N and depth 1, IMO.

On the DVCS stuff, I don’t know. It is like the “cloud”. Smaller things just scale better, until not only disk space but also bandwidth gets cheaper and more available. Even in large networks like AWS, smaller drives scale better for some problems, whereas bigger ones don’t, because you are dealing with so many connections and data pools. Even if we were using SVN, if we depended on pulling down all of C++, Python, PHP, Java, Groovy, etc. just to work on, say, JavaScript, and those things made the pull over 1GB, I think the same problem would exist, and personally I don’t find it practical. So, I see it as a problem of structure more than a problem with the technology…at least until we have quantum SSDs and quantum-entanglement-driven networks :-D

Wade