Posted to dev@subversion.apache.org by Greg Stein <gs...@gmail.com> on 2011/05/25 21:49:39 UTC

Re: svn commit: r1127646 - in /subversion/trunk/subversion: svnrdump/load_editor.c tests/cmdline/svnrdump_tests.py

On Wed, May 25, 2011 at 15:33,  <cm...@apache.org> wrote:
>...
> +  /* A mapping of svn_revnum_t * dump stream revisions to their
> +     corresponding svn_revnum_t * target repository revisions. */
> +  apr_hash_t *rev_map;

How big can this grow? ie. what happens when there are several million
revisions.
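
For reference, the idiom in question presumably looks something like the
sketch below (a sketch of the usual APR pattern, not the actual
load_editor.c code; the helper names are made up). The point is that every
entry is two pool-allocated revnums plus apr_hash_t's own bookkeeping:

#include <apr_pools.h>
#include <apr_hash.h>
#include "svn_types.h"   /* svn_revnum_t, SVN_INVALID_REVNUM */

/* Record that dump stream revision FROM_REV became TO_REV in the target
   repository.  Key and value both live in POOL for the whole load. */
static void
set_rev_mapping(apr_hash_t *rev_map,
                svn_revnum_t from_rev,
                svn_revnum_t to_rev,
                apr_pool_t *pool)
{
  svn_revnum_t *key = apr_palloc(pool, sizeof(*key));
  svn_revnum_t *val = apr_palloc(pool, sizeof(*val));

  *key = from_rev;
  *val = to_rev;
  apr_hash_set(rev_map, key, sizeof(*key), val);
}

/* Return the target revision recorded for FROM_REV, or
   SVN_INVALID_REVNUM if none has been recorded yet. */
static svn_revnum_t
get_rev_mapping(apr_hash_t *rev_map, svn_revnum_t from_rev)
{
  svn_revnum_t *val = apr_hash_get(rev_map, &from_rev, sizeof(from_rev));

  return val ? *val : SVN_INVALID_REVNUM;
}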

Cheers,
-g

Re: svn commit: r1127646 - in /subversion/trunk/subversion: svnrdump/load_editor.c tests/cmdline/svnrdump_tests.py

Posted by "C. Michael Pilato" <cm...@collab.net>.
On 05/25/2011 05:41 PM, Daniel Shahaf wrote:
> Daniel Shahaf wrote on Thu, May 26, 2011 at 00:36:48 +0300:
>> I've not read the code, but would an array of 'svn_revnum_t[2]' be
>> a better representation?
>>
>> Specifically: an append-only array of [from_rev, to_rev] pairs, sorted
>> by from_rev.  That's less overhead (and we could take advantage of the
>> sorting to store less data), at the cost of O(log(n)) lookup.
>>
> 
> And the numbers... well, for a plain C array it would be 70% less than
> in Greg's analysis of the hash case, since the additional 20 bytes per
> mapping entry are gone.   (and that's before I suggest to not store
> [N+1, M+1] if [N,M] is already stored)

Great ideas.  I've filed issue #3903 to track this enhancement.  Not, IMO, a
1.7 blocker (any more than it was a 1.0 blocker).

-- 
C. Michael Pilato <cm...@collab.net>
CollabNet   <>   www.collab.net   <>   Distributed Development On Demand


Re: svn commit: r1127646 - in /subversion/trunk/subversion: svnrdump/load_editor.c tests/cmdline/svnrdump_tests.py

Posted by Daniel Shahaf <d....@daniel.shahaf.name>.
Daniel Shahaf wrote on Thu, May 26, 2011 at 00:36:48 +0300:
> I've not read the code, but would an array of 'svn_revnum_t[2]' be
> a better representation?
> 
> Specifically: an append-only array of [from_rev, to_rev] pairs, sorted
> by from_rev.  That's less overhead (and we could take advantage of the
> sorting to store less data), at the cost of O(log(n)) lookup.
> 

And the numbers... well, for a plain C array it would be 70% less than
in Greg's analysis of the hash case, since the additional 20 bytes per
mapping entry are gone.   (and that's before I suggest to not store
[N+1, M+1] if [N,M] is already stored)
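
(Spelling out the 70%: a bare [from_rev, to_rev] pair is 2 * 4 = 8 bytes on
a 32-bit build, against the roughly 28 bytes per used hash entry above,
i.e. about 30% of the space.)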

> Greg Stein wrote on Wed, May 25, 2011 at 17:21:44 -0400:
> > On Wed, May 25, 2011 at 16:08, C. Michael Pilato <cm...@collab.net> wrote:
> > > On 05/25/2011 04:05 PM, C. Michael Pilato wrote:
> > >> On 05/25/2011 03:49 PM, Greg Stein wrote:
> > >>> On Wed, May 25, 2011 at 15:33,  <cm...@apache.org> wrote:
> > >>>> ...
> > >>>> +  /* A mapping of svn_revnum_t * dump stream revisions to their
> > >>>> +     corresponding svn_revnum_t * target repository revisions. */
> > >>>> +  apr_hash_t *rev_map;
> > >>>
> > >>> How big can this grow? ie. what happens when there are several million
> > >>> revisions.
> > >>
> > >> It gets big.  (This logic and approach are copied from 'svnadmin load',
> > >> which doesn't excuse it, but might explain it.)
> > >
> > > Actually, I don't really know for sure how big it gets.  It's a mapping of
> > > sizeof(svn_revnum_t) to sizeof(svn_revnum_t), plus all the hash
> > > internals.  Anybody have any guesses?
> > 
> > struct apr_hash_entry_t is generally 20 bytes. Add in the two revnums
> > (4 bytes each), and you get 28 bytes for each *used* entry.
> > 
> > Now we also have to account for unused entries. APR has a pretty poor
> > hash table implementation. It allocates *upwards* to the nearest power
> > of two. So the internal size will grow like:
> > 
> > 1048576
> > 2097152
> > 4194304
> > 
> > One saving grace is that APR only grows when the entry count matches
> > the internal table size. It uses a "closed hash" algorithm with linked
> > lists at each bucket, so the actual load on the buckets is not
> > possible to compute. The hand-wave means that you can put in 4 million
> > mappings before it grows it up to 8 million buckets.
> > 
> > So... 4 million buckets (pointers) at 4 bytes each is about 17 megabytes.
> > Each mapping will add another 28 bytes. So: 4 million mappings is
> > about 134 megabytes. But also recognize that *reaching* that point
> > will use and toss approx the same amount of memory. So about 260 meg
> > total.
> > 
> > On a 64-bit architecture, all these values are likely to be doubled.
> > 
> > Not a machine crusher, in retrospect. But not exactly a winner either.
> > 
> > Cheers,
> > -g

Re: svn commit: r1127646 - in /subversion/trunk/subversion: svnrdump/load_editor.c tests/cmdline/svnrdump_tests.py

Posted by Daniel Shahaf <d....@daniel.shahaf.name>.
I've not read the code, but would an array of 'svn_revnum_t[2]' be
a better representation?

Specifically: an append-only array of [from_rev, to_rev] pairs, sorted
by from_rev.  That's less overhead (and we could take advantage of the
sorting to store less data), at the cost of O(log(n)) lookup.
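
A minimal sketch of what that might look like (illustrative only; the type
and function names are made up, and it assumes, as an append-only array
sorted by from_rev implies, that dump stream revisions are recorded in
increasing order):

#include "svn_types.h"          /* svn_revnum_t, SVN_INVALID_REVNUM */

typedef struct rev_pair_t
{
  svn_revnum_t from_rev;        /* revision in the dump stream */
  svn_revnum_t to_rev;          /* revision it became in the target repos */
} rev_pair_t;

/* Binary-search PAIRS (COUNT entries, sorted by from_rev) for FROM_REV;
   return the mapped revision, or SVN_INVALID_REVNUM if not found. */
static svn_revnum_t
lookup_rev(const rev_pair_t *pairs, int count, svn_revnum_t from_rev)
{
  int lo = 0;
  int hi = count - 1;

  while (lo <= hi)
    {
      int mid = lo + (hi - lo) / 2;

      if (pairs[mid].from_rev == from_rev)
        return pairs[mid].to_rev;
      else if (pairs[mid].from_rev < from_rev)
        lo = mid + 1;
      else
        hi = mid - 1;
    }

  return SVN_INVALID_REVNUM;
}

Appending is then just writing the next pair at pairs[count++] into a
growing buffer (or pushing onto an apr_array_header_t).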

Greg Stein wrote on Wed, May 25, 2011 at 17:21:44 -0400:
> On Wed, May 25, 2011 at 16:08, C. Michael Pilato <cm...@collab.net> wrote:
> > On 05/25/2011 04:05 PM, C. Michael Pilato wrote:
> >> On 05/25/2011 03:49 PM, Greg Stein wrote:
> >>> On Wed, May 25, 2011 at 15:33,  <cm...@apache.org> wrote:
> >>>> ...
> >>>> +  /* A mapping of svn_revnum_t * dump stream revisions to their
> >>>> +     corresponding svn_revnum_t * target repository revisions. */
> >>>> +  apr_hash_t *rev_map;
> >>>
> >>> How big can this grow? ie. what happens when there are several million
> >>> revisions.
> >>
> >> It gets big.  (This logic and approach are copied from 'svnadmin load',
> >> which doesn't excuse it, but might explain it.)
> >
> > Actually, I don't really know for sure how big it gets.  It's a mapping of
> > sizeof(svn_revnum_t) to sizeof(svn_revnum_t), plus all the hash
> > internals.  Anybody have any guesses?
> 
> struct apr_hash_entry_t is generally 20 bytes. Add in the two revnums
> (4 bytes each), and you get 28 bytes for each *used* entry.
> 
> Now we also have to account for unused entries. APR has a pretty poor
> hash table implementation. It allocates *upwards* to the nearest power
> of two. So the internal size will grow like:
> 
> 1048576
> 2097152
> 4194304
> 
> One saving grace is that APR only grows when the entry count matches
> the internal table size. It uses a "closed hash" algorithm with linked
> lists at each bucket, so the actual load on the buckets is not
> possible to compute. The hand-wave means that you can put in 4 million
> mappings before it grows it up to 8 million buckets.
> 
> So... 4 million buckets (pointers) at 4 bytes each is about 17 megabytes.
> Each mapping will add another 28 bytes. So: 4 million mappings is
> about 134 megabytes. But also recognize that *reaching* that point
> will use and toss approx the same amount of memory. So about 260 meg
> total.
> 
> On a 64-bit architecture, all these values are likely to be doubled.
> 
> Not a machine crusher, in retrospect. But not exactly a winner either.
> 
> Cheers,
> -g

Re: svn commit: r1127646 - in /subversion/trunk/subversion: svnrdump/load_editor.c tests/cmdline/svnrdump_tests.py

Posted by Greg Stein <gs...@gmail.com>.
On Wed, May 25, 2011 at 16:08, C. Michael Pilato <cm...@collab.net> wrote:
> On 05/25/2011 04:05 PM, C. Michael Pilato wrote:
>> On 05/25/2011 03:49 PM, Greg Stein wrote:
>>> On Wed, May 25, 2011 at 15:33,  <cm...@apache.org> wrote:
>>>> ...
>>>> +  /* A mapping of svn_revnum_t * dump stream revisions to their
>>>> +     corresponding svn_revnum_t * target repository revisions. */
>>>> +  apr_hash_t *rev_map;
>>>
>>> How big can this grow? ie. what happens when there are several million
>>> revisions.
>>
>> It gets big.  (This logic and approach are copied from 'svnadmin load',
>> which doesn't excuse it, but might explain it.)
>
> Actually, I don't really know for sure how big it gets.  It's a mapping of
> > sizeof(svn_revnum_t) to sizeof(svn_revnum_t), plus all the hash
> internals.  Anybody have any guesses?

struct apr_hash_entry_t is generally 20 bytes. Add in the two revnums
(4 bytes each), and you get 28 bytes for each *used* entry.

Now we also have to account for unused entries. APR has a pretty poor
hash table implementation. It allocates *upwards* to the nearest power
of two. So the internal size will grow like:

1048576
2097152
4194304

One saving grace is that APR only grows when the entry count matches
the internal table size. It uses a "closed hash" algorithm with linked
lists at each bucket, so the actual load on the buckets is not
possible to compute. The hand-wave means that you can put in 4 million
mappings before it grows it up to 8 million buckets.

So... 4 million buckets (pointers) at 4 bytes each is about 17 megabytes.
Each mapping will add another 28 bytes. So: 4 million mappings is
about 134 megabytes. But also recognize that *reaching* that point
will use and toss approx the same amount of memory. So about 260 meg
total.
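
Spelled out, the 134 is roughly:

  bucket pointers:  4,194,304 *  4 bytes  ~=  17 MB
  used entries:     4,194,304 * 28 bytes  ~= 117 MB
                                    total  ~= 134 MB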

On a 64-bit architecture, all these values are likely to be doubled.

Not a machine crusher, in retrospect. But not exactly a winner either.

Cheers,
-g

Re: svn commit: r1127646 - in /subversion/trunk/subversion: svnrdump/load_editor.c tests/cmdline/svnrdump_tests.py

Posted by "C. Michael Pilato" <cm...@collab.net>.
On 05/25/2011 04:05 PM, C. Michael Pilato wrote:
> On 05/25/2011 03:49 PM, Greg Stein wrote:
>> On Wed, May 25, 2011 at 15:33,  <cm...@apache.org> wrote:
>>> ...
>>> +  /* A mapping of svn_revnum_t * dump stream revisions to their
>>> +     corresponding svn_revnum_t * target repository revisions. */
>>> +  apr_hash_t *rev_map;
>>
>> How big can this grow? ie. what happens when there are several million
>> revisions.
> 
> It gets big.  (This logic and approach are copied from 'svnadmin load',
> which doesn't excuse it, but might explain it.)

Actually, I don't really know for sure how big it gets.  It's a mapping of
sizeof(svn_revnum_t) to sizeof(svn_revnum_t), plus all the hash
internals.  Anybody have any guesses?

-- 
C. Michael Pilato <cm...@collab.net>
CollabNet   <>   www.collab.net   <>   Distributed Development On Demand


Re: svn commit: r1127646 - in /subversion/trunk/subversion: svnrdump/load_editor.c tests/cmdline/svnrdump_tests.py

Posted by "C. Michael Pilato" <cm...@collab.net>.
On 05/25/2011 03:49 PM, Greg Stein wrote:
> On Wed, May 25, 2011 at 15:33,  <cm...@apache.org> wrote:
>> ...
>> +  /* A mapping of svn_revnum_t * dump stream revisions to their
>> +     corresponding svn_revnum_t * target repository revisions. */
>> +  apr_hash_t *rev_map;
> 
> How big can this grow? ie. what happens when there are several million
> revisions.

It gets big.  (This logic and approach are copied from 'svnadmin load',
which doesn't excuse it, but might explain it.)

-- 
C. Michael Pilato <cm...@collab.net>
CollabNet   <>   www.collab.net   <>   Distributed Development On Demand