Posted to dev@subversion.apache.org by Josh Pieper <jj...@pobox.com> on 2004/12/13 14:21:11 UTC

Re: crash managing a large FSFS repository

Chase Phillips wrote:
> As a follow-up to my thread "svn resource requirements with large
> repositories?" (http://svn.haxx.se/users/archive-2004-11/0180.shtml), I
> was recently able to try out the same procedure with revision 12289 from
> Subversion's trunk.  With this revision I experience the same resource
> usage issues that led me to raise this issue at first.
> 
> As a refresher, our project is developing a software image that runs on
> top of NetBSD.  We need to store NetBSD in a revision-controlled
> environment to track the changes we make to the operating system and
> kernel.  I decided to create this new repository locally on the disk in
> FSFS format (our current repo that stores application source is in BDB
> format).
> 
> After importing the NetBSD source code and then copying it onto a branch,
> a subsequent checkout leads to a core dump.  I've attached one of the
> stack traces from my two attempts to this email (each attempt takes
> upwards of 15 minutes before svn dumps core).  The second stack trace
> differs from the first only in memory addresses of variables, though it
> can be sent as well if needed.

I just attempted this experiment using a local FSFS repository and
what I think is the entirety of the NetBSD source tree.  Import was
successful after 42 minutes wall time.  Maximum memory usage during
the majority of the import was 19M; however, after the final revision
file was moved into place, memory usage spiked to 44M.  It could just
be the recursive delete of the completed transaction directory.  It
had around 120,000 files in it after finalization.

Checkout was successful too, after 43 minutes wall time.  Memory usage
climbed steadily throughout the entire checkout, peaking at 85M.  I
don't recall there being a component of checkout/update that was
supposed to be linear in memory; is there?

Just for the record, the NetBSD sources I used totaled 973 megabytes, with
95,900 files or so.  The checked out working copy totaled 3,336
megabytes in size.  The experiments were on my relatively unloaded
Athlon XP 1900 with a 5400 rpm 80G hard drive.

-Josh


Re: crash managing a large FSFS repository

Posted by Simon Spero <se...@unc.edu>.
kfogel@collab.net wrote:

>middle metric was about), we can multiply 51,236 * 8 to get 409,888.
>Nice, about half a meg.  Of course, we need to add in the usual tree
>structure overhead, which is a whole hash-table per unique entry
>except for the leaf nodes.  I'm not really sure how to estimate that.
>It's more than log(51,236), but less than 51,236.  Plus we need a
>4-byte pointer per entry...
>
>  
>
Regular hashtable overhead is ~24 bytes per node.  The per-node lookup
table needn't be a hash table; a binary search table may be better,
especially if the input data is mostly sorted.  That could bring the
overhead down to ~4 bytes.
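
For illustration, a minimal sketch of such a per-node table (the names
and layout below are invented, not anything in Subversion): each
directory node keeps a sorted array of entries and finds children with
bsearch(), so the per-entry overhead is an array slot rather than a
~24-byte hash node.

   #include <stdlib.h>
   #include <string.h>

   typedef struct dir_entry_t {
     const char *name;           /* path component, interned elsewhere */
     struct dir_node_t *child;   /* NULL for plain files */
   } dir_entry_t;

   typedef struct dir_node_t {
     dir_entry_t *entries;       /* kept sorted by name */
     int nentries;
   } dir_node_t;

   /* bsearch() comparator: KEY is the component name being looked up,
      MEMBER is an element of the sorted entries array. */
   static int
   entry_cmp (const void *key, const void *member)
   {
     return strcmp ((const char *) key,
                    ((const dir_entry_t *) member)->name);
   }

   /* Return the entry named NAME in NODE, or NULL if there is none. */
   static dir_entry_t *
   lookup_child (dir_node_t *node, const char *name)
   {
     return bsearch (name, node->entries, node->nentries,
                     sizeof (*node->entries), entry_cmp);
   }
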

>So, is it really looking so much better than 9 MB, in the long run?
>
>I don't mean to be reflexively skeptical, but at least this back-of-the-envelope estimate doesn't look promising.  Maybe I'm missing something, though?
>  
>
Reflexive skepticism is what keeps us alive :)  There may also be 
interactions with the pool allocator;  I do think that path length 
explains at least 25% of the memory growth.  I think it's time to run a 
profiler and see where the memory is going (but that spoils the fun).

Re: crash managing a large FSFS repository

Posted by kf...@collab.net.
Simon Spero <se...@unc.edu> writes:
> Noise sources :
>     Original report is for  a memory spike from 19Mb ->  44Mb, so
> results on the order of megabytes are possibly significant.
> Hashtable array size is  always a power of two; hash node size is ~20
> bytes.
>    First metric was to run  find . >/tmp/find-netbsd.   Total  size is
> 8,139,654 Bytes. (wc -c)
>     Number of entries:  193,716  (wc -l)
>     Average path length: ~42 bytes
>     Measurements were made relative to '.' ;  paths in memory would be
> relative to the root of the repository.  Adding /trunk/ to start of
> each path would use an extra 6 chars per entry (~1.1MB  in this case)

Okay, so 193,716 * (42 + 6) == 9,298,368 == appx 9 MB

That's bad, but is it responsible for a 25 MB jump in memory usage?
Is there other data associated with each of these paths, whose size is
proportional to path depth?

> Second metric is to strip out everything but the last name component:
> ( sed -e 's;^.*/;;' )
>     Total size: 1,654,088
>     Number of entries:  193,716  (wc -l)
>     Average size: 8 bytes

(What was this metric for?)

This gets us 1,549,728 == appx 1.5 MB, much better than the earlier
figure, but a less impressive saving when measured against the memory
jump we're seeing.  Hmmm.  I confess I'm not sure what to think here.

> Third metric: size of interned path-name components (sed ... | sort | uniq)
>     Total size: 577,432
>     Number of unique strings: 51,236

(This was for estimating if we used a tree-style data structure, right?)

So, taking an average size of 8 bytes (ah, maybe that's what your
middle metric was about), we can multiply 51,236 * 8 to get 409,888.
Nice, about half a meg.  Of course, we need to add in the usual tree
structure overhead, which is a whole hash-table per unique entry
except for the leaf nodes.  I'm not really sure how to estimate that.
It's more than log(51,236), but less than 51,236.  Plus we need a
4-byte pointer per entry...

So, is it really looking so much better than 9 MB, in the long run?

I don't mean to be reflexively skeptical, but at least this
back-of-the-envelope estimate doesn't look promising.  Maybe I'm
missing something, though?

-Karl


Re: crash managing a large FSFS repository

Posted by Simon Spero <se...@unc.edu>.
kfogel@collab.net wrote:

>>At the moment the code uses memory roughly proportional to the total
>>lengths of all paths in the transactions.
>>    
>>
>
>is both true and the cause of our problems.  I'm pretty sure it's
>true, of course, it's the second half I'm not positive about :-).
>Are you sure that path lengths are relevant to total memory usage, or
>are they just lost in the noise?
>  
>
   
A few rough estimators:

The original problem report was for problems importing the NetBSD source
tree, so I unpacked the files from the NetBSD 2.0 source ISO.  The
original report involves fewer files (~120,000), but we're just doing
big O here.

Noise sources:
    Original report is for a memory spike from 19MB -> 44MB, so
results on the order of megabytes are possibly significant.
    Hashtable array size is always a power of two; hash node size is
~20 bytes.

First metric was to run: find . >/tmp/find-netbsd
    Total size: 8,139,654 bytes (wc -c)
    Number of entries: 193,716 (wc -l)
    Average path length: ~42 bytes
    Measurements were made relative to '.'; paths in memory would be
relative to the root of the repository.  Adding /trunk/ to start of each
path would use an extra 6 chars per entry (~1.1MB in this case)

Second metric is to strip out everything but the last name component:
( sed -e 's;^.*/;;' )
    Total size: 1,654,088
    Number of entries:  193,716  (wc -l)
    Average size: 8 bytes

Third metric: size of interned path-name components (sed ... | sort | uniq)
    Total size: 577,432
    Number of unique strings: 51,236

Bonus metric: estimate entropy using bzip2 -9 -v
    Full pathnames: 0.606 bits/byte, 92.42% saved, 8139654 in, 616601 out
    Basenames: 1.221 bits/byte, 84.73% saved, 1654088 in, 252521 out
    Interned: 2.807 bits/byte, 64.91% saved, 577432 in, 202629 out


Re: crash managing a large FSFS repository

Posted by kf...@collab.net.
Simon Spero <se...@unc.edu> writes:
> One approach to reducing the amount of memory needed would be to use a
> data structure that models directories, rather than complete paths.
> Each directory node should have its own lookup table; the keys can be
> just the name of the immediate child relative to this node.
> Intermediate nodes for path components that haven't been seen
> themselves should be marked as such;  if the path is later explicitly
> encountered, the mark can be cleared (or vice versa).
> 
> This approach requires space roughly proportional to the number of
> directories and files in the transaction, rather than total path
> length.  For big, flat namespaces, this isn't much of a win, but it
> also isn't much worse; as the name space gets deeper, and closer to
> real source repositories, the win gets bigger. This approach also
> makes it faster to determine parent/child relationships.

This is how the Subversion repository itself is structured, actually.

The current interface of fetch_all_changes() is a result of the public
API it is supporting, namely, svn_fs_paths_changed().  We could
certainly make a new svn_fs_paths_changed2() that returns the
information in a different way, and adjust the internal code
accordingly (the old code would just become the obvious wrapper,
converting the tree structure to a flat hash).  

We'd also want to write functions for accessing the tree structure,
for example:

   svn_error_t *
   svn_tree_has_path (svn_boolean_t *has_path,
                      svn_tree_t *tree,
                      const char *path);
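
A call site might then look roughly like this; nothing here exists yet,
the tree would come from the hypothetical svn_fs_paths_changed2(), and
the path is made up for illustration:

   svn_boolean_t has_path;

   SVN_ERR (svn_tree_has_path (&has_path, tree, "/trunk/sys/kern"));
   if (has_path)
     {
       /* ... something under /trunk/sys/kern was changed ... */
     }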

Before we go down this road, though, we'd want to make absolutely sure
that the problem really is total path length, that is, that the
assertion

> At the moment the code uses memory roughly proportional to the total
> lengths of all paths in the transactions.

is both true and the cause of our problems.  I'm pretty sure it's
true, of course; it's the second half I'm not positive about :-).
Are you sure that path lengths are relevant to total memory usage, or
are they just lost in the noise?


Re: crash managing a large FSFS repository

Posted by Simon Spero <se...@unc.edu>.
Eric Gillespie wrote:

>That's exactly what it is.  It was much, much worse until r11701 and
>r11706.  However, fs_fs.c:fetch_all_changes still builds a giant hash
>in memory.  I wasn't sure what to do about this, and so left it alone.
>I seem to recall asking for suggestions but not getting a response, but
>it's possible I overlooked it as I became busy elsewhere right
>afterwards.
>  
>
I've been looking at scaling issues with fs_fs, mostly at issues
related to repository size.  This issue is isolated to individual
transactions, so it's simpler to fix and test.

At the moment the code uses memory roughly proportional to the total 
lengths of all paths in the transactions.

One approach to reducing the amount of memory needed would be to use a
data structure that models directories, rather than complete paths.
Each directory node should have its own lookup table; the keys can be
just the name of the immediate child relative to this node.
Intermediate nodes for path components that haven't been seen themselves
should be marked as such; if the path is later explicitly encountered,
the mark can be cleared (or vice versa).

This approach requires space roughly proportional to the number of
directories and files in the transaction, rather than total path
length.  For big, flat namespaces, this isn't much of a win, but it also
isn't much worse; as the namespace gets deeper, and closer to real
source repositories, the win gets bigger.  This approach also makes it
faster to determine parent/child relationships.
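
For concreteness, here is a rough sketch of what such a structure could
look like using APR pools and hashes.  Every name below is invented
purely for illustration; it is not proposed API, and not how fs_fs
stores changes today.

   #include <apr_pools.h>
   #include <apr_hash.h>
   #include <apr_strings.h>

   /* One node per directory or file seen so far in the transaction. */
   typedef struct change_node_t {
     apr_hash_t *children;        /* component name -> change_node_t * */
     int explicitly_changed;      /* 0 for nodes created only as parents */
   } change_node_t;

   static change_node_t *
   change_node_create (apr_pool_t *pool)
   {
     change_node_t *node = apr_pcalloc (pool, sizeof (*node));
     node->children = apr_hash_make (pool);
     return node;
   }

   /* Record PATH (e.g. "trunk/sys/kern/kern_exec.c") under ROOT, creating
      intermediate nodes as needed and marking only the final component as
      an explicit change. */
   static void
   insert_path (change_node_t *root, const char *path, apr_pool_t *pool)
   {
     char *copy = apr_pstrdup (pool, path);
     char *last;
     char *component = apr_strtok (copy, "/", &last);
     change_node_t *node = root;

     while (component)
       {
         change_node_t *child = apr_hash_get (node->children, component,
                                              APR_HASH_KEY_STRING);
         if (! child)
           {
             child = change_node_create (pool);
             apr_hash_set (node->children, component,
                           APR_HASH_KEY_STRING, child);
           }
         node = child;
         component = apr_strtok (NULL, "/", &last);
       }

     node->explicitly_changed = 1;
   }

Each component string is then stored once per parent directory rather
than once per full path that contains it, which is where the space
saving comes from.
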

Simon


Re: crash managing a large FSFS repository

Posted by Eric Gillespie <ep...@pretzelnet.org>.
Josh Pieper <jj...@pobox.com> writes:

> the majority of the import was 19M, however, after the final revision
> file was moved into place, memory usage spiked to 44M.  It could just
> be the recursive delete of the completed transaction directory.

That's exactly what it is.  It was much, much worse until r11701
and r11706.  However, fs_fs.c:fetch_all_changes still builds a
giant hash in memory.  I wasn't sure what to do about this, and
so left it alone.  I seem to recall asking for suggestions but
not getting a response, but it's possible I overlooked it as I
became busy elsewhere right afterwards.
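
As a toy illustration of why that flat hash gets big (the names and the
value here are made up; the real table stores richer per-path change
records), every changed path contributes its whole path string as a
hash key, so memory grows with the summed length of all changed paths:

   #include <apr_pools.h>
   #include <apr_hash.h>
   #include <apr_strings.h>

   static void
   record_change (apr_hash_t *changes, const char *path, apr_pool_t *pool)
   {
     /* One full-path key string per changed path. */
     apr_hash_set (changes, apr_pstrdup (pool, path),
                   APR_HASH_KEY_STRING, "changed");
   }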

--  
Eric Gillespie <*> epg@pretzelnet.org
