Posted to dev@subversion.apache.org by Josh Pieper <jj...@pobox.com> on 2004/12/13 14:21:11 UTC
Re: crash managing a large FSFS repository
Chase Phillips wrote:
> As a follow-up to my thread "svn resource requirements with large
> repositories?" (http://svn.haxx.se/users/archive-2004-11/0180.shtml), I
> was recently able to try out the same procedure with revision 12289 from
> Subversion's trunk. With this revision I experience the same resource
> usage issues that led me to raise this issue at first.
>
> As a refresher, our project is developing a software image that runs on
> top of NetBSD. We need to store NetBSD in a revision-controlled
> environment to track the changes we make to the operating system and
> kernel. I decided to create this new repository locally on the disk in
> FSFS format (our current repo that stores application source is in BDB
> format).
>
> After importing the NetBSD source code and then copying it onto a branch,
> a subsequent checkout leads to a core dump. I've attached one of the
> stack traces from my two attempts to this email (each attempt takes
> upwards of 15 minutes before svn dumps core). The second stack trace
> differs from the first only in memory addresses of variables, though it
> can be sent as well if needed.
I just attempted this experiment using a local FSFS repository and
what I think is the entirety of the NetBSD source tree. Import was
successful after 42 minutes wall time. Maximum memory usage during
the majority of the import was 19M, however, after the final revision
file was moved into place, memory usage spiked to 44M. It could just
be the recursive delete of the completed transaction directory. It
had around 120,000 files in it after finalization.
Checkout was successful too, after 43 minutes wall time. Memory usage
climbed steadily throughout the entire checkout, peaking at 85M. I
don't remember there being a component of checkout/update that was
supposed to be linear in memory; is there one?
Just for the record, the NetBSD sources I used totaled 973 megabytes, with
95,900 files or so. The checked out working copy totaled 3,336
megabytes in size. The experiments were on my relatively unloaded
Athlon XP 1900 with a 5400 rpm 80G hard drive.
-Josh
---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@subversion.tigris.org
For additional commands, e-mail: dev-help@subversion.tigris.org
Re: crash managing a large FSFS repository
Posted by Simon Spero <se...@unc.edu>.
kfogel@collab.net wrote:
>middle metric was about), we can multiply 51,236 * 8 to get 409,888.
>Nice, about half a meg. Of course, we need to add in the usual tree
>structure overhead, which is a whole hash-table per unique entry
>except for the leaf nodes. I'm not really sure how to estimate that.
>It's more than log(51,236), but less than 51,236. Plus we need a
>4-byte pointer per entry...
>
>
>
Regular hashtable overhead is ~ 24 bytes per node. The per node lookup
table needn't be a hash table; a binary search table may be better,
especially if the input data is mostly sorted. That could bring the
overhead down to ~4 bytes.
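To make the comparison concrete, here is a rough sketch using the per-node sizes assumed in this message (~24 bytes per hash node vs. ~4 bytes per sorted-array slot); these are discussion estimates, not measurements of APR's actual allocator:

```python
# Back-of-envelope comparison of per-entry lookup-table overhead for the
# 51,236 unique path components measured elsewhere in this thread.
# Byte counts are the assumptions from the discussion, not APR internals.

ENTRIES = 51_236

HASH_NODE_BYTES = 24     # assumed per-node overhead of a regular hash table
BSEARCH_SLOT_BYTES = 4   # assumed per-slot cost of a sorted pointer array

hash_overhead = ENTRIES * HASH_NODE_BYTES
bsearch_overhead = ENTRIES * BSEARCH_SLOT_BYTES

print(f"hash table:    {hash_overhead:,} bytes")     # ~1.2 MB
print(f"binary search: {bsearch_overhead:,} bytes")  # ~0.2 MB
```

On mostly-sorted input a binary-search table also avoids rehashing, which is why it may be the better fit here.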
>So, is it really looking so much better than 9 MB, in the long run?
>
>I don't mean to be reflexively skeptical, but at least this back-of-the-envelope estimate doesn't look promising. Maybe I'm missing something, though?
>
>
Reflexive skepticism is what keeps us alive :) There may also be
interactions with the pool allocator; I do think that path length
explains at least 25% of the memory growth. I think it's time to run a
profiler and see where the memory is going (but that spoils the fun).
Re: crash managing a large FSFS repository
Posted by kf...@collab.net.
Simon Spero <se...@unc.edu> writes:
> Noise sources :
> Original report is for a memory spike from 19 MB -> 44 MB, so
> results on the order of megabytes are possibly significant.
> Hashtable array size is always a power of two; hash node size is ~20
> bytes.
> First metric was to run find . >/tmp/find-netbsd. Total size is
> 8,139,654 Bytes. (wc -c)
> Number of entries: 193,716 (wc -l)
> Average path length: ~42 bytes
> Measurements were made relative to '.' ; paths in memory would be
> relative to the root of the repository. Adding /trunk/ to start of
> each path would use an extra 6 chars per entry (~1.1MB in this case)
Okay, so 193,716 * (42 + 6) == 9,298,368 == appx 9 MB
That's bad, but is it responsible for a 25 MB jump in memory usage?
Is there other data associated with each of these paths, whose size is
proportional to path depth?
> Second metric is to strip out everything but the last name component:
> ( sed -e 's;^.*/;;' )
> Total size: 1,654,088
> Number of entries: 193,716 (wc -l)
> Average size: 8 bytes
(What was this metric for?)
This gets us 1549728 == appx 1.5 MB, much better relative to the
earlier figure, but somewhat less better relative to the mem jump
we're seeing. Hmmm. I confess I'm not sure what to think here.
> Third metric: size of interned path-name components (sed ... | sort | uniq)
> Total size: 577,432
> Number of unique strings: 51,236
(This was for estimating if we used a tree-style data structure, right?)
So, taking an average size of 8 bytes (ah, maybe that's what your
middle metric was about), we can multiply 51,236 * 8 to get 409,888.
Nice, about half a meg. Of course, we need to add in the usual tree
structure overhead, which is a whole hash-table per unique entry
except for the leaf nodes. I'm not really sure how to estimate that.
It's more than log(51,236), but less than 51,236. Plus we need a
4-byte pointer per entry...
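For the record, the arithmetic checks out; a quick sketch with every constant taken from Simon's metrics above:

```python
# Reproduce the back-of-the-envelope memory estimates from the thread.
entries = 193_716           # entries from `wc -l` on the find output
avg_path = 42               # average full-path length, bytes
trunk_prefix = 6            # extra chars for a leading "/trunk/"
avg_basename = 8            # average last-component length, bytes
unique_components = 51_236  # from sed ... | sort | uniq

full_paths = entries * (avg_path + trunk_prefix)  # storing whole paths
basenames_only = entries * avg_basename           # last component per entry
interned = unique_components * avg_basename       # interned components

print(f"{full_paths:,} / {basenames_only:,} / {interned:,}")
```

The tree-structure overhead (a lookup table per interior node, plus a pointer per entry) comes on top of that last figure, which is exactly the part that is hard to estimate.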
So, is it really looking so much better than 9 MB, in the long run?
I don't mean to be reflexively skeptical, but at least this
back-of-the-envelope estimate doesn't look promising. Maybe I'm
missing something, though?
-Karl
Re: crash managing a large FSFS repository
Posted by Simon Spero <se...@unc.edu>.
kfogel@collab.net wrote:
>>At the moment the code uses memory roughly proportional to the total
>>lengths of all paths in the transactions.
>>
>>
>
>is both true and the cause of our problems. I'm pretty sure it's
>true, of course, it's the second half I'm not positive about :-).
>Are you sure that path lengths are relevant to total memory usage, or
>are they just lost in the noise?
>
>
A few rough estimators:
The original problem report was for problems importing the NetBSD source
tree, so I unpacked the files from the NetBSD 2.0 source iso.
Original report is for fewer files (~120,000), but we're just doing
big O here.
Noise sources :
Original report is for a memory spike from 19 MB -> 44 MB, so
results on the order of megabytes are possibly significant.
Hashtable array size is always a power of two; hash node size is
~20 bytes.
First metric was to run find . >/tmp/find-netbsd.
Total size is 8,139,654 Bytes. (wc -c)
Number of entries: 193,716 (wc -l)
Average path length: ~42 bytes
Measurements were made relative to '.' ; paths in memory would be
relative to the root of the repository. Adding /trunk/ to start of each
path would use an extra 6 chars per entry (~1.1MB in this case)
Second metric is to strip out everything but the last name component: (
sed -e 's;^.*/;;' )
Total size: 1,654,088
Number of entries: 193,716 (wc -l)
Average size: 8 bytes
Third metric: size of interned path-name components (sed ... | sort | uniq)
Total size: 577,432
Number of unique strings: 51,236
Bonus metric: estimate entropy using bzip2 -9 -v
Full pathnames : 0.606 bits/byte, 92.42% saved, 8139654 in, 616601 out
Basenames: 1.221 bits/byte, 84.73% saved, 1654088 in, 252521 out.
Interned: 2.807 bits/byte, 64.91% saved, 577432 in, 202629 out
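For anyone without a NetBSD tree handy, the first three metrics can be approximated in a few lines of Python; the sample tree below is synthetic (point `root` at a real checkout to reproduce the numbers above):

```python
import os
import tempfile

# Recreate Simon's metrics: total path bytes (find | wc -c), total
# basename bytes, and count of interned (unique) path components.
# Each entry is counted with a trailing newline, as wc -c would see it.

def path_metrics(root):
    paths = []
    for dirpath, dirnames, filenames in os.walk(root):
        for name in dirnames + filenames:
            paths.append(os.path.relpath(os.path.join(dirpath, name), root))
    total_path_bytes = sum(len(p) + 1 for p in paths)
    basenames = [os.path.basename(p) for p in paths]
    total_basename_bytes = sum(len(b) + 1 for b in basenames)
    interned = {comp for p in paths for comp in p.split(os.sep)}
    return len(paths), total_path_bytes, total_basename_bytes, len(interned)

# Throwaway sample tree standing in for a real source checkout.
root = tempfile.mkdtemp()
os.makedirs(os.path.join(root, "usr", "src", "sys"))
open(os.path.join(root, "usr", "src", "sys", "main.c"), "w").close()

metrics = path_metrics(root)
print(metrics)
```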
Re: crash managing a large FSFS repository
Posted by kf...@collab.net.
Simon Spero <se...@unc.edu> writes:
> One approach to reducing the amount of memory needed would be to use a
> data structure that models directories, rather than complete paths.
> Each directory node should have its own lookup table; the keys can be
> just the name of the immediate child relative to this node.
> Intermediate nodes for path components that haven't been seen
> themselves should be marked as such; if the path is later explicitly
> encountered, the mark can be cleared (or vice versa).
>
> This approach requires space roughly proportional to the number of
> directories and files in the transaction, rather than total path
> length. For big, flat namespaces, this isn't much of a win, but it
> also isn't much worse; as the name space gets deeper, and closer to
> real source repositories, the win gets bigger. This approach also
> makes it faster to determine parent/child relationships.
This is how the Subversion repository itself is structured, actually.
The current interface of fetch_all_changes() is a result of the public
API it is supporting, namely, svn_fs_paths_changed(). We could
certainly make a new svn_fs_paths_changed2() that returns the
information in a different way, and adjust the internal code
accordingly (the old code would just become the obvious wrapper,
converting the tree structure to a flat hash).
We'd also want to write functions for accessing the tree structure,
for example:
svn_error_t *
svn_tree_has_path (svn_boolean_t *has_path,
                   svn_tree_t *tree,
                   const char *path);
Before we go down this road, though, we'd want to make absolutely sure
that the problem is the total paths length, that is, that the
assertion
> At the moment the code uses memory roughly proportional to the total
> lengths of all paths in the transactions.
is both true and the cause of our problems. I'm pretty sure it's
true, of course, it's the second half I'm not positive about :-).
Are you sure that path lengths are relevant to total memory usage, or
are they just lost in the noise?
Re: crash managing a large FSFS repository
Posted by Simon Spero <se...@unc.edu>.
Eric Gillespie wrote:
>That's exactly what it is. It was much, much worse until r11701 and r11706. However, fs_fs.c:fetch_all_changes still builds a giant hash in memory. I wasn't sure what to do about this, and so left it alone. I seem to recall asking for suggestions but not getting a response, but it's possible i overlooked it as i became busy elsewhere right afterwards.
>
>
I've been looking at scaling issues with fs_fs, but mostly looking at
repository size related issues. This issue is isolated to individual
transactions, so it's simpler to fix and test.
At the moment the code uses memory roughly proportional to the total
lengths of all paths in the transactions.
One approach to reducing the amount of memory needed would be to use a
data structure that models directories, rather than complete paths.
Each directory node should have its own lookup table; the keys can be
just the name of the immediate child relative to this node.
Intermediate nodes for path components that haven't been seen themselves
should be marked as such; if the path is later explicitly encountered,
the mark can be cleared (or vice versa).
This approach requires space roughly proportional to the number of
directories and files in the transaction, rather than total path
length. For big, flat namespaces, this isn't much of a win, but it also
isn't much worse; as the name space gets deeper, and closer to real
source repositories, the win gets bigger. This approach also makes it
faster to determine parent/child relationships.
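A minimal sketch of such a structure (a hypothetical Python stand-in; the real code would be C over APR hashes in fs_fs.c):

```python
# Change-tree keyed by path component rather than full path, so memory
# is proportional to the number of distinct nodes, not to the summed
# length of every path string.

class Node:
    __slots__ = ("children", "explicit")
    def __init__(self):
        self.children = {}     # component name -> Node
        self.explicit = False  # True once this path itself is recorded

class ChangeTree:
    def __init__(self):
        self.root = Node()

    def add(self, path):
        node = self.root
        for comp in path.strip("/").split("/"):
            # Intermediate nodes are created implicitly, unmarked.
            node = node.children.setdefault(comp, Node())
        node.explicit = True   # this path was explicitly encountered

    def has_path(self, path):
        node = self.root
        for comp in path.strip("/").split("/"):
            node = node.children.get(comp)
            if node is None:
                return False
        return node.explicit

t = ChangeTree()
t.add("/trunk/sys/kern/main.c")
print(t.has_path("/trunk/sys/kern/main.c"))  # True
print(t.has_path("/trunk/sys"))              # False: intermediate only
```

Walking from the root also makes parent/child queries a simple descent instead of repeated string-prefix comparisons.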
Simon
Re: crash managing a large FSFS repository
Posted by Eric Gillespie <ep...@pretzelnet.org>.
Josh Pieper <jj...@pobox.com> writes:
> the majority of the import was 19M, however, after the final revision
> file was moved into place, memory usage spiked to 44M. It could just
> be the recursive delete of the completed transaction directory.
That's exactly what it is. It was much, much worse until r11701
and r11706. However, fs_fs.c:fetch_all_changes still builds a
giant hash in memory. I wasn't sure what to do about this, and
so left it alone. I seem to recall asking for suggestions but
not getting a response, but it's possible i overlooked it as i
became busy elsewhere right afterwards.
--
Eric Gillespie <*> epg@pretzelnet.org