Posted to users@subversion.apache.org by Chase Phillips <sh...@ameth.org> on 2004/12/13 07:22:13 UTC

crash managing a large FSFS repository

As a follow-up to my thread "svn resource requirements with large
repositories?" (http://svn.haxx.se/users/archive-2004-11/0180.shtml), I
was recently able to try out the same procedure with revision 12289 from
Subversion's trunk.  With this revision I experience the same resource
usage issues that led me to raise this issue at first.

As a refresher, our project is developing a software image that runs on
top of NetBSD.  We need to store NetBSD in a revision-controlled
environment to track the changes we make to the operating system and
kernel.  I decided to create this new repository locally on the disk in
FSFS format (our current repo that stores application source is in BDB
format).

After importing the NetBSD source code and then copying it onto a branch,
a subsequent checkout leads to a core dump.  I've attached one of the
stack traces from my two attempts to this email (each attempt takes
upwards of 15 minutes before svn dumps core).  The second stack trace
differs from the first only in memory addresses of variables, though it
can be sent as well if needed.

The Subversion issue tracker holds 4 issues that come close to addressing
this but for one reason or another don't match up well enough to allow me
to assume they should be used as the target for this issue.

Issue 602 - http://subversion.tigris.org/issues/show_bug.cgi?id=602

  "import of large trees can bloat memory on client side"

  The last comment to this bug was made 2002/11 and it appears purposed to
  handle import efficiency (I've not had this resource issue doing an
  import of the source code).

Issue 1702 - http://subversion.tigris.org/issues/show_bug.cgi?id=1702

  "Initial checkout should scale well for large projects"

  This issue focuses on checking out a revision from a remote repository.
  In my scenario, I am checking out from a local repository.

Issue 2067 - http://subversion.tigris.org/issues/show_bug.cgi?id=2067

  "Perf issues with BDB and directories with a large number of items"

  This bug mentions similar problems with FSFS, though the bug summary
  refers strictly to BDB.  Is it meant to cover only issues with BDB-based
  repositories?

Issue 2137 - http://subversion.tigris.org/issues/show_bug.cgi?id=2137

  "svn update" extremely slow with large repository and FSFS"

  The common problems I experience are with checkouts and commits back to
  the local repository.  Again, I'm using a late revision of the trunk
  (r12289).

Should one of the above issues be used for tracking this problem?  Or
should I file a new issue, presuming I'm running into a bug in the source
and not some problem local to my system?  Any suggestions for what to try
next?

PS Initial work on this issue was done by Eric Gillespie.  He applied one
patch to the trunk at r11701 and another at r11706.  The appropriate dev
threads can be found at:

  "[PATCH] Fix svn_io_remove_dir pool usage", 2004/11/01
  http://svn.haxx.se/dev/archive-2004-11/0000.shtml

  "[PATCH] Fix fsfs finalization memory usage", 2004/11/01
  http://svn.haxx.se/dev/archive-2004-11/0001.shtml

Thanks,
Chase
CUWiN project

Re: crash managing a large FSFS repository

Posted by Max Bowsher <ma...@ukf.net>.
Chase Phillips wrote:
> On Mon, 13 Dec 2004, Chase Phillips wrote:
>
>> I will compile this information asap and place it in the issue tracker.
>
> I have filed issue 2167 to track this.
>
>  http://subversion.tigris.org/issues/show_bug.cgi?id=2167
>
> Am I meant not to be able to post additional information to this issue?
> If I had the ability, I would post links to the mailing list archives
> where this has been discussed.

You are supposed to request the "Observer" project role, and have it 
granted. (Though it appears that this isn't documented on the website!)

To save a round-trip of communication, I've directly added the "Observer" 
role to your account.

Max.



Re: crash managing a large FSFS repository

Posted by Chase Phillips <sh...@ameth.org>.
On Mon, 13 Dec 2004, Chase Phillips wrote:

> I will compile this information asap and place it in the issue tracker.

I have filed issue 2167 to track this.

  http://subversion.tigris.org/issues/show_bug.cgi?id=2167

Am I meant not to be able to post additional information to this issue?
If I had the ability, I would post links to the mailing list archives
where this has been discussed.

Thanks,
Chase


Re: crash managing a large FSFS repository

Posted by Chase Phillips <sh...@ameth.org>.
On Mon, 13 Dec 2004 kfogel@collab.net wrote:

> It still looks like an out-of-memory error ("abort_on_pool_failure" in
> the stack trace), hmmm.  Both your client and server were on the same
> box, and indeed in the same process, when you reproduced this, right?

Yes.  The commands I ran to initialize the repository were:

  sh$ export REPO=/u3/cphillip/dev/cuwin/subversion-os-dev/trunk-rev-12289/repo
  (cwd) n/a

  sh$ svnadmin create --fs-type fsfs repo
  (cwd) /u3/cphillip/dev/cuwin/subversion-os-dev/trunk-rev-12289

  sh$ svn import file://$REPO/branches/netbsd-start/
  (cwd) /u3/cphillip/dev/cuwin/subversion-os-dev/netbsd-start/src/

  sh$ svn cp file://$REPO/branches/netbsd-start/ \
      file://$REPO/branches/netbsd/
  (cwd) n/a

The command that resulted in the core dump was:

  sh$ svn co file://$REPO/ wd
  (cwd) /u3/cphillip/dev/cuwin/subversion-os-dev/trunk-rev-12289/

The output of 'df .' from
/u3/cphillip/dev/cuwin/subversion-os-dev/trunk-rev-12289/ is:

  Filesystem  512-blocks     Used     Avail Capacity  Mounted on
  /dev/wd1a    128220292 78734656  43074624    64%    /u3

The entire trunk-rev-12289/ directory and its filesystem children are on
/u3.

> I could try to make an educated guess from the stack trace, but it
> would be great if we could narrow this down to "server code", or
> "client code", or both.  (Even when they're in the same process,
> they're distinct bodies of code.)

Understood.  If my answers don't provide you enough information in a
particular area please ask me for clarification.  I'm happy to provide it.

> When you say "subsequent checkout", you mean a first-time checkout of
> the new branch, right?

Yes.  The first checkout of the whole repository (not just the new
branch) is the next activity after copying one branch to the other.

> What can you tell us about your hardware, memory, etc?  (Not because
> they're causing the bug in any sense, just to help us figure out what
> we need to reproduce it.)

The system is a Dell PowerEdge 400SC.  dmesg.boot states there are 639 MB
of total memory and 616 MB of available memory.  The processor is an Intel
Pentium 4 CPU (2.4GHz) with 512 KB of cache.

The operating system is a recent version of NetBSD that has been modified
to include bug fixes from the NetBSD source tree.  It is built using,
quote, a GENERIC kernel configuration file with minor patches:

  +uftdi0 at uhub? port ?         # David's travelling FTDI FT8U100AX
  +                               # serial adapter
  +ucom0  at uftdi0 portno 1

End quote.

uname -a reports:

  NetBSD cuw.ojctech.com 2.99.10 NetBSD 2.99.10 (GENERIC.cuw) #0: Sun Nov
  14 15:35:49 CST 2004 dyoung@cuw.ojctech.com:/u3/dyoung/pristine-nbsd/O/
  sys/arch/i386/compile/GENERIC.cuw i386

> Can we get our hands on your data?

Our desired initial work state is based on an import of the tarball
available at:

  http://che.ojctech.com/~dyoung/public/netbsd-start.tar.gz
  MD5: a1ca74ad688ac280433785ac8bdacb8b

My process is to run (from within netbsd-start/src/):

  sh$ svnadmin create --fs-type fsfs /local/repo
  sh$ svn import file:///local/repo/branches/netbsd-start/
  sh$ svn cp file:///local/repo/branches/netbsd-start \
        file:///local/repo/branches/netbsd
  sh$ svn checkout file:///local/repo/
  sh$ cd branches/netbsd
  sh$ find . -type d -name CVS -exec rm -rf \{\} \;
  sh$ svn commit

With different Subversion versions, I've experienced trouble with the
import, the checkout, and the commit.  With r12289, I did not have trouble
with the import and did not get the opportunity to try the commit.

> Does it reproduce with BDB instead of FSFS?

I haven't had an opportunity to try this with BDB so I don't currently
have an answer to that question.  I'm happy to just get the report in at
this point.

> > The Subversion issue tracker holds 4 issues that come close to addressing
> > this but for one reason or another don't match up well enough to allow me
> > to assume they should be used as the target for this issue.
> >
> > Issue 602 - http://subversion.tigris.org/issues/show_bug.cgi?id=602
> >
> >   "import of large trees can bloat memory on client side"
> >
> >   The last comment to this bug was made 2002/11 and it appears purposed to
> >   handle import efficiency (I've not had this resource issue doing an
> >   import of the source code).
>
> Agree that this is probably not your bug.
>
> > Issue 1702 - http://subversion.tigris.org/issues/show_bug.cgi?id=1702
> >
> >   "Initial checkout should scale well for large projects"
> >
> >   This issue focuses on checking out a revision from a remote repository.
> >   In my scenario, I am checking out from a local repository.
>
> Well, the real point is that 1702 is about time performance, not
> memory growth.  The local vs remote thing is not such a big
> difference.  Many problems that are first reported in remote
> operations are also present in local operations; it just means they
> are problems in the core libraries, not the transport layer libraries.
>
> > Issue 2067 - http://subversion.tigris.org/issues/show_bug.cgi?id=2067
> >
> >   "Perf issues with BDB and directories with a large number of items"
> >
> >   This bug mentions similar problems with FSFS, though the bug summary
> >   refers strictly to BDB.  Is it meant to cover only issues with BDB-based
> >   repositories?
>
> It looks like it's still mainly about BDB, and anyway is mainly about
> import and commit scalability, not specifically about checkouts.
>
> > Issue 2137 - http://subversion.tigris.org/issues/show_bug.cgi?id=2137
> >
> >   "svn update" extremely slow with large repository and FSFS"
> >
> >   The common problems I experience are with checkouts and commits back to
> >   the local repository.  Again, I'm using a late revision of the trunk
> >   (r12289).
>
> Your earlier description says "a subsequent checkout leads to a core
> dump".  Here you say you are also having problems with commits.  I
> suspect the checkout problems are unrelated to the commit ones, so we
> should have two separate threads for them.  In this reply, I've only
> been talking about the checkout problem (because until this moment,
> that's the only one I knew about anyway).

Okay.  IMO it's sufficient to note that I have experienced these same
resource usage issues not just during commits but during other operations
as well.  These problems occurred while attempting the process I described
earlier in this email against released versions of Subversion.  I also
refer to them in my previous email to users@subversion:

  http://svn.haxx.se/users/archive-2004-11/0180.shtml

At that time, a possible fix for these issues already existed (thanks to
Eric!), so I decided to prepare to test a new version instead of iterating
on an 'official' bug report.

In this report, which describes my attempt using r12289, I did not
experience a core dump during the import, only during the checkout.

> I think a new issue would be best.  As much reproduction information
> as you can give us (numbers of files, sizes of files, names of files)
> would be great.

I will compile this information asap and place it in the issue tracker.
I was hoping that the patches already in the trunk would have been enough
to remove the resource usage issues and am eager to get our team unblocked
on doing more serious NetBSD SCM.

> Thanks very much for the report -- and the care you took to make it so
> organized & comprehensible.

I've been on every side of bug reports (not giving enough info, giving
too much, not receiving enough, receiving too much).  Finding the sweet
spot for a report can be a real trick, and in some cases generating a
useful report takes so long that it's just not easy to do as a volunteer.
That makes the fast turnaround times the Subversion team pulls off very
impressive.

Thank you to the Subversion team and its volunteers for the thousands of
man-years and more that have gone into this project!

Chase


Re: crash managing a large FSFS repository

Posted by kf...@collab.net.
Chase Phillips <sh...@ameth.org> writes:
> As a follow-up to my thread "svn resource requirements with large
> repositories?" (http://svn.haxx.se/users/archive-2004-11/0180.shtml), I
> was recently able to try out the same procedure with revision 12289 from
> Subversion's trunk.  With this revision I experience the same resource
> usage issues that led me to raise this issue at first.
> 
> As a refresher, our project is developing a software image that runs on
> top of NetBSD.  We need to store NetBSD in a revision-controlled
> environment to track the changes we make to the operating system and
> kernel.  I decided to create this new repository locally on the disk in
> FSFS format (our current repo that stores application source is in BDB
> format).
> 
> After importing the NetBSD source code and then copying it onto a branch,
> a subsequent checkout leads to a core dump.  I've attached one of the
> stack traces from my two attempts to this email (each attempt takes
> upwards of 15 minutes before svn dumps core).  The second stack trace
> differs from the first only in memory addresses of variables, though it
> can be sent as well if needed.

It still looks like an out-of-memory error ("abort_on_pool_failure" in
the stack trace), hmmm.  Both your client and server were on the same
box, and indeed in the same process, when you reproduced this, right?
I could try to make an educated guess from the stack trace, but it
would be great if we could narrow this down to "server code", or
"client code", or both.  (Even when they're in the same process,
they're distinct bodies of code.)

When you say "subsequent checkout", you mean a first-time checkout of
the new branch, right?

What can you tell us about your hardware, memory, etc?  (Not because
they're causing the bug in any sense, just to help us figure out what
we need to reproduce it.)

Can we get our hands on your data?

Does it reproduce with BDB instead of FSFS?
 
> The Subversion issue tracker holds 4 issues that come close to addressing
> this but for one reason or another don't match up well enough to allow me
> to assume they should be used as the target for this issue.
> 
> Issue 602 - http://subversion.tigris.org/issues/show_bug.cgi?id=602
> 
>   "import of large trees can bloat memory on client side"
> 
>   The last comment to this bug was made 2002/11 and it appears purposed to
>   handle import efficiency (I've not had this resource issue doing an
>   import of the source code).

Agree that this is probably not your bug.

> Issue 1702 - http://subversion.tigris.org/issues/show_bug.cgi?id=1702
> 
>   "Initial checkout should scale well for large projects"
> 
>   This issue focuses on checking out a revision from a remote repository.
>   In my scenario, I am checking out from a local repository.

Well, the real point is that 1702 is about time performance, not
memory growth.  The local vs remote thing is not such a big
difference.  Many problems that are first reported in remote
operations are also present in local operations; it just means they
are problems in the core libraries, not the transport layer libraries.

> Issue 2067 - http://subversion.tigris.org/issues/show_bug.cgi?id=2067
> 
>   "Perf issues with BDB and directories with a large number of items"
> 
>   This bug mentions similar problems with FSFS, though the bug summary
>   refers strictly to BDB.  Is it meant to cover only issues with BDB-based
>   repositories?

It looks like it's still mainly about BDB, and anyway is mainly about
import and commit scalability, not specifically about checkouts.

> Issue 2137 - http://subversion.tigris.org/issues/show_bug.cgi?id=2137
> 
>   "svn update" extremely slow with large repository and FSFS"
> 
>   The common problems I experience are with checkouts and commits back to
>   the local repository.  Again, I'm using a late revision of the trunk
>   (r12289).

Your earlier description says "a subsequent checkout leads to a core
dump".  Here you say you are also having problems with commits.  I
suspect the checkout problems are unrelated to the commit ones, so we
should have two separate threads for them.  In this reply, I've only
been talking about the checkout problem (because until this moment,
that's the only one I knew about anyway).
 
> Should one of the above issues be used for tracking this problem?  Or
> should I file a new issue, presuming I'm running into a bug in the source
> and not some problem local to my system?  Any suggestions for what to try
> next?

I think a new issue would be best.  As much reproduction information
as you can give us (numbers of files, sizes of files, names of files)
would be great.

Thanks very much for the report -- and the care you took to make it so
organized & comprehensible.

-Karl


Re: crash managing a large FSFS repository

Posted by Simon Spero <se...@unc.edu>.
kfogel@collab.net wrote:

>middle metric was about), we can multiply 51,236 * 8 to get 409,888.
>Nice, about half a meg.  Of course, we need to add in the usual tree
>structure overhead, which is a whole hash-table per unique entry
>except for the leaf nodes.  I'm not really sure how to estimate that.
>It's more than log(51,236), but less than 51,236.  Plus we need a
>4-byte pointer per entry...
>
>  
>
Regular hashtable overhead is ~24 bytes per node.  The per-node lookup 
table needn't be a hash table; a binary search table may be better, 
especially if the input data is mostly sorted.  That could bring the 
overhead down to ~4 bytes per entry (a rough sketch follows at the end of 
this message).

>So, is it really looking so much better than 9 MB, in the long run?
>
>I don't mean to be reflexively skeptical, but at least this back-of-the-envelope estimate doesn't look promising.  Maybe I'm missing something, though?
>  
>
Reflexive skepticism is what keeps us alive :)  There may also be 
interactions with the pool allocator;  I do think that path length 
explains at least 25% of the memory growth.  I think it's time to run a 
profiler and see where the memory is going (but that spoils the fun).
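
To make the binary-search-table idea concrete, here is a minimal sketch
(illustrative only; none of these names exist in Subversion) of a
per-directory child table backed by a sorted array of names, where the
per-entry overhead is a single pointer:

   #include <stdlib.h>
   #include <string.h>

   /* A per-directory child table kept as a sorted array of names.
      Hypothetical structure, not part of any Subversion API. */
   typedef struct child_table_t {
     const char **names;   /* child names, sorted with strcmp() */
     size_t count;
   } child_table_t;

   static int
   cmp_name (const void *key, const void *member)
   {
     return strcmp ((const char *) key, *(const char *const *) member);
   }

   /* Return nonzero if NAME is a child in TABLE.  Overhead per entry is
      one pointer, versus ~24 bytes per node for a generic hash table. */
   static int
   child_table_contains (const child_table_t *table, const char *name)
   {
     return bsearch (name, table->names, table->count,
                     sizeof (const char *), cmp_name) != NULL;
   }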


Re: crash managing a large FSFS repository

Posted by kf...@collab.net.
Simon Spero <se...@unc.edu> writes:
> Noise sources :
>     Original report is for  a memory spike from 19Mb ->  44Mb, so
> results on the order of megabytes are possibly significant.
> Hashtable array size is  always a power of two; hash node size is ~20
> bytes.
>    First metric was to run  find . >/tmp/find-netbsd.   Total  size is
> 8,139,654 Bytes. (wc -c)
>     Number of entries:  193,716  (wc -l)
>     Average path length: ~42 bytes
>     Measurements were made relative to '.' ;  paths in memory would be
> relative to the root of the repository.  Adding /trunk/ to start of
> each path would use an extra 6 chars per entry (~1.1MB  in this case)

Okay, so 193,716 * (42 + 6) == 9,298,368 == appx 9 MB

That's bad, but is it responsible for a 25 MB jump in memory usage?
Is there other data associated with each of these paths, whose size is
proportional to path depth?

> Second metric is to strip out everything but the last name component:
> ( sed -e 's;^.*/;;' )
>     Total size: 1,654,088
>     Number of entries:  193,716  (wc -l)
>     Average size: 8 bytes

(What was this metric for?)

This gets us 193,716 * 8 == 1,549,728 == appx 1.5 MB, much better than
the earlier figure, though it looks less impressive next to the memory
jump we're seeing.  Hmmm.  I confess I'm not sure what to think here.

> Third metric: size of interned path-name components (sed ... | sort | uniq)
>     Total size: 577,432
>     Number of unique strings: 51,236

(This was for estimating if we used a tree-style data structure, right?)

So, taking an average size of 8 bytes (ah, maybe that's what your
middle metric was about), we can multiply 51,236 * 8 to get 409,888.
Nice, about half a meg.  Of course, we need to add in the usual tree
structure overhead, which is a whole hash-table per unique entry
except for the leaf nodes.  I'm not really sure how to estimate that.
It's more than log(51,236), but less than 51,236.  Plus we need a
4-byte pointer per entry...

So, is it really looking so much better than 9 MB, in the long run?

I don't mean to be reflexively skeptical, but at least this
back-of-the-envelope estimate doesn't look promising.  Maybe I'm
missing something, though?

-Karl


Re: crash managing a large FSFS repository

Posted by Simon Spero <se...@unc.edu>.
kfogel@collab.net wrote:

>>At the moment the code uses memory roughly proportional to the total
>>lengths of all paths in the transactions.
>>    
>>
>
>is both true and the cause of our problems.  I'm pretty sure it's
>true, of course, it's the second half I'm not positive about :-).
>Are you sure that path lengths are relevant to total memory usage, or
>are they just lost in the noise?
>  
>
   
A few rough estimators:

The original problem report was for problems importing the NetBSD source
tree, so I unpacked the files from the NetBSD 2.0 source ISO.  The
original report involves fewer files (~120,000), but we're just doing
big-O estimates here.

Noise sources:
    The original report is for a memory spike from 19 MB -> 44 MB, so
results on the order of megabytes are possibly significant.
    Hashtable array size is always a power of two; hash node size is
~20 bytes.

First metric was to run find . >/tmp/find-netbsd.
    Total size: 8,139,654 bytes (wc -c)
    Number of entries: 193,716 (wc -l)
    Average path length: ~42 bytes
    Measurements were made relative to '.'; paths in memory would be
relative to the root of the repository.  Adding /trunk/ to the start of
each path would use an extra 6 chars per entry (~1.1 MB in this case).

Second metric is to strip out everything but the last name component:
( sed -e 's;^.*/;;' )
    Total size: 1,654,088
    Number of entries:  193,716  (wc -l)
    Average size: 8 bytes

Third metric: size of interned path-name components (sed ... | sort | uniq)
    Total size: 577,432
    Number of unique strings: 51,236

Bonus  metric: estimate entropy using  bzip2 -9 -v
    Full pathnames : 0.606 bits/byte, 92.42% saved, 8139654 in, 616601 out
    Basenames: 1.221 bits/byte, 84.73% saved, 1654088 in, 252521 out.
    Interned: 2.807 bits/byte, 64.91% saved, 577432 in, 202629 out



Re: crash managing a large FSFS repository

Posted by kf...@collab.net.
Simon Spero <se...@unc.edu> writes:
> One approach to reducing the amount of memory needed would be to use a
> data structure that models directories, rather than complete paths.
> Each directory node should have its own lookup table; the keys can be
> just the name of the immediate child relative to this node.
> Intermediate nodes for path components that haven't been seen
> themselves should be marked as such;  if the path is later explicitly
> encountered, the mark can be cleared (or vice versa).
> 
> This approach requires space roughly proportional to the number of
> directories and files in the transaction, rather than total path
> length.  For big, flat namespaces, this isn't much of a win, but it
> also isn't much worse; as the name space gets deeper, and closer to
> real source repositories, the win gets bigger. This approach also
> makes it faster to determine parent/child relationships.

This is how the Subversion repository itself is structured, actually.

The current interface of fetch_all_changes() is a result of the public
API it is supporting, namely, svn_fs_paths_changed().  We could
certainly make a new svn_fs_paths_changed2() that returns the
information in a different way, and adjust the internal code
accordingly (the old code would just become the obvious wrapper,
converting the tree structure to a flat hash).  

We'd also want to write functions for accessing the tree structure,
for example:

   svn_error_t *
   svn_tree_has_path (svn_boolean_t *has_path,
                      svn_tree_t *tree,
                      const char *path);
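
For illustration, the walk underneath such a function might look like the
sketch below.  Everything here is hypothetical (svn_tree_t does not exist,
and the node layout is just one guess): one node per directory, each with
a child table keyed by the immediate child name.

   #include <apr_pools.h>
   #include <apr_hash.h>
   #include <apr_strings.h>
   #include "svn_types.h"

   /* Hypothetical node layout: one node per directory. */
   typedef struct svn_tree_t {
     apr_hash_t *children;   /* immediate child name -> struct svn_tree_t * */
   } svn_tree_t;

   /* Walk TREE one path component at a time; PATH is relative to TREE,
      and POOL is only scratch space for splitting PATH. */
   static svn_boolean_t
   tree_has_path (svn_tree_t *tree, const char *path, apr_pool_t *pool)
   {
     char *p = apr_pstrdup (pool, path);
     char *last;
     char *component = apr_strtok (p, "/", &last);
     svn_tree_t *node = tree;

     while (component && node)
       {
         node = apr_hash_get (node->children, component, APR_HASH_KEY_STRING);
         component = apr_strtok (NULL, "/", &last);
       }

     return node != NULL;
   }

svn_tree_has_path itself would just wrap such a helper and set *has_path.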

Before we go down this road, though, we'd want to make absolutely sure
that the problem is the total paths length, that is, that the
assertion

> At the moment the code uses memory roughly proportional to the total
> lengths of all paths in the transactions.

is both true and the cause of our problems.  I'm pretty sure it's
true, of course, it's the second half I'm not positive about :-).
Are you sure that path lengths are relevant to total memory usage, or
are they just lost in the noise?



Re: crash managing a large FSFS repository

Posted by Simon Spero <se...@unc.edu>.
Eric Gillespie wrote:

>That's exactly what it is.  It was much, much worse until r11701 and r11706.  However, fs_fs.c:fetch_all_changes still builds a giant hash in memory.  I wasn't sure what to do about this, and so left it alone.  I seem to recall asking for suggestions but not getting a response, but it's possible i overlooked it as i became busy elsewhere right afterwards.
>  
>
I've been looking at scaling issues with fs_fs, but mostly looking at 
repository size related issues.  This issue is isolated to individual 
transactions, so it's simpler to fix and test.

At the moment the code uses memory roughly proportional to the total 
lengths of all paths in the transactions.

One approach to reducing the amount of memory needed would be to use a 
data structure that models directories, rather than complete paths.   
Each directory node should have its own lookup table; the keys can be 
just the name of the immediate child relative to this node.  
Intermediate nodes for path components that haven't been seen themselves 
should be marked as such;  if the path is later explicitly encountered, 
the mark can be cleared (or vice versa).

This approach requires space roughly proportional to the number of  
directories and files in the transaction, rather than total path 
length.  For big, flat namespaces, this isn't much of a win, but it also 
isn't much worse; as the name space gets deeper, and closer to real 
source repositories, the win gets bigger. This approach also makes it 
faster to determine parent/child relationships.
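
As a rough sketch (illustrative only; none of these names are real fs_fs
code), the insert path for such a structure could look like this, with one
hash table per directory node and a flag recording whether the node's own
path was seen explicitly:

   #include <apr_pools.h>
   #include <apr_hash.h>
   #include <apr_strings.h>
   #include "svn_types.h"

   /* One node per directory; hypothetical, not the actual fs_fs types. */
   typedef struct dir_node_t {
     apr_hash_t *children;          /* immediate child name -> dir_node_t * */
     svn_boolean_t explicitly_seen; /* TRUE if this exact path was in the txn */
   } dir_node_t;

   static dir_node_t *
   node_create (apr_pool_t *pool)
   {
     dir_node_t *node = apr_pcalloc (pool, sizeof (*node));
     node->children = apr_hash_make (pool);
     return node;
   }

   /* Insert PATH (e.g. "branches/netbsd/sys"), creating unmarked
      intermediate nodes as needed and marking the final component. */
   static void
   tree_insert (dir_node_t *root, const char *path, apr_pool_t *pool)
   {
     char *p = apr_pstrdup (pool, path);
     char *last;
     char *component = apr_strtok (p, "/", &last);
     dir_node_t *node = root;

     while (component)
       {
         dir_node_t *child = apr_hash_get (node->children, component,
                                           APR_HASH_KEY_STRING);
         if (! child)
           {
             child = node_create (pool);
             apr_hash_set (node->children, apr_pstrdup (pool, component),
                           APR_HASH_KEY_STRING, child);
           }
         node = child;
         component = apr_strtok (NULL, "/", &last);
       }

     node->explicitly_seen = TRUE;
   }

Memory then grows with the number of distinct directories and files in the
transaction rather than with total path length, which is the point of the
exercise.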

Simon


Re: crash managing a large FSFS repository

Posted by Eric Gillespie <ep...@pretzelnet.org>.
Josh Pieper <jj...@pobox.com> writes:

> the majority of the import was 19M, however, after the final revision
> file was moved into place, memory usage spiked to 44M.  It could just
> be the recursive delete of the completed transaction directory.

That's exactly what it is.  It was much, much worse until r11701
and r11706.  However, fs_fs.c:fetch_all_changes still builds a
giant hash in memory.  I wasn't sure what to do about this, and
so left it alone.  I seem to recall asking for suggestions but
not getting a response, but it's possible I overlooked it as I
became busy elsewhere right afterwards.
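
For the curious, the essence of such fixes is the standard per-iteration
subpool pattern.  Below is a simplified, illustrative sketch (not the
actual patch) of a recursive delete whose peak memory stays flat no matter
how many entries the directory holds:

   #include <string.h>
   #include <apr_file_io.h>
   #include <apr_strings.h>
   #include "svn_pools.h"

   /* Remove PATH and everything under it, using a per-entry scratch pool
      so memory does not grow with the number of entries. */
   static apr_status_t
   remove_tree (const char *path, apr_pool_t *pool)
   {
     apr_dir_t *dir;
     apr_finfo_t finfo;
     apr_pool_t *iterpool = svn_pool_create (pool);
     apr_status_t status = apr_dir_open (&dir, path, pool);

     if (status)
       return status;

     while (apr_dir_read (&finfo, APR_FINFO_NAME | APR_FINFO_TYPE, dir)
            == APR_SUCCESS)
       {
         const char *child;

         if (strcmp (finfo.name, ".") == 0 || strcmp (finfo.name, "..") == 0)
           continue;

         svn_pool_clear (iterpool);   /* throw away the last entry's data */
         child = apr_psprintf (iterpool, "%s/%s", path, finfo.name);

         if (finfo.filetype == APR_DIR)
           remove_tree (child, iterpool);   /* removes CHILD itself too */
         else
           apr_file_remove (child, iterpool);
       }

     apr_dir_close (dir);
     svn_pool_destroy (iterpool);
     return apr_dir_remove (path, pool);
   }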

--  
Eric Gillespie <*> epg@pretzelnet.org


Re: crash managing a large FSFS repository

Posted by Josh Pieper <jj...@pobox.com>.
Chase Phillips wrote:
> As a follow-up to my thread "svn resource requirements with large
> repositories?" (http://svn.haxx.se/users/archive-2004-11/0180.shtml), I
> was recently able to try out the same procedure with revision 12289 from
> Subversion's trunk.  With this revision I experience the same resource
> usage issues that led me to raise this issue at first.
> 
> As a refresher, our project is developing a software image that runs on
> top of NetBSD.  We need to store NetBSD in a revision-controlled
> environment to track the changes we make to the operating system and
> kernel.  I decided to create this new repository locally on the disk in
> FSFS format (our current repo that stores application source is in BDB
> format).
> 
> After importing the NetBSD source code and then copying it onto a branch,
> a subsequent checkout leads to a core dump.  I've attached one of the
> stack traces from my two attempts to this email (each attempt takes
> upwards of 15 minutes before svn dumps core).  The second stack trace
> differs from the first only in memory addresses of variables, though it
> can be sent as well if needed.

I just attempted this experiment using a local FSFS repository and
what I think is the entirety of the NetBSD source tree.  Import was
successful after 42 minutes wall time.  Maximum memory usage during
the majority of the import was 19M; however, after the final revision
file was moved into place, memory usage spiked to 44M.  It could just
be the recursive delete of the completed transaction directory.  It
had around 120,000 files in it after finalization.

Checkout was successful too, after 43 minutes wall time.  Memory usage
climbed steadily throughout the entire checkout, peaking at 85M.  I
didn't think any component of checkout/update was supposed to use
memory linearly?

Just for the record, the NetBSD sources I used totaled 973 megabytes, with
95,900 files or so.  The checked out working copy totaled 3,336
megabytes in size.  The experiments were on my relatively unloaded
Athlon XP 1900 with a 5400 rpm 80G hard drive.

-Josh
