Posted to dev@subversion.apache.org by Jeremy Cook <je...@bccs.uib.no> on 2006/08/23 10:33:59 UTC

Problem with large files

Dear Developers,

I have been having problems with subversion and (very) large files:

jeremy@blue:/media/TeraStor/MediaDev$ svn import 2003-05-20-02.avi
file:///scratch/jeremy/svn/bigone -m "foba"
svn: Can't check path '2003-05-20-02.avi': Value too large for defined data type

After much googling around I found that this problem has been
mentioned in various guises, and boils down to a bug or feature in the
APR library that subversion uses (even though I am not using Apache).
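
For what it's worth, here is a minimal C sketch (my own illustration, not
Subversion or APR code) of what seems to be going on underneath: on a build
where off_t is only 32 bits, a plain stat() of a file larger than 2GB fails
with EOVERFLOW, whose error string is exactly the one svn printed above:

#include <stdio.h>
#include <string.h>
#include <errno.h>
#include <sys/stat.h>

int main(int argc, char **argv)
{
    struct stat st;

    if (argc < 2)
        return 1;

    /* Without large-file support (e.g. -D_FILE_OFFSET_BITS=64 on Linux),
       stat() on a file bigger than 2GB sets errno to EOVERFLOW, and
       strerror(EOVERFLOW) is "Value too large for defined data type". */
    if (stat(argv[1], &st) == -1)
        printf("Can't check path '%s': %s\n", argv[1], strerror(errno));
    else
        printf("size: %lld bytes\n", (long long)st.st_size);
    return 0;
}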

I applied the patches described here (by hand)
http://svn.haxx.se/users/archive-2005-06/0414.shtml
and the problem is now apparently solved for me:

jeremy@blue:/media/TeraStor/MediaDev$ ls -al 2003-05-20-02.avi
-rw-r--r--  1 jeremy family 2745350144 2006-08-22 16:00 2003-05-20-02.avi
jeremy@blue:/media/TeraStor/MediaDev$ svn import 2003-05-20-02.avi
file:///scratch/jeremy/svn/bigone -m "foba"
Adding  (bin)  2003-05-20-02.avi

Committed revision 1.

I am using svn, version 1.3.2 (r19776).

I am not all that familiar with the internals of subversion, nor do I
want to be, but I do want it to handle large files. So if someone can
reply and acknowledge whether this is a bug and whether there are any
foreseen problems with the fix that I found, I will be happy to make a
report to the issue tracker.

The reason (in case you were wondering) I am trying to use
subversion on my very large DV AVI files is so that I can have a
central repository. My entire library is much bigger than the file
system on a PC - I have it on a LaCie 1-terabyte disk attached to a
Linux file server. With svn, I plan to check out and lock video clips
and work on them on a Win XP system. When I am done I can then put my
edits, clips and home videos into the subversioned library for
safekeeping...

Thanks for your help,

Jeremy Cook


-- 
Jeremy.Cook@bccs.uib.no                        tlf: +47 55 58 40 65
Parallab                  Bergen Centre for Computational Science


Re: Problem with large files

Posted by Garrett Rooney <ro...@electricjellyfish.net>.
On 8/28/06, Daniel Berlin <db...@dberlin.org> wrote:
> >
> > So can we just make that code conditional on the repository being
> > svndiff1 and call it a day?  Or is there something more problematic
> > that I'm missing here?
>
> You could, but it's not easy and looks ugly.
> Because of how many layers of abstraction we have between the fs and
> this code, you end up propagating svndiff version numbers through
> large parts of the API where it doesn't make sense to do so.
>
> At least, this was my impression the last time i tried.

Ick, you're right, that is pretty nasty.  Lots of levels of stuff that
need to care...

-garrett


Re: Problem with large files

Posted by Daniel Berlin <db...@dberlin.org>.
> 
> So can we just make that code conditional on the repository being
> svndiff1 and call it a day?  Or is there something more problematic
> that I'm missing here?

You could, but it's not easy and looks ugly.
Because of how many layers of abstraction we have between the fs and
this code, you end up propagating svndiff version numbers through
large parts of the API where it doesn't make sense to do so.

At least, this was my impression the last time i tried.
>
> -garrett
>


Re: Problem with large files

Posted by Garrett Rooney <ro...@electricjellyfish.net>.
On 8/28/06, Daniel Berlin <db...@dberlin.org> wrote:
> >
> > Is it still reasonably fast for subsequent revisions?
>
> For subsequent revisions, we use xdelta, like we did before, so yes.
>
> Note that not using vdelta also speeds up later delta combination
> time, because the initial revision being vdelta can still cause the
> bad behavior in the delta combiner that we had before (it just does so
> to a lesser degree than *always* using vdelta).
>
> > Or does this
> > just improve the initial commit?
>
> It improves fulltext commit speeds, and speeds up later updates and
> checkouts by speeding up delta combination as a side-effect of not
> using vdelta.

Nice.

> > I understand why this would only be used for svndiff1 repositories,
> > but why would it be problematic with 1.3 clients?
>
> It isn't problematic, it's just that if you apply  the patch to 1.3
> and use that as the server, you will get no compression for the
> fulltext revisions in the repo.

So can we just make that code conditional on the repository being
svndiff1 and call it a day?  Or is there something more problematic
that I'm missing here?

-garrett


Re: Problem with large files

Posted by Daniel Berlin <db...@dberlin.org>.
> 
> Is it still reasonably fast for subsequent revisions?

For subsequent revisions, we use xdelta, like we did before, so yes.

Note that not using vdelta also speeds up later delta combination
time, because the initial revision being vdelta can still cause the
bad behavior in the delta combiner that we had before (it just does so
to a lesser degree than *always* using vdelta).

> Or does this
> just improve the initial commit?

It improves fulltext commit speeds, and speeds up later updates and
checkouts by speeding up delta combination as a side-effect of not
using vdelta.
>
> I understand why this would only be used for svndiff1 repositories,
> but why would it be problematic with 1.3 clients?

It isn't problematic, it's just that if you apply  the patch to 1.3
and use that as the server, you will get no compression for the
fulltext revisions in the repo.


Re: Problem with large files

Posted by Garrett Rooney <ro...@electricjellyfish.net>.
On 8/28/06, Daniel Berlin <db...@dberlin.org> wrote:
> On 8/28/06, Ben Collins-Sussman <su...@red-bean.com> wrote:
> > On 8/28/06, Daniel Berlin <db...@dberlin.org> wrote:
> >
> > > The report from the one person who has ever tried it with large files
> > > was that it sped up commit times from 45 minutes to less than 5 ;)
> >
> > I don't think that this is rocket science which requires testing.  :-)
> >  Of course, if you just insert new data directly into the stream
> > without trying to deltify it, it's gonna be way way faster.
>
> Uh, no, you don't understand.
> The patch only changes the behavior when deltaing against the empty
> stream (ie for the first file revision).
>
> Right now, we do that using vdelta, but only because it does target
> side deltas, and generates something roughly comparable to zlib, but
> using about 10x more cpu time.
>
> With the patch, svndiff1 using clients will still compress the file,
> but only using zlib.
>
> It turns out that this actually gets better results, size-wise, than
> vdelta did anyway.
>
> The patch does *not* change things so that the initial revision is
> uncompressed in any way, it just lets svndiff1 compression (zlib) do
> it instead of the much more expensive vdelta.

Is it still reasonably fast for subsequent revisions?  Or does this
just improve the initial commit?

I understand why this would only be used for svndiff1 repositories,
but why would it be problematic with 1.3 clients?

-garrett


Re: Problem with large files

Posted by Daniel Berlin <db...@dberlin.org>.
On 8/28/06, Ben Collins-Sussman <su...@red-bean.com> wrote:
> On 8/28/06, Daniel Berlin <db...@dberlin.org> wrote:
>
> > The report from the one person who has ever tried it with large files
> > was that it sped up commit times from 45 minutes to less than 5 ;)
>
> I don't think that this is rocket science which requires testing.  :-)
>  Of course, if you just insert new data directly into the stream
> without trying to deltify it, it's gonna be way way faster.

Uh, no, you don't understand.
The patch only changes the behavior when deltifying against the empty
stream (i.e. for the first file revision).

Right now, we do that using vdelta, but only because it does target-side
deltas and generates something roughly comparable to zlib, while
using about 10x more CPU time.

With the patch, svndiff1 using clients will still compress the file,
but only using zlib.

It turns out that this actually gets better results, size-wise, than
vdelta did anyway.

The patch does *not* change things so that the initial revision is
uncompressed in any way; it just lets svndiff1 compression (zlib) do
it instead of the much more expensive vdelta.
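
If it helps to see that concretely, here is a rough standalone sketch
(plain zlib, not Subversion's actual svndiff1 code path) of the kind of
single compression pass the new data gets instead of a vdelta self-delta:

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <zlib.h>

int main(void)
{
    /* Stand-in for the fulltext of a first file revision. */
    static unsigned char data[1 << 20];
    uLong src_len = sizeof(data);
    uLongf dst_len = compressBound(src_len);
    unsigned char *dst = malloc(dst_len);

    memset(data, 'x', sizeof(data));   /* trivially compressible input */

    /* One zlib pass over the data; svndiff1 zlib-compresses the window's
       new-data section anyway, which is why dropping vdelta still leaves
       the initial revision compressed. */
    if (dst == NULL || compress2(dst, &dst_len, data, src_len,
                                 Z_DEFAULT_COMPRESSION) != Z_OK)
        return 1;

    printf("%lu bytes -> %lu bytes\n",
           (unsigned long)src_len, (unsigned long)dst_len);
    free(dst);
    return 0;
}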

HTH,
Dan


Re: Problem with large files

Posted by "C. Michael Pilato" <cm...@collab.net>.
Mark Phippard wrote:
> It sounds like you are trying to solve more problems than just this one, 
> which is fine.  I just did not realize that people were trying to do that.

Oh.  Well, I'll hold back on my patch for the new 'svn worldwide-peace'
subcommand then.  (It's got some compatibility problems I haven't yet
worked out anyway.)

-- 
C. Michael Pilato <cm...@collab.net>
CollabNet   <>   www.collab.net   <>   Distributed Development On Demand


Re: Problem with large files

Posted by Mark Phippard <ma...@softlanding.com>.
"C. Michael Pilato" <cm...@collab.net> wrote on 08/28/2006 11:35:10 AM:

> We do client->server deltas using our pristine text-bases for a reason
> -- to reduce the cost of network transfer.  We know that on some
> networks, this matters (trust me -- many folks I know still use 56k
> dialup lines), and on some, it really doesn't.  But Subversion doesn't
> know which networks are which.  Only humans do.  And the type of network
> in use isn't a property of the file in question, or even of the working
> copy in question (hello, laptops moving from place to place) -- it's a
> condition as ever-changing as the weather in Chicago that has to be
> evaluated anew each time a commit occurs.

I guess I was thinking that we were only trying to solve the case where we 
have a largeish binary file and we know, because the user told us, that 
the file is not going to deltify well.  This seems to be a very common 
problem and the speedup would come from skipping that step and just 
sending the entire file.  This problem has nothing to do with network 
speed because the real problem is that we are spending a lot of time 
trying to deltify the file and the delta being produced is not 
significantly smaller than the original.

It sounds like you are trying to solve more problems than just this one, 
which is fine.  I just did not realize that people were trying to do that.

Mark


Re: Problem with large files

Posted by "C. Michael Pilato" <cm...@collab.net>.
Mark Phippard wrote:
> sussman@gmail.com wrote on 08/28/2006 09:57:43 AM:
> 
>> On 8/28/06, Daniel Berlin <db...@dberlin.org> wrote:
>>
>>> The report from the one person who has ever tried it with large files
>>> was that it sped up commit times from 45 minutes to less than 5 ;)
>> I don't think that this is rocket science which requires testing.  :-)
>>  Of course, if you just insert new data directly into the stream
>> without trying to deltify it, it's gonna be way way faster.

Well, careful now -- a user's experience isn't just "speed of inserting
new data directly into the stream" versus "speed of deltifying the data
and then sticking it into the stream".  The user's experience also
includes the cost of the respective wire transfers.

>> What's tricky here is coming up with a design.  Should the svn client
>> be magically deciding when to deltify or not, based on some heuristic?
>>  Or should it be controlled by the user via switches or
>> config-options?  We have a really long standing issue filed (like...
>> years old) about giving users the option to toggle compression on the
>> fly (something akin to 'cvs -zN').  Is that the interface we want?
> 
> I would like to see an svn: property used for this so that it is not something
> that has to be entered into configuration files (with the exception of
> auto-props to set the property in the first place).
> 
> Perhaps the property could be named something like "svn:delta" or 
> "svn:deltify" with values of "none" and "normal".  This would allow us to 
> introduce specialty algorithms later if we wanted to add custom algorithms 
> that worked better on certain file types.

We do client->server deltas using our pristine text-bases for a reason
-- to reduce the cost of network transfer.  We know that on some
networks, this matters (trust me -- many folks I know still use 56k
dialup lines), and on some, it really doesn't.  But Subversion doesn't
know which networks are which.  Only humans do.  And the type of network
in use isn't a property of the file in question, or even of the working
copy in question (hello, laptops moving from place to place) -- it's a
condition as ever-changing as the weather in Chicago that has to be
evaluated anew each time a commit occurs.

I think CVS has the right idea here, allowing folks to specify in both
personal configuration files and at the command-line what their
compression options should be.  To some extent, we expose the same sort
of thing in our runtime configuration area (http-compression).  But we
only let folks play with compression (only one of several things we
employ to try to reduce network usage), and we only let them do so via
the runtime configuration.

My current thinking here is that we should add (and honor) runtime and
real-time options for disabling text deltas on the wire as a whole.
Alternatively, maybe allow for disabling text deltas on binary files (as
determined by svn:mime-type).

-- 
C. Michael Pilato <cm...@collab.net>
CollabNet   <>   www.collab.net   <>   Distributed Development On Demand


Re: Problem with large files

Posted by Mark Phippard <ma...@softlanding.com>.
sussman@gmail.com wrote on 08/28/2006 09:57:43 AM:

> On 8/28/06, Daniel Berlin <db...@dberlin.org> wrote:
> 
> > The report from the one person who has ever tried it with large files
> > was that it sped up commit times from 45 minutes to less than 5 ;)
> 
> I don't think that this is rocket science which requires testing.  :-)
>  Of course, if you just insert new data directly into the stream
> without trying to deltify it, it's gonna be way way faster.
> 
> What's tricky here is coming up with a design.  Should the svn client
> be magically deciding when to deltify or not, based on some heuristic?
>  Or should it be controlled by the user via switches or
> config-options?  We have a really long standing issue filed (like...
> years old) about giving users the option to toggle compression on the
> fly (something akin to 'cvs -zN').  Is that the interface we want?

I would like to see an svn: property used for this so that it is not something
that has to be entered into configuration files (with the exception of
auto-props to set the property in the first place).

Perhaps the property could be named something like "svn:delta" or
"svn:deltify" with values of "none" and "normal".  This would allow us to
introduce specialty algorithms later that worked better on certain file
types.

Another option I could see would be to somehow base it on the mime type, 
but that would push virtually all of the configuration into the 
configuration files, which I do not think would be a good idea.

Is svn import doing deltification?  If so, and the new svndiff1 compression
is available, perhaps it should always just skip the deltification?

Mark 


Re: Problem with large files

Posted by Erik Huelsmann <eh...@gmail.com>.
On 8/28/06, Ben Collins-Sussman <su...@red-bean.com> wrote:
> On 8/28/06, Daniel Berlin <db...@dberlin.org> wrote:
>
> > The report from the one person who has ever tried it with large files
> > was that it sped up commit times from 45 minutes to less than 5 ;)
>
> I don't think that this is rocket science which requires testing.  :-)
>  Of course, if you just insert new data directly into the stream
> without trying to deltify it, it's gonna be way way faster.

As long as the pipe it has to go through can offer the capacity...

> What's tricky here is coming up with a design.  Should the svn client
> be magically deciding when to deltify or not, based on some heuristic?

Well, I can think of an algorithm, not a heuristic, but I think it
crosses every realistic layering boundary: just send data until the
pipe fills up; use the time it takes for the pipe to become available
again to do delta-encoding; then send the (potentially encoded) data as
soon as the pipe becomes available.

This way, you'll get (roughly) maximum throughput on the network,
assuming 'just sending' always beats encoding on speed.
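
Very roughly, something of this shape (hypothetical helper names, nothing
like the real RA layer, just a sketch of the idea):

#include <errno.h>
#include <stddef.h>
#include <unistd.h>

/* Placeholder for doing one slice of delta-encoding work; in a real
   version the encoded output would be queued for a later write. */
static void encode_a_bit(void)
{
}

/* Keep a non-blocking pipe/socket full of raw data, and only spend time
   encoding while the pipe is too busy to accept more. */
static int send_opportunistically(int fd, const char *buf, size_t len)
{
    size_t sent = 0;

    while (sent < len) {
        ssize_t n = write(fd, buf + sent, len - sent);

        if (n > 0)
            sent += (size_t)n;      /* pipe had room: raw bytes went out */
        else if (n == -1 && (errno == EAGAIN || errno == EWOULDBLOCK))
            encode_a_bit();         /* pipe full: encode while we wait */
        else if (n == -1 && errno == EINTR)
            continue;               /* interrupted: just retry */
        else
            return -1;              /* real error */
    }
    return 0;
}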

>  Or should it be controlled by the user via switches or
> config-options?  We have a really long standing issue filed (like...
> years old) about giving users the option to toggle compression on the
> fly (something akin to 'cvs -zN').  Is that the interface we want?

The open issue is for import only, but why shouldn't we offer the
interface? I don't use cvs -z3 with all repositories, so I use it as
a switch...

bye,

Erik.


Re: Problem with large files

Posted by Ben Collins-Sussman <su...@red-bean.com>.
On 8/28/06, Daniel Berlin <db...@dberlin.org> wrote:

> The report from the one person who has ever tried it with large files
> was that it sped up commit times from 45 minutes to less than 5 ;)

I don't think that this is rocket science which requires testing.  :-)
 Of course, if you just insert new data directly into the stream
without trying to deltify it, it's gonna be way way faster.

What's tricky here is coming up with a design.  Should the svn client
be magically deciding when to deltify or not, based on some heuristic?
 Or should it be controlled by the user via switches or
config-options?  We have a really long standing issue filed (like...
years old) about giving users the option to toggle compression on the
fly (something akin to 'cvs -zN').  Is that the interface we want?


Re: Problem with large files

Posted by listman <li...@burble.net>.
Sorry, ignore my previous email. You'd pretty much answered that
question in your email below.

I'd really like to see svn capable of running in a "fast mode". This
would have vdelta disabled and the MD5 checksums minimized.
Anybody else agree? Can we have a --fast mode added to the 1.5 list
of improvements?

thx.


On Aug 29, 2006, at 5:48 AM, Daniel Berlin wrote:

> On 8/29/06, listman <li...@burble.net> wrote:
>>
>> so vdelta and md5 are the culprits?
>>
>> anything we can do in the short to improve this?
> Try the patch i sent, which disables vdelta.
>
> Also, you could disable checksum verification on every single read
> from the repository, and add it to svnadmin verify, if you wanted.  I
> do this on gcc.gnu.org, trading immediate repo corruption detection
> for speed.
>
> (Why APR chose to reimplement MD5 instead of just taking advantage of
> OpenSSL, which is *installed* pretty much everywhere, and compilable
> everywhere else, and provides highly optimized versions, who knows)
>
> You will probably find vast resistance to disabling the repo read
> checksumming, but without doing that, you've pretty much lost the war
> for speed.
>
> This is true *even though* the client will detect the corruption
> anyway, because it is going to checksum it too and compare it to the
> repo checksum
>
> This is why you see multiple checksums happening.
>


Re: Problem with large files

Posted by listman <li...@burble.net>.
But, as Brandon saw, we are doing multiple MD5 checksums. Do we
really need all of them? Is there a way to be more efficient here?


On Aug 29, 2006, at 5:48 AM, Daniel Berlin wrote:

> On 8/29/06, listman <li...@burble.net> wrote:
>>
>> so vdelta and md5 are the culprits?
>>
>> anything we can do in the short to improve this?
> Try the patch i sent, which disables vdelta.
>
> Also, you could disable checksum verification on every single read
> from the repository, and add it to svnadmin verify, if you wanted.  I
> do this on gcc.gnu.org, trading immediate repo corruption detection
> for speed.
>
> (Why APR chose to reimplement MD5 instead of just taking advantage of
> OpenSSL, which is *installed* pretty much everywhere, and compilable
> everywhere else, and provides highly optimized versions, who knows)
>
> You will probably find vast resistance to disabling the repo read
> checksumming, but without doing that, you've pretty much lost the war
> for speed.
>
> This is true *even though* the client will detect the corruption
> anyway, because it is going to checksum it too and compare it to the
> repo checksum
>
> This is why you see multiple checksums happening.
>


Re: Problem with large files

Posted by Daniel Berlin <db...@dberlin.org>.
On 8/29/06, listman <li...@burble.net> wrote:
>
> so vdelta and md5 are the culprits?
>
> anything we can do in the short to improve this?
Try the patch I sent, which disables vdelta.

Also, you could disable checksum verification on every single read
from the repository, and add it to svnadmin verify, if you wanted.  I
do this on gcc.gnu.org, trading immediate repo corruption detection
for speed.

(Why APR chose to reimplement MD5 instead of just taking advantage of
OpenSSL, which is *installed* pretty much everywhere, and compilable
everywhere else, and provides highly optimized versions, who knows)

You will probably find vast resistance to disabling the repo read
checksumming, but without doing that, you've pretty much lost the war
for speed.

This is true *even though* the client will detect the corruption
anyway, because it is going to checksum it too and compare it to the
repo checksum.

This is why you see multiple checksums happening.
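
For a sense of scale, every one of those checksums is a complete MD5 pass
over all of the data; roughly this, using APR's own MD5 (just a sketch,
not the actual fs code):

#include <stdio.h>
#include <apr_md5.h>

/* One full walk over the data per checksum; do it N times and you pay
   for N complete passes, which is what shows up in the profiles. */
static void md5_of_buffer(const unsigned char *buf, apr_size_t len,
                          unsigned char digest[APR_MD5_DIGESTSIZE])
{
    apr_md5_ctx_t ctx;
    apr_size_t off = 0;

    apr_md5_init(&ctx);
    while (off < len) {
        apr_size_t chunk = (len - off < 8192) ? len - off : 8192;
        apr_md5_update(&ctx, buf + off, chunk);
        off += chunk;
    }
    apr_md5_final(digest, &ctx);
}

int main(void)
{
    static unsigned char data[1 << 20];
    unsigned char digest[APR_MD5_DIGESTSIZE];
    int i;

    for (i = 0; i < 3; i++)          /* three checksums, three full passes */
        md5_of_buffer(data, sizeof(data), digest);
    printf("%02x...\n", digest[0]);
    return 0;
}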


Re: Problem with large files

Posted by listman <li...@burble.net>.
So vdelta and MD5 are the culprits?

Anything we can do in the short term to improve this?

On Aug 28, 2006, at 10:38 PM, Brandon Ehle wrote:

> I resurrected the script and uploaded it here
>
> http://subversion.kicks-ass.org/genrepos.pl
>
> There are a couple of parameters you can specify for the type of  
> repository you want. I typically use the "medium" or "large" style  
> repositories for profiling.
>
> If you want to trash and make your machine box crash, try the  
> "everything" repository.  It creates a 5GB working copy and keeps  
> checking in binary file changes until your machine runs out of disk  
> space.
>
>
> Also, here is one of my recent KCachegrind profiles of a large  
> binary checkout operation over ra_local for 1.5.0-dev.
>
> http://subversion.kicks-ass.org/checkout.png
>
>
> There has been a bunch of improvements, since I last profiled this  
> (around version 0.28), but vdelta is still taking most of the time  
> with MD5 calculation a close second.
>
> Most if the vdelta time appears to be spent doing the two  
> comparisons and the branch in find_match_len().  Although this is  
> most likely related to the cache misses caused by find_match_len().
>
> It also appears that the MD5 sum for the checked out files are  
> calculated multiple times in multiple places during a ra_local  
> checkout and a large portion of the time is spent doing that.
>
>
> Brandon Ehle wrote:
>> I have a Perl script I made to profile this problem when I submitted
>> this problem to the bug tracker a couple years ago.
>> http://subversion.tigris.org/issues/show_bug.cgi?id=913
>> It will generate you an asset repository that simulates an artist
>> working on textures and generates as many revisions as you want.
>> I'll try to dig it back up and send it to you.
>> Daniel Berlin wrote:
>>> On 8/28/06, Garrett Rooney <ro...@electricjellyfish.net> wrote:
>>>> On 8/28/06, Ben Collins-Sussman <su...@red-bean.com> wrote:
>>>>> I suspect the problem here isn't about working copy efficiency,  
>>>>> it's
>>>>> the fact that we delta-encode every file that gets stuffed into  
>>>>> the
>>>>> repository, even if it's something as simple as committing a  
>>>>> file to a
>>>>> local file:/// repository.  That takes a lonnnnnnnng time on huge
>>>>> binary files.
>>>> That's why I was hoping Jeremy would hand some real world test  
>>>> cases
>>>> off to DannyB so he could make it Go Real Fast ;-)
>>>>
>>> I've emailed every person who, on users@ has complained in the  
>>> thread
>>> about large file binary performance, and begged them to give me  
>>> repos
>>> and files i can reproduce with, promising to fix their speed issues.
>>> I've even sent out the attached patch for testing
>>>
>>> I'm still waiting for an answer.  :-(
>>>
>>> They seem to want solutions without having to test them.
>>>
>>> The last time someone had a significant binary performance problem
>>> with large files, I sent them the attached (which disables  
>>> vdelta, and
>>> as such, is only really a good idea on svndiff1 using repos and
>>> networks with no 1.3 clients/servers).
>>> Basically, tell anyone who wants to try that they should  take this
>>> patch and create a new repo with a patched subversion, and dump/load
>>> the old repo into the new one, and give checkouts/etc a try.
>>>
>>> The report from the one person who has ever tried it with large  
>>> files
>>> was that it sped up commit times from 45 minutes to less than 5 ;)
>>>
>>>
>>> ------------------------------------------------------------------------
>>>
>>> Index: text_delta.c
>>> ===================================================================
>>> --- text_delta.c	(revision 20792)
>>> +++ text_delta.c	(working copy)
>>> @@ -148,7 +148,8 @@ compute_window(const char *data, apr_siz
>>>    build_baton.new_data = svn_stringbuf_create("", pool);
>>>
>>>    if (source_len == 0)
>>> -    svn_txdelta__vdelta(&build_baton, data, source_len, target_len, pool);
>>> +    svn_txdelta__insert_op(&build_baton, svn_txdelta_new, 0, source_len,
>>> +                           data, pool);
>>>    else
>>>      svn_txdelta__xdelta(&build_baton, data, source_len, target_len, pool);
>>>
>>>
>>> ------------------------------------------------------------------------


Re: Problem with large files

Posted by Brandon Ehle <az...@yahoo.com>.
I resurrected the script and uploaded it here

http://subversion.kicks-ass.org/genrepos.pl

There are a couple of parameters you can specify for the type of 
repository you want. I typically use the "medium" or "large" style 
repositories for profiling.

If you want to trash your machine and make the box crash, try the 
"everything" repository.  It creates a 5GB working copy and keeps 
checking in binary file changes until your machine runs out of disk space.


Also, here is one of my recent KCachegrind profiles of a large binary 
checkout operation over ra_local for 1.5.0-dev.

http://subversion.kicks-ass.org/checkout.png


There have been a bunch of improvements since I last profiled this 
(around version 0.28), but vdelta is still taking most of the time, with 
MD5 calculation a close second.

Most of the vdelta time appears to be spent doing the two comparisons 
and the branch in find_match_len(), although this is most likely 
related to the cache misses caused by find_match_len().

It also appears that the MD5 sums for the checked-out files are 
calculated multiple times in multiple places during a ra_local checkout, 
and a large portion of the time is spent doing that.


Brandon Ehle wrote:
> I have a Perl script I made to profile this problem when I submitted
> this problem to the bug tracker a couple years ago.
> 
> http://subversion.tigris.org/issues/show_bug.cgi?id=913
> 
> It will generate you an asset repository that simulates an artist
> working on textures and generates as many revisions as you want.
> 
> I'll try to dig it back up and send it to you.
> 
> 
> Daniel Berlin wrote:
>> On 8/28/06, Garrett Rooney <ro...@electricjellyfish.net> wrote:
>>> On 8/28/06, Ben Collins-Sussman <su...@red-bean.com> wrote:
>>>> I suspect the problem here isn't about working copy efficiency, it's
>>>> the fact that we delta-encode every file that gets stuffed into the
>>>> repository, even if it's something as simple as committing a file to a
>>>> local file:/// repository.  That takes a lonnnnnnnng time on huge
>>>> binary files.
>>> That's why I was hoping Jeremy would hand some real world test cases
>>> off to DannyB so he could make it Go Real Fast ;-)
>>>
>> I've emailed every person who, on users@ has complained in the thread
>> about large file binary performance, and begged them to give me repos
>> and files i can reproduce with, promising to fix their speed issues.
>> I've even sent out the attached patch for testing
>>
>> I'm still waiting for an answer.  :-(
>>
>> They seem to want solutions without having to test them.
>>
>> The last time someone had a significant binary performance problem
>> with large files, I sent them the attached (which disables vdelta, and
>> as such, is only really a good idea on svndiff1 using repos and
>> networks with no 1.3 clients/servers).
>> Basically, tell anyone who wants to try that they should  take this
>> patch and create a new repo with a patched subversion, and dump/load
>> the old repo into the new one, and give checkouts/etc a try.
>>
>> The report from the one person who has ever tried it with large files
>> was that it sped up commit times from 45 minutes to less than 5 ;)
>>
>>
>> ------------------------------------------------------------------------
>>
>> Index: text_delta.c
>> ===================================================================
>> --- text_delta.c	(revision 20792)
>> +++ text_delta.c	(working copy)
>> @@ -148,7 +148,8 @@ compute_window(const char *data, apr_siz
>>    build_baton.new_data = svn_stringbuf_create("", pool);
>>  
>>    if (source_len == 0)
>> -    svn_txdelta__vdelta(&build_baton, data, source_len, target_len, pool);
>> +    svn_txdelta__insert_op(&build_baton, svn_txdelta_new, 0, source_len,
>> +                           data, pool);
>>    else
>>      svn_txdelta__xdelta(&build_baton, data, source_len, target_len, pool);
>>    
>>
>>
>> ------------------------------------------------------------------------
>>



Re: Problem with large files

Posted by Brandon Ehle <az...@yahoo.com>.
I have a Perl script I made to profile this problem when I submitted
it to the bug tracker a couple of years ago.

http://subversion.tigris.org/issues/show_bug.cgi?id=913

It will generate you an asset repository that simulates an artist
working on textures and generates as many revisions as you want.

I'll try to dig it back up and send it to you.


Daniel Berlin wrote:
> On 8/28/06, Garrett Rooney <ro...@electricjellyfish.net> wrote:
>> On 8/28/06, Ben Collins-Sussman <su...@red-bean.com> wrote:
>> > I suspect the problem here isn't about working copy efficiency, it's
>> > the fact that we delta-encode every file that gets stuffed into the
>> > repository, even if it's something as simple as committing a file to a
>> > local file:/// repository.  That takes a lonnnnnnnng time on huge
>> > binary files.
>>
>> That's why I was hoping Jeremy would hand some real world test cases
>> off to DannyB so he could make it Go Real Fast ;-)
>>
> 
> I've emailed every person who, on users@ has complained in the thread
> about large file binary performance, and begged them to give me repos
> and files i can reproduce with, promising to fix their speed issues.
> I've even sent out the attached patch for testing
> 
> I'm still waiting for an answer.  :-(
> 
> They seem to want solutions without having to test them.
> 
> The last time someone had a significant binary performance problem
> with large files, I sent them the attached (which disables vdelta, and
> as such, is only really a good idea on svndiff1 using repos and
> networks with no 1.3 clients/servers).
> Basically, tell anyone who wants to try that they should  take this
> patch and create a new repo with a patched subversion, and dump/load
> the old repo into the new one, and give checkouts/etc a try.
> 
> The report from the one person who has ever tried it with large files
> was that it sped up commit times from 45 minutes to less than 5 ;)
> 
> 
> ------------------------------------------------------------------------
> 
> Index: text_delta.c
> ===================================================================
> --- text_delta.c	(revision 20792)
> +++ text_delta.c	(working copy)
> @@ -148,7 +148,8 @@ compute_window(const char *data, apr_siz
>    build_baton.new_data = svn_stringbuf_create("", pool);
>  
>    if (source_len == 0)
> -    svn_txdelta__vdelta(&build_baton, data, source_len, target_len, pool);
> +    svn_txdelta__insert_op(&build_baton, svn_txdelta_new, 0, source_len,
> +                           data, pool);
>    else
>      svn_txdelta__xdelta(&build_baton, data, source_len, target_len, pool);
>    
> 
> 
> ------------------------------------------------------------------------
> 


Re: Problem with large files

Posted by Daniel Berlin <db...@dberlin.org>.
On 8/28/06, Garrett Rooney <ro...@electricjellyfish.net> wrote:
> On 8/28/06, Ben Collins-Sussman <su...@red-bean.com> wrote:
> > I suspect the problem here isn't about working copy efficiency, it's
> > the fact that we delta-encode every file that gets stuffed into the
> > repository, even if it's something as simple as committing a file to a
> > local file:/// repository.  That takes a lonnnnnnnng time on huge
> > binary files.
>
> That's why I was hoping Jeremy would hand some real world test cases
> off to DannyB so he could make it Go Real Fast ;-)
>

I've emailed every person who has complained on users@ in the thread
about large binary file performance, and begged them to give me repos
and files I can reproduce with, promising to fix their speed issues.
I've even sent out the attached patch for testing.

I'm still waiting for an answer.  :-(

They seem to want solutions without having to test them.

The last time someone had a significant binary performance problem
with large files, I sent them the attached (which disables vdelta and,
as such, is only really a good idea on svndiff1-using repos and
networks with no 1.3 clients/servers).
Basically, tell anyone who wants to try it that they should take this
patch, create a new repo with a patched subversion, dump/load
the old repo into the new one, and give checkouts/etc. a try.

The report from the one person who has ever tried it with large files
was that it sped up commit times from 45 minutes to less than 5 ;)

Re: Problem with large files

Posted by Garrett Rooney <ro...@electricjellyfish.net>.
On 8/28/06, Ben Collins-Sussman <su...@red-bean.com> wrote:
> I suspect the problem here isn't about working copy efficiency, it's
> the fact that we delta-encode every file that gets stuffed into the
> repository, even if it's something as simple as committing a file to a
> local file:/// repository.  That takes a lonnnnnnnng time on huge
> binary files.

That's why I was hoping Jeremy would hand some real world test cases
off to DannyB so he could make it Go Real Fast ;-)

-garrett


Re: Problem with large files

Posted by Ben Collins-Sussman <su...@red-bean.com>.
I suspect the problem here isn't about working copy efficiency, it's
the fact that we delta-encode every file that gets stuffed into the
repository, even if it's something as simple as committing a file to a
local file:/// repository.  That takes a lonnnnnnnng time on huge
binary files.

On 8/28/06, Garrett Rooney <ro...@electricjellyfish.net> wrote:
> On 8/28/06, Jeremy Cook <je...@bccs.uib.no> wrote:
> > Thanks for the feedback, It wasn't all that obvious that I could solve
> > this by using a newer APR by googling and reading the FAQ. Large files
> > are mentioned in the FAQ, but not quite in a way that led me to a
> > solution.
>
> I'll take a look at the FAQ and see if I can find a place to mention
> the APR version issue.
>
> > Anyway I have now reinstalled using newer APR and APR-util and it does
> > indeed seem to be working with 2GB and 3GB files. However it all takes
> > so long using svn with these AVI files that I might find some other
> > way...
>
> For what it's worth, I believe there have been some changes in
> Subversion 1.4.x (not out yet, but real soon now) that may help this a
> bit.  There are a few parts of the working copy library that now do
> things in a streaming manner instead of using multiple passes over a
> given file (writing it to disk in between steps).  Not sure if that'll
> matter for your use case.
>
> Additionally, if you have a use case which is much faster in another
> system, I believe Dan Berlin recently mentioned on this list that he
> was curious about such things, and would be willing to try to make the
> necessary changes to Subversion to speed it up for those use cases, if
> he could get some test cases to work with.
>
> -garrett
>


Re: Problem with large files

Posted by Garrett Rooney <ro...@electricjellyfish.net>.
On 8/28/06, Jeremy Cook <je...@bccs.uib.no> wrote:
> Thanks for the feedback, It wasn't all that obvious that I could solve
> this by using a newer APR by googling and reading the FAQ. Large files
> are mentioned in the FAQ, but not quite in a way that led me to a
> solution.

I'll take a look at the FAQ and see if I can find a place to mention
the APR version issue.

> Anyway I have now reinstalled using newer APR and APR-util and it does
> indeed seem to be working with 2GB and 3GB files. However it all takes
> so long using svn with these AVI files that I might find some other
> way...

For what it's worth, I believe there have been some changes in
Subversion 1.4.x (not out yet, but real soon now) that may help this a
bit.  There are a few parts of the working copy library that now do
things in a streaming manner instead of using multiple passes over a
given file (writing it to disk in between steps).  Not sure if that'll
matter for your use case.

Additionally, if you have a use case which is much faster in another
system, I believe Dan Berlin recently mentioned on this list that he
was curious about such things, and would be willing to try to make the
necessary changes to Subversion to speed it up for those use cases, if
he could get some test cases to work with.

-garrett


Re: Problem with large files

Posted by Jeremy Cook <je...@bccs.uib.no>.
Thanks for the feedback. It wasn't all that obvious from googling and
reading the FAQ that I could solve this by using a newer APR. Large files
are mentioned in the FAQ, but not quite in a way that led me to a
solution.

Anyway, I have now reinstalled using a newer APR and APR-util, and it does
indeed seem to be working with 2GB and 3GB files. However, it all takes
so long using svn with these AVI files that I might find some other
way...

Thanks!

Jeremy

On 27/08/06, Garrett Rooney <ro...@electricjellyfish.net> wrote:
> On 8/27/06, Ben Collins-Sussman <su...@red-bean.com> wrote:
> >
> >
> >
> > On 8/23/06, Jeremy Cook <je...@bccs.uib.no> wrote:
> > >
> > > I am not all that familiar with the internals of subversion, nor do I
> > > want to be, but I do want it to handle large files. So if someone can
> > > reply and acknowledge whether this is a bug and whether there are any
> > > forseen problems with the fix that I found, I will be happy to make a
> > > report to the issue tracker.
> >
> >
> > I can't see why you'd need those special patches' to handle large files;  I
> > have no idea why the gentleman who posted those patches felt there was some
> > great complex problem to solve.
> >
> > All you need to do is recompile subversion against some version of apr and
> > apr-util > 1.0.  Subversion ships with apr 0.9 in its own tarball, but
> > that's only for legacy compatibility reasons.  If you rebuild against a
> > modern apr, large files should be handled just fine.  There's no need for
> > secret renegade patches.  :-)
>
> Note that there were bug fixes to APR that corrected some issues with
> large files.  Specifically, in APR 1.2.6 I fixed an issue that made
> lots of things with large files not work (the filePtr field in
> apr_file_t was only 32 bits long).  So you need at least APR 1.2.6.
> It's certainly possible that there are other problems lurking, but I'm
> not aware of any at the moment, and everything I tried with really big
> files worked once I corrected that issue.  With APR 0.9.x, this stuff
> is much more flaky, so I'd be surprised if it worked at all.
>
> -garrett
>


-- 
Jeremy.Cook@bccs.uib.no                        tlf: +47 55 58 40 65
Parallab                  Bergen Centre for Computational Science


Re: Problem with large files

Posted by Garrett Rooney <ro...@electricjellyfish.net>.
On 8/27/06, Ben Collins-Sussman <su...@red-bean.com> wrote:
>
>
>
> On 8/23/06, Jeremy Cook <je...@bccs.uib.no> wrote:
> >
> > I am not all that familiar with the internals of subversion, nor do I
> > want to be, but I do want it to handle large files. So if someone can
> > reply and acknowledge whether this is a bug and whether there are any
> > forseen problems with the fix that I found, I will be happy to make a
> > report to the issue tracker.
>
>
> I can't see why you'd need those special patches' to handle large files;  I
> have no idea why the gentleman who posted those patches felt there was some
> great complex problem to solve.
>
> All you need to do is recompile subversion against some version of apr and
> apr-util > 1.0.  Subversion ships with apr 0.9 in its own tarball, but
> that's only for legacy compatibility reasons.  If you rebuild against a
> modern apr, large files should be handled just fine.  There's no need for
> secret renegade patches.  :-)

Note that there were bug fixes to APR that corrected some issues with
large files.  Specifically, in APR 1.2.6 I fixed an issue that made
lots of things with large files not work (the filePtr field in
apr_file_t was only 32 bits long).  So you need at least APR 1.2.6.
It's certainly possible that there are other problems lurking, but I'm
not aware of any at the moment, and everything I tried with really big
files worked once I corrected that issue.  With APR 0.9.x, this stuff
is much more flaky, so I'd be surprised if it worked at all.
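
If you want to double-check which APR you're compiling and running
against, something along these lines works (just a sketch using
apr_version.h):

#include <stdio.h>
#include <apr_version.h>

/* Refuse to build against anything older than the 1.2.6 mentioned above. */
#if APR_MAJOR_VERSION < 1 \
    || (APR_MAJOR_VERSION == 1 && APR_MINOR_VERSION < 2) \
    || (APR_MAJOR_VERSION == 1 && APR_MINOR_VERSION == 2 \
        && APR_PATCH_VERSION < 6)
#error "Need APR 1.2.6 or newer for reliable large-file behavior"
#endif

int main(void)
{
    apr_version_t v;

    apr_version(&v);   /* the APR we are actually linked against */
    printf("compiled against APR %s, running against %d.%d.%d\n",
           APR_VERSION_STRING, v.major, v.minor, v.patch);
    return 0;
}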

-garrett


Re: Problem with large files

Posted by Ben Collins-Sussman <su...@red-bean.com>.
On 8/23/06, Jeremy Cook <je...@bccs.uib.no> wrote:
>
>
> I am not all that familiar with the internals of subversion, nor do I
> want to be, but I do want it to handle large files. So if someone can
> reply and acknowledge whether this is a bug and whether there are any
> forseen problems with the fix that I found, I will be happy to make a
> report to the issue tracker.


I can't see why you'd need those special patches to handle large files; I
have no idea why the gentleman who posted those patches felt there was some
great complex problem to solve.

All you need to do is recompile subversion against some version of apr and
apr-util > 1.0.  Subversion ships with apr 0.9 in its own tarball, but
that's only for legacy compatibility reasons.  If you rebuild against a
modern apr, large files should be handled just fine.  There's no need for
secret renegade patches.  :-)