Posted to dev@subversion.apache.org by Joe Schaefer <jo...@yahoo.com> on 2012/01/04 19:33:43 UTC

eliminating sequential bottlenecks for huge commit and merge ops

As Daniel mentioned to me on irc, subversion doesn't use threading
internally, so things like client side commit processing and merge
operations are done one file at a time, IIUC.

Over in the openoffice podling we have a use-case for a 9GB working copy
that regularly sees churn on each file in the tree.  Commit and merge
operations for such changes take upwards of 20min, and I'm wondering
if there's anything we could do here to reduce that processing time
by 2x or better by threading the per-dir processing somehow.

Thoughts?
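To make the proposal concrete, here is a minimal sketch (in Python, purely illustrative — the Subversion client is C and, as noted downthread, single-threaded) of what "threading the per-dir processing" could look like: each directory becomes an independent job on a thread pool, with the per-directory work stubbed out.

```python
from concurrent.futures import ThreadPoolExecutor

def process_dir(dirpath, files):
    # Stand-in for the per-directory commit/merge work (computing
    # deltas, transmitting them, etc.); here it just reports how
    # many files it handled.
    return dirpath, len(files)

def process_tree_parallel(tree, max_workers=4):
    # tree maps directory path -> list of file names.  Each directory
    # is submitted as an independent job, so directories are processed
    # concurrently instead of strictly one after another.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = [pool.submit(process_dir, d, fs) for d, fs in tree.items()]
        return dict(f.result() for f in futures)
```

The catch, of course, is that real commit processing shares state across directories (the working-copy database, the single network connection to the server), which is why this is nothing like a drop-in change for the actual client.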


Re: eliminating sequential bottlenecks for huge commit and merge ops

Posted by Stefan Fuhrmann <eq...@web.de>.
On 05.01.2012 01:35, Daniel Shahaf wrote:
> Greg Stein wrote on Wed, Jan 04, 2012 at 19:08:41 -0500:
>> (*) I'd be interested in what they are doing. Is this a use case we might
>> see elsewhere? Or is this something silly they are doing, that would not be
>> seen elsewhere?
> They use the Apache CMS[1] to manage their site[2,3].  Some changes (for
> example r1227057[4]) cause a site-wide rebuild; for example, the
> aforementioned r1227057 yielded r801653[5], which touches 19754 files
> (according to 'svn log -qv' on svn.eu).

That is nothing that should stress SVN too much.
After all, I did merges of that size with 1.4.

The critical part of the CMS use-case would be
the number of directories within the working copy
and the number of files touched (that need to be
read to check for actual changes).
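A rough sketch of the kind of cheap pre-check being alluded to here (function and parameter names are mine, not Subversion's): compare the recorded size and mtime against the on-disk values first, and only fall back to reading file contents on a mismatch.

```python
import os

def maybe_changed(path, recorded_size, recorded_mtime):
    # Cheap first pass: if the size and mtime on disk still match the
    # values recorded when the file was last known clean, treat the
    # file as unchanged and skip reading its contents entirely.  Only
    # a mismatch would force the expensive full-content comparison.
    st = os.stat(path)
    return st.st_size != recorded_size or int(st.st_mtime) != int(recorded_mtime)
```

With hundreds of thousands of files, doing a stat() per file instead of a full read is the difference between touching metadata and streaming the whole 9.7GB through the client.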

As a data point: KDE /trunk@r1mio is 9.7GB
(>300k files in <100k folders). With a tuned svn://
implementation, I get an export within 17 seconds(!),
even with 4 exports running in parallel [*]. More
clients will ultimately saturate my machine at 6GB/sec
of sustained svn:// traffic over localhost.

So, we should really seek to optimize the working
copy implementation to support large working copies.

-- Stefan^2.

[*] The client side is an "empty" export command
that will simply accept any data coming from the server.

Re: eliminating sequential bottlenecks for huge commit and merge ops

Posted by Daniel Shahaf <d....@daniel.shahaf.name>.
Greg Stein wrote on Wed, Jan 04, 2012 at 19:08:41 -0500:
> (*) I'd be interested in what they are doing. Is this a use case we might
> see elsewhere? Or is this something silly they are doing, that would not be
> seen elsewhere?

They use the Apache CMS[1] to manage their site[2,3].  Some changes (for
example r1227057[4]) cause a site-wide rebuild; for example, the
aforementioned r1227057 yielded r801653[5], which touches 19754 files
(according to 'svn log -qv' on svn.eu).

Daniel


[1]
http://www.apache.org/dev/cms

[2]
https://svn.apache.org/repos/asf/incubator/ooo/ooo-site/
https://svn.apache.org/repos/infra/websites/staging/ooo-site/trunk/content/
http://mail-archives.apache.org/mod_mbox/incubator-ooo-commits/

[3]
http://ooo-site.apache.org/
http://www.openoffice.org/

[4]
http://mail-archives.apache.org/mod_mbox/incubator-ooo-commits/201201.mbox/%3C20120104070627.D8D0E2388900@eris.apache.org%3E
http://svn.apache.org/viewvc?rev=1227057&view=rev

[5]
of /repos/infra/websites/staging/ooo-site. No viewvc link.
http://mail-archives.apache.org/mod_mbox/incubator-ooo-commits/201201.mbox/%3C20120104075804.D06CA23889EB@eris.apache.org%3E

Re: eliminating sequential bottlenecks for huge commit and merge ops

Posted by Peter Samuelson <pe...@p12n.org>.
[Joe Schaefer]
> They're using the ASF CMS to manage the www.openoffice.org website,
> which is full of 10 years' worth of accumulated legacy spanning 50 or
> so different natural languages.  The CMS is "too slow" during commits
> to template files or such which change the generated HTML content of
> virtually every file on the site.

Wait, so, the CMS regenerates stuff when you change a template.  Fair
enough, that's just caching.  But you are saying it also _stores_ the
generated content in a _repository_?  To me that would be, as they say,
"the real wtf".

I don't think Subversion was ever meant to be a backend for
archive.org, storing snapshots of generated websites.  Obviously it can
be used for that, but IMO much better to just generate the website,
store the html files on a flat filesystem, and regenerate from history
if you need history.

> 1) convert the templating system to use SSI, which would eliminate
> most of the sledgehammer-type commits.

Of course SSI is an option, but just storing the generated files on a
normal filesystem instead of in a repository would (I suspect) be a
much less disruptive way to stop putting 9 GB of generated files in svn
at a time.

Re: eliminating sequential bottlenecks for huge commit and merge ops

Posted by Joe Schaefer <jo...@yahoo.com>.
> From: Greg Stein <gs...@gmail.com>
> To: Joe Schaefer <jo...@yahoo.com>
> Cc: dev@subversion.apache.org
> Sent: Wednesday, January 4, 2012 7:08 PM
> Subject: Re: eliminating sequential bottlenecks for huge commit and merge ops
>
> On Jan 4, 2012 1:34 PM, "Joe Schaefer" <jo...@yahoo.com> wrote:
>>
>> As Daniel mentioned to me on irc, subversion doesn't use threading
>> internally, so things like client side commit processing and merge
>> operations are done one file at a time, IIUC.
>>
>> Over in the openoffice podling we have a use-case for a 9GB working copy
>> that regularly sees churn on each file in the tree.  Commit and merge
>> operations for such changes take upwards of 20min, and I'm wondering
>> if there's anything we could do here to reduce that processing time
>> by 2x or better by threading the per-dir processing somehow.
>>
>> Thoughts?
>
> We've always taken the position that the amount of effort or size of
> delta/data is proportional to the size of the change. If you change all
> of a 9GB working copy, then you should expect svn to take a good chunk
> of time and space.
>
> IOW, stop doing that :-)
>
> That said, even if we were desirous of "fixing" this(*), we would have
> a hard time doing it using threads. The Subversion client is pretty solidly
> single-threaded. We take no precautions for operation in a multi-threaded app.
>
> Cheers,
> -g
>
> (*) I'd be interested in what they are doing. Is this a use case we might see
> elsewhere? Or is this something silly they are doing, that would not be seen
> elsewhere?

They're using the ASF CMS to manage the www.openoffice.org website, which is full
of 10 years' worth of accumulated legacy spanning 50 or so different natural languages.
The CMS is "too slow" during commits to template files or such which change
the generated HTML content of virtually every file on the site.

There are two ways I could mitigate this issue with them if Subversion isn't interested
in working on this use case:

1) Convert the templating system to use SSI, which would eliminate most of the
sledgehammer-type commits.

2) Deploy the CMS on an SSD-backed system.

FWIW, (2) is scheduled to happen in the not-too-distant future anyway, and I personally
don't want to encourage the use of SSI with the CMS even for oddball situations
like this one.


Re: eliminating sequential bottlenecks for huge commit and merge ops

Posted by Greg Stein <gs...@gmail.com>.
On Jan 4, 2012 1:34 PM, "Joe Schaefer" <jo...@yahoo.com> wrote:
>
> As Daniel mentioned to me on irc, subversion doesn't use threading
> internally, so things like client side commit processing and merge
> operations are done one file at a time, IIUC.
>
> Over in the openoffice podling we have a use-case for a 9GB working copy
> that regularly sees churn on each file in the tree.  Commit and merge
> operations for such changes take upwards of 20min, and I'm wondering
> if there's anything we could do here to reduce that processing time
> by 2x or better by threading the per-dir processing somehow.
>
> Thoughts?

We've always taken the position that the amount of effort or size of
delta/data is proportional to the size of the change. If you change all of
a 9GB working copy, then you should expect svn to take a good chunk of time
and space.

IOW, stop doing that :-)

That said, even if we were desirous of "fixing" this(*), we would have a
hard time doing it using threads. The Subversion client is pretty solidly
single-threaded. We take no precautions for operation in a multi-threaded
app.

Cheers,
-g

(*) I'd be interested in what they are doing. Is this a use case we might
see elsewhere? Or is this something silly they are doing, that would not be
seen elsewhere?