Posted to users@subversion.apache.org by Eric Peers <er...@missinglinktools.com> on 2010/07/06 18:17:31 UTC

Performance of svn+ssh vs. file for multiple files

Howdy,

I've got a program that needs to checkout specific files at specific 
versions. In this particular case a branch does not make sense. I have 
found that the performance of svn+ssh in this case is very bad.

I run the rough equivalent of:
svn update -r 2 file1 file2 file3 file4 file5
svn update -r 3 file6 file7 file8 file9 file10

overall I have about 100 such files, and 2 svn update calls. I've 
accomplished this with an xargs frontend to svn so as to not overrun the 
cmdline.
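
For reference, an xargs front end of the kind described above might look roughly like the sketch below. The function name update_at_rev, the batch size of 50, and the SVN override variable are illustrative assumptions, not the actual script. xargs may split the list into several svn invocations, and each listed target is still updated separately.

```shell
# update_at_rev REV LISTFILE
# Run "svn update -r REV" over the paths listed (one per line) in LISTFILE.
# xargs splits the list into batches so no command line gets too long.
# Assumption: paths contain no whitespace or quotes (xargs would split them).
# SVN is a hypothetical override, e.g. SVN=echo for a dry run.
update_at_rev() {
    rev=$1
    list=$2
    xargs -n 50 "${SVN:-svn}" update -r "$rev" < "$list"
}
```

Usage would be something like: update_at_rev 2 files-at-r2.txt, then update_at_rev 3 files-at-r3.txt (both list file names are made up here).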

If I use file:/// as the protocol, it runs in 3 seconds.
If I use svn+ssh:// as the protocol, it takes 53 seconds.
If I run a plain svn update -r 3 with no file arguments, it takes about 2 seconds.
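
As a rough reading of these numbers, and assuming the whole gap between the file:/// and svn+ssh runs is per-target connection setup (an assumption, not something measured here), the overhead comes to about half a second per file:

```shell
# Figures taken from the timings above; the target count is approximate.
ssh_total_s=53
file_total_s=3
targets=100

# Assumption: the entire difference is per-target session setup cost.
per_target_ms=$(( (ssh_total_s - file_total_s) * 1000 / targets ))
echo "estimated svn+ssh overhead per target: ${per_target_ms} ms"
```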

I wrote a program directly against the svn API that accepts the file 
lists, authenticates a single time, and then calls svn_client_update3. 
This is still very slow: around 53s.

I suspect the problem is that each individual file is called out, 
locked, etc. Is there a way to batch these locks together, or to make 
the ssh channel/RA session be reused, to improve performance?

Perusing the source code suggests that svn_client__update_internal will 
be called for each element in my paths. Since an individual file 
lock/svn directory write does not seem to be overly performance costly, 
I suspect the problem is in the svn_client__open_ra_session_internal + 
svn_ra_do_update2 calls from svn_client__update_internal? Is the 
subversion code opening a new ra_session for each of these files at the 
expense of an ssh+svnserve on the remote end? Is there a way to force a 
single RA session across all the files at an API level without writing 
my own svn_client__update_internal?

thoughts here?

thanks!
    --eric

Re: Performance of svn+ssh vs. file for multiple files

Posted by Nico Kadel-Garcia <nk...@gmail.com>.

On Tue, Jul 6, 2010 at 2:17 PM, Eric Peers <er...@missinglinktools.com> wrote:
> Howdy,
>
> I've got a program that needs to checkout specific files at specific
> versions. In this particular case a branch does not make sense. I have found
> that the performance of svn+ssh in this case is very bad.
>
> I run the rough equivalent of:
> svn update -r 2 file1 file2 file3 file4 file5
> svn update -r 3 file6 file7 file8 file9 file10

Ouch. Why not build a branch directory with a bunch of "svn:externals"
settings for this?

> overall I have about 100 such files, and 2 svn update calls. I've
> accomplished this with an xargs frontend to svn so as to not overrun the
> cmdline.
>
> if I use file:/// as a protocol, it runs in 3 seconds.
> if I use svn+ssh:/// as a protocol, it takes 53 seconds.
> if I run an svn update -r 3 with no files, it takes about 2s.

Individual SSH connections do have a very significant startup cost on
each session. It's pretty much built into the protocol.

Can you do the checkout via svn or file, then "svn switch" to the
svn+ssh repository to push any changes?

You might check that your upstream SSH server has valid reverse DNS
for the IP addresses of your connecting clients: that's an old problem
involving DNS timeouts. If you can't get that, consider modifying
your upstream SSH server to not use reverse DNS lookups. This is not
configurable from OpenSSH config files unless things have changed
lately, and requires running the SSH daemon with 'sshd -u0'.

> I wrote a direct svn api-program to accept the file lists, make the
> authentication a single time, and then call svn_update3. This still runs
> super slow. around 53s still.
>
> I suspect the problem is because each individual file is called out, locked,
> etc. Is there a way to batch these locks together or improve performance?
> Cause the ssh channel/ra session to be reused?
>
> Perusing the source code suggests that svn_client__update_internal will be
> called for each element in my paths. Since an individual file lock/svn
> directory write does not seem to be overly performance costly, I suspect the
> problem is in the svn_client__open_ra_session_internal + svn_ra_do_update2
> calls from svn_client__update_internal? Is the subversion code opening a new
> ra_session for each of these files at the expense of an ssh+svnserve on the
> remote end? Is there a way to force a single RA session across all the files
> at an API level without writing my own svn_client__update_internal?
>
> thoughts here?
>
> thanks!
>   --eric

You're doing something complicated: slow performance is... not surprising.

Re: Performance of svn+ssh vs. file for multiple files

Posted by Stefan Sperling <st...@elego.de>.

On Tue, Jul 06, 2010 at 02:46:11PM -0600, Eric Peers wrote:
> svnserve -d --listen-port 8000
> ssh epeers@localhost -L 3690:localhost:8000
> ...then run my svn update commands...

You could also try ssh connection multiplexing.

From the ssh(1) man page:

     -M	     Places the ssh client into ``master'' mode for connection
	     sharing.  Multiple -M options places ssh into ``master'' mode
	     with confirmation required before slave connections are accepted.
	     Refer to the description of ControlMaster in ssh_config(5) for
	     details.
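
A minimal sketch of how a master connection could be combined with Subversion, assuming an OpenSSH client and that svn picks up the SVN_SSH environment variable for svn+ssh:// tunnels. The socket path, the function names, and the SSH override variable are hypothetical. Each svn+ssh session would still start its own svnserve on the server; only the connection setup and authentication are shared, and how much of the 53s that recovers was not measured in this thread.

```shell
# Hypothetical socket path; any private, writable location works.
MUX_SOCK="${MUX_SOCK:-$HOME/.ssh/svn-mux.sock}"
SSH="${SSH:-ssh}"   # overridable, e.g. SSH=echo for a dry run

# Open a background master connection to host $1:
# -M master mode, -S control socket, -N no remote command,
# -f go to the background once authentication is done.
mux_start() { "$SSH" -M -S "$MUX_SOCK" -f -N "$1"; }

# Ask the master for host $1 to exit, which also removes the socket.
mux_stop() { "$SSH" -S "$MUX_SOCK" -O exit "$1"; }

# Print a value for SVN_SSH that routes svn+ssh:// tunnels through the
# existing master instead of opening a fresh connection each time.
mux_svn_ssh() { echo "ssh -S $MUX_SOCK"; }
```

Usage would be along the lines of: mux_start epeers@localhost, then SVN_SSH="$(mux_svn_ssh)" svn update -r 3 file6 file7, then mux_stop epeers@localhost.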

Stefan

Re: Performance of svn+ssh vs. file for multiple files

Posted by Daniel Shahaf <d....@daniel.shahaf.name>.

Eric Peers wrote on Thu, 8 Jul 2010 at 20:13 -0000:
> @Les: tags/branches don't work in this case because an edit on this can change
> the tag/branch and because the merge of local edits + local version changes
> becomes cumbersome (if not impossible) on the svn switch to the branch/tag.
> Perforce style tagging does work, svn does not since it's a branch
> unfortunately. We did consider this option.
> 

Perhaps a 'tag' composed of file externals?

> one last q though: is the vtable->reparent the equivalent of a C++/Object
> Oriented Virtual Method? Where any given session (ssh, svnserve, file, http)
> can override as necessary?

Close, but not exactly.  A vtable could be compared to an abstract class; 
each of the four RA libraries (ra_local for file://, ra_svn for svn*://, 
ra_neon and ra_serf for http://) implements that vtable.  (We also use 
vtables in other places, e.g., between libsvn_fs and libsvn_fs_*.)

In practice, every library defines all vtable members, so it's 
s/can override/must define/.
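
The scheme-to-library mapping above can be pictured as a simple lookup. This is only an illustration of the dispatch idea in shell, with a made-up function name; the real mechanism is a C struct of function pointers that each RA library fills in.

```shell
# ra_library_for_url URL
# Print which RA library would serve URL, following the mapping above.
# Hypothetical helper for illustration; not part of Subversion.
ra_library_for_url() {
    case $1 in
        file://*)            echo ra_local ;;
        svn://*|svn+*://*)   echo ra_svn ;;
        http://*|https://*)  echo "ra_neon or ra_serf" ;;
        *)                   echo unknown; return 1 ;;
    esac
}
```

In this picture every library supplies its own reparent, so a call on an svn+ssh:// session ends up in ra_svn's implementation.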

Re: Performance of svn+ssh vs. file for multiple files

Posted by Stefan Sperling <st...@elego.de>.

On Thu, Jul 08, 2010 at 11:13:12AM -0600, Eric Peers wrote:
> I ended up writing a routine that uses the reparent call as
> previously discussed with a minor rework of the
> svn_client__update_internal to accommodate this. Overall time to
> update: 3.09s rather than 53s originally by reusing the session.
> Once I polish up the code, I'll post a copy on my blog if anybody
> wants it.

I'm stumbling into this conversation, but why put it on a blog?
What about submitting a patch instead?
http://subversion.apache.org/docs/community-guide/general.html#patches

Stefan

Re: Performance of svn+ssh vs. file for multiple files

Posted by Eric Peers <er...@missinglinktools.com>.

On 07/08/2010 02:27 AM, Daniel Shahaf wrote:
> Eric Peers wrote on Wed, 7 Jul 2010 at 04:44 -0000:
>    
>> Incidentally, where is [svn_ra_reparent] defined??? I can't
>> find it in the libraries, but I see it in libsvn_ra-1.so but not in the
>> libsvn_ra directory...
>>      
> % grep svn_ra_reparent tags
> svn_ra_reparent ./subversion/include/svn_ra.h   /^svn_ra_reparent(svn_ra_session_t *ra_session,$/;"     p       signature:(svn_ra_session_t *ra_session, const char *url, apr_pool_t *pool)
> svn_ra_reparent ./subversion/libsvn_ra/ra_loader.c      /^svn_error_t *svn_ra_reparent(svn_ra_session_t *session,$/;"   f       signature:(svn_ra_session_t *session, const char *url, apr_pool_t *pool)
>
>
> To save you some work: you'll see it calls vtable->reparent().  So the
> functions you *really* want are svn_ra__*_reparent():
>
> % grep _reparent tags | awk '{print $1,$2}' | grep -v tools/server-side/
> ra_svn_reparent ./subversion/libsvn_ra_svn/client.c
> svn_log__reparent ./subversion/include/private/svn_log.h
> svn_log__reparent ./subversion/libsvn_subr/log.c
> svn_ra_local__reparent ./subversion/libsvn_ra_local/ra_plugin.c
> svn_ra_neon__reparent ./subversion/libsvn_ra_neon/session.c
> svn_ra_reparent ./subversion/include/svn_ra.h
> svn_ra_reparent ./subversion/libsvn_ra/ra_loader.c
> svn_ra_serf__reparent ./subversion/libsvn_ra_serf/serf.c
> test_reparent ./subversion/bindings/swig/ruby/test/test_ra.rb
>
>    

I ended up writing a routine that uses the reparent call as previously 
discussed with a minor rework of the svn_client__update_internal to 
accommodate this. Overall time to update: 3.09s rather than 53s 
originally by reusing the session. Once I polish up the code, I'll post 
a copy on my blog if anybody wants it.

This is well within acceptable ranges for performance in my mind.

@Les: tags/branches don't work in this case because an edit on this can 
change the tag/branch and because the merge of local edits + local 
version changes becomes cumbersome (if not impossible) on the svn switch 
to the branch/tag. Perforce-style tagging does work; svn tagging does 
not, since a tag is unfortunately just a branch. We did consider this option.

Thanks Daniel!

one last q though: is the vtable->reparent the equivalent of a 
C++/Object Oriented Virtual Method? Where any given session (ssh, 
svnserve, file, http) can override as necessary?

    --Eric

Re: Performance of svn+ssh vs. file for multiple files

Posted by Daniel Shahaf <d....@daniel.shahaf.name>.

Eric Peers wrote on Wed, 7 Jul 2010 at 04:44 -0000:
> Incidentally, where is [svn_ra_reparent] defined??? I can't
> find it in the libraries, but I see it in libsvn_ra-1.so but not in the
> libsvn_ra directory...

% grep svn_ra_reparent tags
svn_ra_reparent ./subversion/include/svn_ra.h   /^svn_ra_reparent(svn_ra_session_t *ra_session,$/;"     p       signature:(svn_ra_session_t *ra_session, const char *url, apr_pool_t *pool)
svn_ra_reparent ./subversion/libsvn_ra/ra_loader.c      /^svn_error_t *svn_ra_reparent(svn_ra_session_t *session,$/;"   f       signature:(svn_ra_session_t *session, const char *url, apr_pool_t *pool)


To save you some work: you'll see it calls vtable->reparent().  So the
functions you *really* want are svn_ra__*_reparent():

% grep _reparent tags | awk '{print $1,$2}' | grep -v tools/server-side/
ra_svn_reparent ./subversion/libsvn_ra_svn/client.c
svn_log__reparent ./subversion/include/private/svn_log.h
svn_log__reparent ./subversion/libsvn_subr/log.c
svn_ra_local__reparent ./subversion/libsvn_ra_local/ra_plugin.c
svn_ra_neon__reparent ./subversion/libsvn_ra_neon/session.c
svn_ra_reparent ./subversion/include/svn_ra.h
svn_ra_reparent ./subversion/libsvn_ra/ra_loader.c
svn_ra_serf__reparent ./subversion/libsvn_ra_serf/serf.c
test_reparent ./subversion/bindings/swig/ruby/test/test_ra.rb

Re: Performance of svn+ssh vs. file for multiple files

Posted by Les Mikesell <le...@gmail.com>.

On 7/6/2010 8:44 PM, Eric Peers wrote:
>
> I do need to update specific files - I basically need to replicate
> user's workspaces for per-file-per-version (think continuous build
> automation). Some files are included, others are not.

I'm not quite sure why you wouldn't maintain the versions you want for 
builds together in a branch where something like hudson would be able to 
do your build automation without extra contortions, but if that doesn't 
come naturally, why not copy the workspace you want to duplicate to a 
tag (making sure there are no local changes that haven't been committed 
first)?  Then a switch to the tag should give you what you need in one 
shot, not to mention making it a reproducible state knowing only the tag 
name.
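
A rough sketch of that suggestion, assuming the stock svn client. The function name, the log message, and the SVN override variable are made up for illustration, and the "svn status -q" check is only an approximation of "no local changes".

```shell
# tag_workspace TAG_URL
# Copy the current (possibly mixed-revision) working copy to TAG_URL,
# refusing if "svn status -q" reports anything.
# SVN is a hypothetical override used for dry runs and testing.
tag_workspace() {
    tag_url=$1
    if [ -n "$("${SVN:-svn}" status -q)" ]; then
        echo "uncommitted local changes; commit or revert first" >&2
        return 1
    fi
    "${SVN:-svn}" copy . "$tag_url" -m "Snapshot of working copy for build"
}
```

The build side would then run svn switch (or a fresh checkout) against the same tag URL.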

-- 
   Les Mikesell
    lesmikesell@gmail.com

Re: Performance of svn+ssh vs. file for multiple files

Posted by Eric Peers <er...@missinglinktools.com>.

On 07/06/2010 06:16 PM, Bert Huijben wrote:
>
>    
>> -----Original Message-----
>> From: Eric Peers [mailto:eric@missinglinktools.com]
>> Sent: dinsdag 6 juli 2010 22:46
>> To: Daniel Shahaf
>> Cc: users@subversion.apache.org
>> Subject: Re: Performance of svn+ssh vs. file for multiple files
>>
>> Good suggestion Daniel. While this does markedly improve performance, it
>> does so at the expense of changing the underlying protocol.
>> Unfortunately, I'm not at liberty to change the underlying protocol - I
>> have customers that define the protocol, I don't. So my "program" needs
>> to access their repos using their protocols.
>>
>> But the results:
>> ssh port forwarding to an active svnserve takes about 2.5s.
>> pure svnserve takes roughly 2s
>>
>> svnserve -d --listen-port 8000
>> ssh epeers@localhost -L 3690:localhost:8000
>> ...then run my svn update commands...
>>
>>      --eric
>>
>> On 07/06/2010 12:52 PM, Daniel Shahaf wrote:
>>      
>>> Have you tried using SSH port forwarding instead of svn+ssh://?
>>>
>>> Daniel
>>> (perhaps one of the other devs will address the points you made; I'm
>>> myself not familiar with that part of the code)
>>>
>>> Eric Peers wrote on Tue, 6 Jul 2010 at 21:17 -0000:
>>>
>>>> Howdy,
>>>>
>>>> I've got a program that needs to checkout specific files at specific versions.
>>>> In this particular case a branch does not make sense. I have found that the
>>>> performance of svn+ssh in this case is very bad.
>>>>
>>>> I run the rough equivalent of:
>>>> svn update -r 2 file1 file2 file3 file4 file5
>>>> svn update -r 3 file6 file7 file8 file9 file10
>>>>
>>>> overall I have about 100 such files, and 2 svn update calls. I've accomplished
>>>> this with an xargs frontend to svn so as to not overrun the cmdline.
>>>>
> What you see is that updating a simple tree takes approximately the same
> amount of time: most of the time is spent on creating the connection and
> exchanging information on what to update.
>
> But in case of a recursive update below a root, all information is
> transferred at once. The 'svn update TARGET1 TARGET2.... TARGETn' performs a
> separate tree update for every individual selected target.
>
>
> To speed things up it would be interesting to know why you really need to
> update the specific files here, or can you switch to updating per directory?
> Do you need to skip a specific file?
>
> If you update specific files there are some kinds of changes you will never
> receive (e.g. New files or property changes on the directory), and you won't
> get a completely single revision working copy. (Which would actually give
> the best svn update performance as the client can just tell the repository:
> I have this tree at revision N; do you have changes for me?)
>
> 	Bert
>
>    
I agree: I see that a separate tree update happens for each individual 
target in my debug efforts. It appears the session setup times are the 
dominating factor.

I do need to update specific files - I basically need to replicate 
user's workspaces for per-file-per-version (think continuous build 
automation). Some files are included, others are not.

Is svn_ra_reparent the way to go here, on an established connection with 
a custom update routine? Incidentally, where is this routine defined? 
I can't find its definition: I see it in libsvn_ra-1.so, but not 
in the libsvn_ra directory...

    --Eric

RE: Performance of svn+ssh vs. file for multiple files

Posted by Bert Huijben <be...@qqmail.nl>.

> -----Original Message-----
> From: Eric Peers [mailto:eric@missinglinktools.com]
> Sent: dinsdag 6 juli 2010 22:46
> To: Daniel Shahaf
> Cc: users@subversion.apache.org
> Subject: Re: Performance of svn+ssh vs. file for multiple files
> 
> Good suggestion Daniel. While this does markedly improve performance, it
> does so at the expense of changing the underlying protocol.
> Unfortunately, I'm not at liberty to change the underlying protocol - I
> have customers that define the protocol, I don't. So my "program" needs
> to access their repos using their protocols.
> 
> But the results:
> ssh port forwarding to an active svnserve takes about 2.5s.
> pure svnserve takes roughly 2s
> 
> svnserve -d --listen-port 8000
> ssh epeers@localhost -L 3690:localhost:8000
> ...then run my svn update commands...
> 
>     --eric
> 
> On 07/06/2010 12:52 PM, Daniel Shahaf wrote:
> > Have you tried using SSH port forwarding instead of svn+ssh://?
> >
> > Daniel
> > (perhaps one of the other devs will address the points you made; I'm
> > myself not familiar with that part of the code)
> >
> > Eric Peers wrote on Tue, 6 Jul 2010 at 21:17 -0000:
> >
> >> Howdy,
> >>
> >> I've got a program that needs to checkout specific files at specific versions.
> >> In this particular case a branch does not make sense. I have found that the
> >> performance of svn+ssh in this case is very bad.
> >>
> >> I run the rough equivalent of:
> >> svn update -r 2 file1 file2 file3 file4 file5
> >> svn update -r 3 file6 file7 file8 file9 file10
> >>
> >> overall I have about 100 such files, and 2 svn update calls. I've accomplished
> >> this with an xargs frontend to svn so as to not overrun the cmdline.

What you see is that updating a simple tree takes approximately the same
amount of time: most of the time is spent on creating the connection and
exchanging information on what to update.

But in case of a recursive update below a root, all information is
transferred at once. The 'svn update TARGET1 TARGET2.... TARGETn' performs a
separate tree update for every individual selected target.

To speed things up it would be interesting to know why you really need to
update the specific files here, or can you switch to updating per directory?
Do you need to skip a specific file?

If you update specific files there are some kinds of changes you will never
receive (e.g. new files or property changes on the directory), and you won't
get a completely single revision working copy. (Which would actually give
the best svn update performance as the client can just tell the repository:
I have this tree at revision N; do you have changes for me?)
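
One way to check whether a working copy is in that single-revision state is the svnversion tool, which prints a range such as 4123:4168 for a mixed-revision working copy. A minimal sketch; the function name and the SVNVERSION override variable are hypothetical, and other svnversion outputs (modified, switched, or not a working copy) are not treated specially here.

```shell
# is_single_revision [WC_PATH]
# Return 0 if svnversion reports one revision for WC_PATH (default "."),
# 1 if it reports a range (mixed revisions), 2 if svnversion itself fails.
# SVNVERSION is a hypothetical override used for testing.
is_single_revision() {
    ver=$("${SVNVERSION:-svnversion}" "${1:-.}") || return 2
    case $ver in
        *:*) return 1 ;;   # e.g. 4123:4168, a mixed-revision working copy
        *)   return 0 ;;
    esac
}
```

A workspace assembled with per-file "svn update -r N" calls, as in this thread, would normally report a range.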

	Bert

Re: Performance of svn+ssh vs. file for multiple files

Posted by Eric Peers <er...@missinglinktools.com>.

Good suggestion Daniel. While this does markedly improve performance, it 
does so at the expense of changing the underlying protocol. 
Unfortunately, I'm not at liberty to change the underlying protocol - I 
have customers that define the protocol, I don't. So my "program" needs 
to access their repos using their protocols.

But the results:
ssh port forwarding to an active svnserve takes about 2.5s.
pure svnserve takes roughly 2s

svnserve -d --listen-port 8000
ssh epeers@localhost -L 3690:localhost:8000
...then run my svn update commands...

    --eric

On 07/06/2010 12:52 PM, Daniel Shahaf wrote:
> Have you tried using SSH port forwarding instead of svn+ssh://?
>
> Daniel
> (perhaps one of the other devs will address the points you made; I'm
> myself not familiar with that part of the code)
>
> Eric Peers wrote on Tue, 6 Jul 2010 at 21:17 -0000:
>    
>> Howdy,
>>
>> I've got a program that needs to checkout specific files at specific versions.
>> In this particular case a branch does not make sense. I have found that the
>> performance of svn+ssh in this case is very bad.
>>
>> I run the rough equivalent of:
>> svn update -r 2 file1 file2 file3 file4 file5
>> svn update -r 3 file6 file7 file8 file9 file10
>>
>> overall I have about 100 such files, and 2 svn update calls. I've accomplished
>> this with an xargs frontend to svn so as to not overrun the cmdline.
>>
>> if I use file:/// as a protocol, it runs in 3 seconds.
>> if I use svn+ssh:/// as a protocol, it takes 53 seconds.
>> if I run an svn update -r 3 with no files, it takes about 2s.
>>
>> I wrote a direct svn api-program to accept the file lists, make the
>> authentication a single time, and then call svn_update3. This still runs super
>> slow. around 53s still.
>>
>> I suspect the problem is because each individual file is called out, locked,
>> etc. Is there a way to batch these locks together or improve performance?
>> Cause the ssh channel/ra session to be reused?
>>
>> Perusing the source code suggests that svn_client__update_internal will be
>> called for each element in my paths. Since an individual file lock/svn
>> directory write does not seem to be overly performance costly, I suspect the
>> problem is in the svn_client__open_ra_session_internal + svn_ra_do_update2
>> calls from svn_client__update_internal? Is the subversion code opening a new
>> ra_session for each of these files at the expense of an ssh+svnserve on the
>> remote end? Is there a way to force a single RA session across all the files
>> at an API level without writing my own svn_client__update_internal?
>>
>> thoughts here?
>>
>> thanks!
>>     --eric

Re: Performance of svn+ssh vs. file for multiple files

Posted by Daniel Shahaf <d....@daniel.shahaf.name>.

Have you tried using SSH port forwarding instead of svn+ssh://?

Daniel
(perhaps one of the other devs will address the points you made; I'm 
myself not familiar with that part of the code)

Eric Peers wrote on Tue, 6 Jul 2010 at 21:17 -0000:
> Howdy,
> 
> I've got a program that needs to checkout specific files at specific versions.
> In this particular case a branch does not make sense. I have found that the
> performance of svn+ssh in this case is very bad.
> 
> I run the rough equivalent of:
> svn update -r 2 file1 file2 file3 file4 file5
> svn update -r 3 file6 file7 file8 file9 file10
> 
> overall I have about 100 such files, and 2 svn update calls. I've accomplished
> this with an xargs frontend to svn so as to not overrun the cmdline.
> 
> if I use file:/// as a protocol, it runs in 3 seconds.
> if I use svn+ssh:/// as a protocol, it takes 53 seconds.
> if I run an svn update -r 3 with no files, it takes about 2s.
> 
> I wrote a direct svn api-program to accept the file lists, make the
> authentication a single time, and then call svn_update3. This still runs super
> slow. around 53s still.
> 
> I suspect the problem is because each individual file is called out, locked,
> etc. Is there a way to batch these locks together or improve performance?
> Cause the ssh channel/ra session to be reused?
> 
> Perusing the source code suggests that svn_client__update_internal will be
> called for each element in my paths. Since an individual file lock/svn
> directory write does not seem to be overly performance costly, I suspect the
> problem is in the svn_client__open_ra_session_internal + svn_ra_do_update2
> calls from svn_client__update_internal? Is the subversion code opening a new
> ra_session for each of these files at the expense of an ssh+svnserve on the
> remote end? Is there a way to force a single RA session across all the files
> at an API level without writing my own svn_client__update_internal?
> 
> thoughts here?
> 
> thanks!
>    --eric