Posted to dev@subversion.apache.org by Stefan Küng <to...@gmail.com> on 2011/02/27 09:39:42 UTC

functions that would help TSVN

Hi,

When I was in London, Hyrum asked me to assemble a list with 
issues/tasks/functions that would help TSVN. So here it is:


new proplist function (svn_client_proplist4) where I can specify a 
filter/search string. Right now, the proplist function returns 
all properties it finds (with depth infinity, also for all paths). While 
this is a lot faster now than in 1.6, it still takes a few seconds for 
average-sized working copies. Analyzing the bottlenecks, I found that 
most of the time is spent allocating memory for all the returned data.
With a filter string I could specify just a property name and get back only 
those properties/paths that have that specific property set, which 
would reduce the memory allocations a *lot* and therefore be much 
faster. Best would be if I could specify not just one property to 
search for but several; that way I could ask for e.g. all "bugtraq:" 
properties that we (and several other svn clients) use for integrating 
with issue trackers.
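
To make the idea concrete, a possible signature might look like this (the 
name svn_client_proplist4 and the filter_propnames parameter are just my 
proposal, modeled on the existing svn_client_proplist3; nothing like this 
exists yet):

```c
/* Proposed sketch, not an existing API: like svn_client_proplist3(),
 * but invoke RECEIVER only for paths carrying at least one property
 * whose name appears in FILTER_PROPNAMES (an array of const char *),
 * e.g. "bugtraq:url".  A NULL filter behaves like the unfiltered call. */
svn_error_t *
svn_client_proplist4(const char *target,
                     const svn_opt_revision_t *peg_revision,
                     const svn_opt_revision_t *revision,
                     svn_depth_t depth,
                     const apr_array_header_t *changelists,
                     const apr_array_header_t *filter_propnames,
                     svn_proplist_receiver_t receiver,
                     void *receiver_baton,
                     svn_client_ctx_t *ctx,
                     apr_pool_t *pool);
```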

*******************

function to determine whether a path is part of a working copy.
Currently I'm misusing svn_wc_check_wc2() for this, but I'd rather have 
a proper and fast svn_client_is_workingcopy(const char *path) function 
for this.
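
Concretely, something along these lines (purely a sketch; the function 
name and shape are my suggestion only):

```c
/* Proposed sketch, not an existing API: set *IS_WC to TRUE iff
 * LOCAL_ABSPATH lies inside a working copy, without the overhead
 * of a full svn_wc_check_wc2() probe. */
svn_error_t *
svn_client_is_workingcopy(svn_boolean_t *is_wc,
                          const char *local_abspath,
                          svn_client_ctx_t *ctx,
                          apr_pool_t *scratch_pool);
```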

*******************

function to find the wc root path for a given path. I thought there must 
be such a function already, but I couldn't find one.

*******************

saved auth data
Currently, if the user wants to have the auth data saved, it is saved in 
subversion\auth\svn.simple (or another subfolder), keyed by the 
auth realm string. Of course that's the right way to do it, but I 
would like to have other info saved as well, like the repo root URL 
and/or the repository UUID. And then some APIs that allow me to list the 
saved auth data, delete a specific file, and retrieve the login username 
(not the password, for obvious reasons).
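
The kind of API I have in mind could look roughly like this (all names 
here are illustrative, including the proposed repos_root_url/repos_uuid 
fields; none of this exists today):

```c
/* Sketch only.  Walk all saved credentials under CONFIG_DIR and
 * report, per entry, the realm plus the extra fields proposed above. */
typedef svn_error_t *(*svn_auth_cred_walk_func_t)(
    void *baton,
    const char *realmstring,     /* auth realm the file is keyed on */
    const char *repos_root_url,  /* proposed additional field */
    const char *repos_uuid,      /* proposed additional field */
    const char *username,        /* stored login name; never the password */
    apr_pool_t *scratch_pool);

svn_error_t *
svn_auth_walk_saved_creds(const char *config_dir,
                          svn_auth_cred_walk_func_t walk_func,
                          void *walk_baton,
                          apr_pool_t *pool);

/* Sketch only: delete the saved auth file for one specific realm. */
svn_error_t *
svn_auth_delete_saved_creds(const char *config_dir,
                            const char *realmstring,
                            apr_pool_t *pool);
```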

Use cases: users often want to log in with a different username for a 
repository because they're using someone else's workstation/laptop for a 
day or two, or they use a shared workstation but still want each commit 
assigned to the correct user (ugly, but it happens more often than you 
might think). Right now there's no way to find out which saved auth file 
corresponds to a specific repository. So users either have to delete all 
the saved auth data for all repositories, or open each file in a text 
editor and guess from the realm string which is the right file.
If the repo root URL were saved as well, I could do this in TSVN: show 
the user a list of repositories and have them choose the one to clear the 
auth data for.

Another use case: retrieving the saved username would be useful for, 
e.g., integration with issue trackers. Usually the username is the same 
for the repository and the issue tracker (most setups work that way). 
Currently there's no way for the issue tracker plugins to filter the 
list of issues by username automatically, because a username is not 
available. Using the repo root URL or repo UUID, which can be read from a 
working copy, I could then read out the saved username and provide it 
to the issue tracker plugins so they can filter for this username by 
default and reduce the list to that user's open issues.

*******************

function to retrieve the highest "last commit revision" of a whole 
working copy. Currently I'm using the status function to find that 
information. But all such a function would need is a db query with a 
simple comparison in the callback. That would be a lot faster.
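
For comparison, the current status-based workaround boils down to a 
callback like this (a sketch against the 1.7 svn_client_status_func_t 
shape; error handling and the surrounding status call itself omitted):

```c
/* Sketch: track the highest last-changed revision seen during a
 * status walk.  A dedicated db query could replace all of this. */
static svn_error_t *
max_rev_status_cb(void *baton,
                  const char *local_abspath,
                  const svn_client_status_t *status,
                  apr_pool_t *scratch_pool)
{
  svn_revnum_t *max_rev = baton;

  if (SVN_IS_VALID_REVNUM(status->changed_rev)
      && status->changed_rev > *max_rev)
    *max_rev = status->changed_rev;

  return SVN_NO_ERROR;
}
```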

*******************

A new field in the svn_client_status_t struct which holds the size of the 
file in the working copy, or -1 if not known. For most files this 
information should be available automatically, since svn_client_status 
has to do a stat() call on the file anyway to determine its file time, or 
at least when comparing the size to its BASE. So if that information is 
available, I'd like to reuse it and not have to do a stat() call 
again later, which basically doubles the stat() calls and therefore hurts 
performance a lot.
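
As a sketch, the addition could be as small as this (the field name is my 
suggestion; svn_filesize_t and SVN_INVALID_FILESIZE already exist in 
svn_types.h):

```c
/* Proposed extension of svn_client_status_t (sketch): */
svn_filesize_t filesize;   /* size of the file in the working copy,
                              or SVN_INVALID_FILESIZE if not known */
```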

Stefan

-- 
        ___
   oo  // \\      "De Chelonian Mobile"
  (_,\/ \_/ \     TortoiseSVN
    \ \_/_\_/>    The coolest Interface to (Sub)Version Control
    /_/   \_\     http://tortoisesvn.net

Re: functions that would help TSVN

Posted by Stefan Sperling <st...@elego.de>.
On Tue, Mar 01, 2011 at 01:20:20PM +0100, Stefan Küng wrote:
> Let me illustrate this a little bit:
> Assume: 1M properties in 100k folders
> svn_proplist recursive.
> Callback called 100k times
> for every callback:
>  - svn lib allocates memory for the data
>  - calls callback function, passes data
>  - UI client receives the data, copies the data to big memory buffer
>  - svn lib deallocates memory for data
> 
> memory allocations/deallocations are slow, especially in
> multi-threaded processes (meaning: not a big problem for the
> command-line client, but it is for UI clients).
> In this scenario, there are 100k allocations and deallocations which
> could get reduced to one big allocation and one deallocation.

Oh, I see. If the overhead of copying between buffers hurts performance
that much, we can provide an API that passes a buffer to the application.

It would be interesting to implement this for proplist and then perform
measurements to see if it really makes that much of a difference.
Proplist is just one example where we need to pull data out of the
DB and give it to library callers. I anticipate additional APIs that
use the callback-passing approach in the future, so any insights gained
while experimenting with proplist will help.

Re: functions that would help TSVN

Posted by Stefan Küng <to...@gmail.com>.
On Tue, Mar 1, 2011 at 12:45, Stefan Sperling <st...@elego.de> wrote:
>> Since most UI clients need all the data in memory anyway, I'd like
>> to have a separate svn_client_proplist() API that does *one* db
>> query and returns all the results in one go.
>> There are several reasons:
>> * as mentioned, most UI clients will need all data in memory anyway.
>> For example in TSVN I just add the data in the callback to one big
>> list/vector/map and start using that data after the function
>> returns.
>
> I don't think we need a separate function that does the allocations
> on behalf of the callback.
> The callback is free to store the data in any way it wants.

I'm not requesting such a function because I'm lazy and just want svn to
do the work for me. I'm requesting it for performance reasons. Of
course I'm free to store the data any way I want and need in a
callback.
Also, I've never said that the current approach doesn't work, I only
mentioned that it's slower than necessary.

Let me illustrate this a little bit:
Assume: 1M properties in 100k folders
svn_proplist recursive.
Callback called 100k times
for every callback:
 - svn lib allocates memory for the data
 - calls callback function, passes data
 - UI client receives the data, copies the data to big memory buffer
 - svn lib deallocates memory for data

memory allocations/deallocations are slow, especially in
multi-threaded processes (meaning: not a big problem for the
command-line client, but it is for UI clients).
In this scenario, there are 100k allocations and deallocations which
could get reduced to one big allocation and one deallocation.


>> * it is much faster (and I mean *much* faster here, from several
>> seconds or even minutes down to a few milliseconds or maybe two or
>> three seconds)
>> * in case there's not enough RAM available: I can always tell users
>> to install more RAM to get it working. But there's no way to make it
>> faster with the current callback implementation - there just are no
>> faster hard drives or much faster processors.
>
> If the callback takes care of allocations, it can fail more gracefully
> than the libraries can. E.g. the callback could decide to cancel the
> operation, or to display data it's already got, free some memory, and
> continue.
>
>> * the chance that there's not enough RAM available is very small:
>> assuming a million properties, each property using 1kb will result
>> in 1GB of RAM - most computers today have 3GB, new ones have 4GB and
>> more. So even in such rare situations with *huge* working copies the
>> chance of too little RAM is very small.
>
> Some operating systems still have resource limits that are lower than that.

Yes. What's your point?

>> So: for UI clients please provide fast APIs that use more RAM - keep
>> the existing APIs that use as little memory as possible for those
>> clients who need them.
>
> The libraries provide great flexibility with just one API.
> The existing API already gives you the option of using memory the way
> you want. So I don't see a reason to add a special-purpose API that
> does the allocation on behalf of the callback.

Performance. Performance.
That's what my whole post was about.
I never questioned the flexibility, or said that I wasn't able to get the
data I want with the existing APIs.
Again: I want those additional APIs for performance reasons.
Not because I can't get what I want with the existing APIs.
Not because I'm too lazy to do the memory allocations myself in the callback.
Not because the APIs aren't flexible enough.
But because of performance.

I hope I made myself clear this time.

Stefan

-- 
       ___
  oo  // \\      "De Chelonian Mobile"
 (_,\/ \_/ \     TortoiseSVN
   \ \_/_\_/>    The coolest Interface to (Sub)Version Control
   /_/   \_\     http://tortoisesvn.net

Re: functions that would help TSVN

Posted by Stefan Sperling <st...@elego.de>.
On Mon, Feb 28, 2011 at 08:35:44PM +0100, Stefan Küng wrote:
> One other thing I'd like to discuss: currently all svn functions use
> streams and provide the data in callbacks to save memory. While I
> fully understand that, I'd like to have at least the
> svn_client_proplist() function to also provide all results in one
> (big) memory hunk. Because right now, to save memory and to avoid
> timeout problems in the callback, svn_client_proplist() does a db
> query for each and every folder and then calls the callback function
> for every folder separately.
> But that is painfully slow if there are hundreds of folders in a
> working copy - one db query for every folder!

This is only true for the BASE tree. For the working tree, we already
use a single query to pull all properties out of the database at once
(see svn_wc__prop_list_recursive, called by svn_client_proplist).
The plan is to use this approach for the BASE tree, too.

> Since most UI clients need all the data in memory anyway, I'd like
> to have a separate svn_client_proplist() API that does *one* db
> query and returns all the results in one go.
> There are several reasons:
> * as mentioned, most UI clients will need all data in memory anyway.
> For example in TSVN I just add the data in the callback to one big
> list/vector/map and start using that data after the function
> returns.

I don't think we need a separate function that does the allocations
on behalf of the callback.
The callback is free to store the data in any way it wants.

> * it is much faster (and I mean *much* faster here, from several
> seconds or even minutes down to a few milliseconds or maybe two or
> three seconds)
> * in case there's not enough RAM available: I can always tell users
> to install more RAM to get it working. But there's no way to make it
> faster with the current callback implementation - there just are no
> faster hard drives or much faster processors.

If the callback takes care of allocations, it can fail more gracefully
than the libraries can. E.g. the callback could decide to cancel the
operation, or to display data it's already got, free some memory, and
continue.

> * the chance that there's not enough RAM available is very small:
> assuming a million properties, each property using 1kb will result
> in 1GB of RAM - most computers today have 3GB, new ones have 4GB and
> more. So even in such rare situations with *huge* working copies the
> chance of too little RAM is very small.

Some operating systems still have resource limits that are lower than that.

> So: for UI clients please provide fast APIs that use more RAM - keep
> the existing APIs that use as little memory as possible for those
> clients who need them.

The libraries provide great flexibility with just one API.
The existing API already gives you the option of using memory the way
you want. So I don't see a reason to add a special-purpose API that
does the allocation on behalf of the callback.

Re: functions that would help TSVN

Posted by Stefan Küng <to...@gmail.com>.
On 28.02.2011 15:22, C. Michael Pilato wrote:
> On 02/27/2011 03:39 AM, Stefan Küng wrote:
>> function to determine whether a path is part of a working copy.
>> Currently I'm misusing svn_wc_check_wc2() for this, but I'd rather have a
>> proper and fast svn_client_is_workingcopy(const char* path) function for this.
>
> I've often wondered about this myself.  Of all the "duh" functions you'd
> think we'd offer, this one is pretty near the top.  And yet folks typically
> find themselves just messing around with check_wc() or status() or ...
>
>> function to find the wc root path for a given path. I thought there must be
>> such a function already but I couldn't find one.
>
> There's this public one I added a few months ago or so:
>
> svn_error_t *
> svn_wc_get_wc_root(const char **wcroot_abspath,
>                     svn_wc_context_t *wc_ctx,
>                     const char *local_abspath,
>                     apr_pool_t *scratch_pool,
>                     apr_pool_t *result_pool);
>
> I'm more than happy to make an svn_client wrapper for this, and demote this to
> a private WC function, though!

Thanks! I hadn't seen/found this one. I guess I have to adjust my 
search strings in the future.
I can use this function as-is, no need to provide an svn_client wrapper. 
But maybe other clients could use such a wrapper...

>
>> A new field in the svn_client_status_t struct which has the size of the file
>> in the working copy, or -1 if not known. For most files this information
>> should be available automatically since svn_client_status has to do a stat()
>> call on the file anyway to determine its file time or at least when
>> comparing the size to its BASE. So if that information is available, I'd
>> like to reuse that info and not have to do a stat() call again later,
>> basically doubling the stat() calls and therefore hurting the performance a
>> lot.
>
> Makes sense to me.
>


One other thing I'd like to discuss: currently all svn functions use 
streams and provide the data in callbacks to save memory. While I fully 
understand that, I'd like to have at least the svn_client_proplist() 
function to also provide all results in one (big) memory hunk. Because 
right now, to save memory and to avoid timeout problems in the callback, 
svn_client_proplist() does a db query for each and every folder and then 
calls the callback function for every folder separately.
But that is painfully slow if there are hundreds of folders in a working 
copy - one db query for every folder!
Since most UI clients need all the data in memory anyway, I'd like to 
have a separate svn_client_proplist() API that does *one* db query and 
returns all the results in one go.
There are several reasons:
* as mentioned, most UI clients will need all data in memory anyway. For 
example in TSVN I just add the data in the callback to one big 
list/vector/map and start using that data after the function returns.
* it is much faster (and I mean *much* faster here, from several seconds 
or even minutes down to a few milliseconds or maybe two or three seconds)
* in case there's not enough RAM available: I can always tell users to 
install more RAM to get it working. But there's no way to make it faster 
with the current callback implementation - there just are no faster 
hard drives or much faster processors.
* the chance that there's not enough RAM available is very small: 
assuming a million properties, each property using 1kb will result in 
1GB of RAM - most computers today have 3GB, new ones have 4GB and more. 
So even in such rare situations with *huge* working copies the chance of 
too little RAM is very small.

So: for UI clients please provide fast APIs that use more RAM - keep the 
existing APIs that use as little memory as possible for those clients who 
need them.

Stefan

-- 
        ___
   oo  // \\      "De Chelonian Mobile"
  (_,\/ \_/ \     TortoiseSVN
    \ \_/_\_/>    The coolest Interface to (Sub)Version Control
    /_/   \_\     http://tortoisesvn.net

Re: functions that would help TSVN

Posted by "C. Michael Pilato" <cm...@collab.net>.
On 02/27/2011 03:39 AM, Stefan Küng wrote:
> function to determine whether a path is part of a working copy.
> Currently I'm misusing svn_wc_check_wc2() for this, but I'd rather have a
> proper and fast svn_client_is_workingcopy(const char* path) function for this.

I've often wondered about this myself.  Of all the "duh" functions you'd
think we'd offer, this one is pretty near the top.  And yet folks typically
find themselves just messing around with check_wc() or status() or ...

> function to find the wc root path for a given path. I thought there must be
> such a function already but I couldn't find one.

There's this public one I added a few months ago or so:

svn_error_t *
svn_wc_get_wc_root(const char **wcroot_abspath,
                   svn_wc_context_t *wc_ctx,
                   const char *local_abspath,
                   apr_pool_t *scratch_pool,
                   apr_pool_t *result_pool);

I'm more than happy to make an svn_client wrapper for this, and demote this to
a private WC function, though!

> A new field in the svn_client_status_t struct which has the size of the file
> in the working copy, or -1 if not known. For most files this information
> should be available automatically since svn_client_status has to do a stat()
> call on the file anyway to determine its file time or at least when
> comparing the size to its BASE. So if that information is available, I'd
> like to reuse that info and not have to do a stat() call again later,
> basically doubling the stat() calls and therefore hurting the performance a
> lot.

Makes sense to me.

-- 
C. Michael Pilato <cm...@collab.net>
CollabNet   <>   www.collab.net   <>   Distributed Development On Demand