Posted to dev@subversion.apache.org by Stefan Küng <to...@gmail.com> on 2011/02/05 10:28:20 UTC

request for new API function

Hi,

To find all files and folders that have a specific property set I need 
to crawl the whole working copy and fetch the properties of each and 
every item, then scan the returned property list for that property.
But WC-NG uses an SQLite db so this task should be much faster with a 
lot less disk access.

I'd like to ask for a new API which would allow me to search the 
database for properties much faster. Something like

svn_wc_propsearch()

with parameters:

* string to search for
* value indicating how the search is done (equal, contains, begins with, 
... basically all the possible match functions SQLite offers)
* bool indicating whether the search is done on the NODES table or the 
ACTUAL_NODE table
* callback function which receives the search results

The callback function would receive not just the properties and paths of 
the items the found properties belong to, but also any information that 
the table row contains (because when information is available for 
free, an API should return it to avoid forcing clients to call yet 
another API to get that info - if clients don't need the info they can 
just ignore it).
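
A minimal sketch of the kind of search described above, in Python against a toy in-memory SQLite database. The `node_props` and `actual_props` tables, their columns, the `prop_search` function and the match-mode names are all assumptions made for illustration; the real wc.db tables store properties in a different form, and no `svn_wc_propsearch()` exists.

```python
import sqlite3

# Hypothetical match modes mapped to SQL predicates on the property name.
# instr() returns the 1-based position of the first match, or 0 if none.
MATCH_MODES = {
    "equal": "prop_name = ?",
    "contains": "instr(prop_name, ?) > 0",
    "begins_with": "instr(prop_name, ?) = 1",
}


def make_toy_db():
    """Build a toy stand-in for wc.db with one row per (path, property)."""
    db = sqlite3.connect(":memory:")
    for table in ("node_props", "actual_props"):
        db.execute(
            f"CREATE TABLE {table} (local_relpath TEXT, prop_name TEXT,"
            " prop_value TEXT, changed_author TEXT)"
        )
    db.executemany(
        "INSERT INTO node_props VALUES (?, ?, ?, ?)",
        [
            ("trunk", "svn:ignore", "*.o", "alice"),
            ("trunk/run.sh", "svn:executable", "*", "bob"),
            ("trunk/doc", "tsvn:logminsize", "10", "alice"),
        ],
    )
    return db


def prop_search(db, needle, mode, use_actual, callback):
    """Call callback(path, name, value, author) for each matching row."""
    table = "actual_props" if use_actual else "node_props"
    sql = (
        "SELECT local_relpath, prop_name, prop_value, changed_author"
        f" FROM {table} WHERE {MATCH_MODES[mode]} ORDER BY local_relpath"
    )
    # fetchall() finishes the statement before any callback runs.
    for row in db.execute(sql, (needle,)).fetchall():
        callback(*row)


if __name__ == "__main__":
    prop_search(make_toy_db(), "svn:", "begins_with", False, print)
```

The extra `changed_author` column stands in for "any information that the table row contains": the callback gets it along with the property at no additional cost.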

And while we're at it: new APIs to search for other columns in the NODES 
and ACTUAL_NODE table would also be nice and useful:
* search for 'changed_revision' and/or 'changed_date': allows showing a 
view with all files that haven't changed for a long time, or files that 
have changed recently, or ...
* search for 'changed_author': allows showing a quick view of who 
changed what last, gathering statistics, ...
* search for 'depth' to quickly determine whether the working copy is 
full or may be missing something
* search for 'file_external'
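
The column searches listed above could look roughly like this. This is only a sketch in Python on a toy table: the `nodes` layout, the plain-integer `changed_date`, and the `search_nodes` helper with its operator names are assumptions for illustration, not the real wc.db schema or any existing Subversion API.

```python
import sqlite3

# Columns named in the list above, plus the comparisons this sketch allows.
SEARCHABLE = {"changed_revision", "changed_date", "changed_author",
              "depth", "file_external"}
OPERATORS = {"eq": "=", "ne": "<>", "lt": "<", "gt": ">"}


def make_toy_nodes():
    """Toy stand-in for the NODES table; the real schema has more columns."""
    db = sqlite3.connect(":memory:")
    db.execute(
        "CREATE TABLE nodes (local_relpath TEXT, changed_revision INTEGER,"
        " changed_date INTEGER, changed_author TEXT, depth TEXT,"
        " file_external TEXT)"
    )
    db.executemany(
        "INSERT INTO nodes VALUES (?, ?, ?, ?, ?, ?)",
        [
            ("", 10, 1000, "alice", "infinity", None),
            ("docs", 7, 700, "alice", "files", None),
            ("src", 12, 1200, "bob", "infinity", None),
            ("src/main.c", 12, 1200, "bob", None, None),
        ],
    )
    return db


def search_nodes(db, column, op, value):
    """Return the paths whose `column` compares to `value` under `op`."""
    # Column and operator are checked against fixed sets because they are
    # spliced into the SQL text; only the value is a bound parameter.
    if column not in SEARCHABLE or op not in OPERATORS:
        raise ValueError("unsupported column or operator")
    sql = (f"SELECT local_relpath FROM nodes WHERE {column} {OPERATORS[op]} ?"
           " ORDER BY local_relpath")
    return [row[0] for row in db.execute(sql, (value,))]
```

For example, `search_nodes(db, "depth", "ne", "infinity")` lists the directories that are not at full depth, which is the "may be missing something" check; rows with a NULL depth are skipped by the SQL comparison.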

With such new APIs, clients could use the advantages of WC-NG too.

A lot of ideas I had couldn't be done before because it would have been 
just too slow. But now with WC-NG and the database it would be fast enough.

Thoughts?

Stefan

-- 
        ___
   oo  // \\      "De Chelonian Mobile"
  (_,\/ \_/ \     TortoiseSVN
    \ \_/_\_/>    The coolest Interface to (Sub)Version Control
    /_/   \_\     http://tortoisesvn.net

Re: request for new API function

Posted by Stefan Sperling <st...@elego.de>.
On Sat, Feb 05, 2011 at 05:15:22PM +0100, Stefan Küng wrote:
> On 05.02.2011 16:46, Stefan Sperling wrote:
> >What if the amount of information requested simply doesn't fit
> >into memory? I'd prefer a system that cannot fail in this way.
> >I'd prefer passing the information to the client piece by piece,
> >and letting the client worry about where to store it.
> >If at all possible, the library shouldn't allocate huge slabs of
> >memory outside of the client's control.
> 
> Sure, that would be the ideal way. But if it's not possible to force
> clients to behave and if that's a requirement, allocating a lot of
> memory might be the only alternative.

The other alternative that's being considered is to run per-directory
queries in the compat code. Then we can invoke callbacks once per
directory, outside of an sqlite transaction context, so that callbacks
written for 1.6.x and earlier cannot cause deadlocks on the DB.

This might mean that code compiled against 1.6.x runs a bit slower
against 1.7 libraries. But it will still run and function correctly.
The amount of memory allocated would be a function of the contents of
one directory (vs. the content of an entire working copy...)
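
To make the per-directory idea concrete, here is a small Python sketch on a toy table. The `nodes` layout and the `walk_per_directory` helper are hypothetical and only illustrate the pattern; this is not the compat code itself.

```python
import sqlite3


def make_toy_wc():
    """Toy working-copy table: one row per node, keyed by its parent."""
    db = sqlite3.connect(":memory:")
    db.execute(
        "CREATE TABLE nodes (local_relpath TEXT, parent_relpath TEXT,"
        " kind TEXT)"
    )
    db.executemany(
        "INSERT INTO nodes VALUES (?, ?, ?)",
        [
            ("", None, "dir"),
            ("a", "", "dir"),
            ("b.txt", "", "file"),
            ("a/c.txt", "a", "file"),
        ],
    )
    return db


def walk_per_directory(db, root, callback):
    """Report every node below `root`, running one query per directory."""
    pending = [root]
    while pending:
        parent = pending.pop()
        # fetchall() drains and finishes the statement, so no query is
        # in progress while the callbacks for this directory run.
        children = db.execute(
            "SELECT local_relpath, kind FROM nodes"
            " WHERE parent_relpath = ? ORDER BY local_relpath",
            (parent,),
        ).fetchall()
        for relpath, kind in children:
            callback(relpath, kind)
            if kind == "dir":
                pending.append(relpath)
```

Only the children of one directory are held in memory at a time (plus the list of directories still to visit), and the callback is free to run its own queries on the same database.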

> >How is your query any different from the new proplist API and
> >implementation I added in r1039808? I think that provides what you need
> >(for the proplist case). It opens the db, runs a query on it and streams
> >the results to the client via a callback. Very low overhead.
> 
> Haven't tried that proplist API. I just reviewed it and it really is
> what I need. Too bad it got reverted. I hope it will get put back in
> soon.

I hope to be able to put it back in.

I would like to rev all affected callbacks, and document what they're
not allowed to do. If a callback behaves improperly, the entire working
copy can deadlock. Maybe we can even try to detect deadlocks and error
out (see http://svn.haxx.se/dev/archive-2011-01/0219.shtml).

This is really a design problem for the most part.
Implementing it is fairly simple once we've decided what we want to do.

It's nice to know that TortoiseSVN would be happy with this approach.

Stefan

Re: request for new API function

Posted by Stefan Küng <to...@gmail.com>.
On 05.02.2011 16:46, Stefan Sperling wrote:
> On Sat, Feb 05, 2011 at 04:22:29PM +0100, Stefan Küng wrote:
>> On 05.02.2011 13:56, Stefan Sperling wrote:
>>
>>> I think we should go in this direction.
>>> In fact, I think we should simply change the existing APIs to use
>>> the fastest possible way of getting at information.
>>
>> Well, currently there is no API that does what I suggested
>> (basically return all results of a db query without even touching
>> any files in the WC or having to do this for every file/folder
>> separately).
>
> Well, the svn_proplist case I'm looking at is the same thing.
> I want to answer the request "Give me all properties within this
> working copy" by issuing as few sqlite queries as possible.
>
> You are talking about such requests in general.
> I am talking about one specific instance (proplist).
> But essentially we want the same thing.

Yes, the proplist case is what I need.
But of course, the other cases I listed are also valid requests. For 
example "give me all entries whose last commit was by author XXX" or 
"give me all paths that were changed since date XXX". 
These could also be answered by sqlite queries alone.

Or even "give me all paths in this working copy" would also be a big 
help. Getting that information without the sqlite db would require 
crawling the whole working copy, resulting in a lot of disk access.

>> I've read up on that thread. It seems the problem you're facing
>> comes from the fact that you need to stay compatible with pre 1.7
>> APIs and clients, and the fact that you can't force clients to
>> behave, only to ask them to behave and then hope for the best.
>
> Yes.
>
>> However what I'm asking for here are *new* APIs which do something
>> no existing API currently does. So staying compatible wouldn't be a
>> problem.
>
> New APIs don't make the problems we have with existing APIs go away.
>
> The backwards compat problem doesn't affect TortoiseSVN, since
> you will simply provide builds linked to 1.7 and tell all users to upgrade.
>
> But we need to keep existing clients that were compiled against 1.6.x
> and earlier working.  So it's not "not a problem", it's just not
> TortoiseSVN's problem :)

Sorry, I didn't realize the problem already exists. I was under the 
impression that only the new API, the one that was reverted a day later, 
had that problem and that's why it got reverted.


>> And if you're worried about clients not behaving properly,
>> why not get rid of the callback completely and just return all
>> information at once in one big chunk of memory.
>> Talking about UI clients, this won't be a problem because they
>> usually have to store all information they receive in a callback
>> anyway so they have it ready to show in the UI. So for them, the
>> memory use wouldn't be bigger at all.
>
> Really? Even with gigantic working copies?
> What if the amount of information requested simply doesn't fit
> into memory? I'd prefer a system that cannot fail in this way.
> I'd prefer passing the information to the client piece by piece,
> and letting the client worry about where to store it.
> If at all possible, the library shouldn't allocate huge slabs of
> memory outside of the client's control.

Sure, that would be the ideal way. But if it's not possible to force 
clients to behave and if that's a requirement, allocating a lot of 
memory might be the only alternative.

>> Of course, those APIs I'm asking for might not be very useful for
>> existing APIs or other stuff that is done in the svn library. Those
>> might only be useful for some svn clients. But I hope that's not a
>> blocker for implementing those.
>
> I hope that we'll get a good set of APIs for 1.7 that will
> satisfy all clients out there, including TortoiseSVN.
> What these APIs will look like isn't set in stone yet.
>
>> I also thought of just querying the SQLite db myself directly, but then
>> I don't like to do something that's not really allowed.
>> However: I did a quick test with the Check-for-modifications dialog
>> in TSVN. It has a feature where you can enable showing all
>> properties. To do that, a separate thread is started which lists all
>> properties of all items in the working copy. On one of my working
>> copies, this takes about 50 seconds. A simple SQLite query on
>> the NODES table took on average 1260 ms. Parsing the data and
>> preparing it for use in the UI took another 3.5 seconds. Now *that*
>> is a speed improvement I really like.
>
> How is your query any different from the new proplist API and
> implementation I added in r1039808? I think that provides what you need
> (for the proplist case). It opens the db, runs a query on it and streams
> the results to the client via a callback. Very low overhead.

Haven't tried that proplist API. I just reviewed it and it really is 
what I need. Too bad it got reverted. I hope it will get put back in soon.

Stefan


Re: request for new API function

Posted by Stefan Sperling <st...@elego.de>.
On Sat, Feb 05, 2011 at 04:22:29PM +0100, Stefan Küng wrote:
> On 05.02.2011 13:56, Stefan Sperling wrote:
> 
> >I think we should go in this direction.
> >In fact, I think we should simply change the existing APIs to use
> >the fastest possible way of getting at information.
> 
> Well, currently there is no API that does what I suggested
> (basically return all results of a db query without even touching
> any files in the WC or having to do this for every file/folder
> separately).

Well, the svn_proplist case I'm looking at is the same thing.
I want to answer the request "Give me all properties within this
working copy" by issuing as few sqlite queries as possible.

You are talking about such requests in general.
I am talking about one specific instance (proplist).
But essentially we want the same thing.

> I've read up on that thread. It seems the problem you're facing
> comes from the fact that you need to stay compatible with pre 1.7
> APIs and clients, and the fact that you can't force clients to
> behave, only to ask them to behave and then hope for the best.

Yes.

> However what I'm asking for here are *new* APIs which do something
> no existing API currently does. So staying compatible wouldn't be a
> problem.

New APIs don't make the problems we have with existing APIs go away.

The backwards compat problem doesn't affect TortoiseSVN, since
you will simply provide builds linked to 1.7 and tell all users to upgrade.

But we need to keep existing clients that were compiled against 1.6.x
and earlier working.  So it's not "not a problem", it's just not
TortoiseSVN's problem :)

> And if you're worried about clients not behaving properly,
> why not get rid of the callback completely and just return all
> information at once in one big chunk of memory.
> Talking about UI clients, this won't be a problem because they
> usually have to store all information they receive in a callback
> anyway so they have it ready to show in the UI. So for them, the
> memory use wouldn't be bigger at all.

Really? Even with gigantic working copies?
What if the amount of information requested simply doesn't fit
into memory? I'd prefer a system that cannot fail in this way.
I'd prefer passing the information to the client piece by piece,
and letting the client worry about where to store it.
If at all possible, the library shouldn't allocate huge slabs of
memory outside of the client's control.
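
The piece-by-piece delivery preferred here can be sketched in Python with a generator that hands rows over in small batches. The `props` table, `make_toy_props` and `stream_rows` are hypothetical names for illustration only, not part of any Subversion API.

```python
import sqlite3


def make_toy_props(count):
    """Toy table with one property row per file."""
    db = sqlite3.connect(":memory:")
    db.execute("CREATE TABLE props (local_relpath TEXT, prop_name TEXT)")
    db.executemany(
        "INSERT INTO props VALUES (?, ?)",
        [(f"file{i}.txt", "svn:eol-style") for i in range(count)],
    )
    return db


def stream_rows(db, sql, params=(), batch_size=2):
    """Yield query results in batches of at most `batch_size` rows."""
    cursor = db.execute(sql, params)
    while True:
        # Only one batch is held by this function at any time; the
        # consumer decides what to keep and where to store it.
        batch = cursor.fetchmany(batch_size)
        if not batch:
            break
        yield batch


if __name__ == "__main__":
    db = make_toy_props(5)
    query = "SELECT local_relpath, prop_name FROM props ORDER BY local_relpath"
    for batch in stream_rows(db, query):
        print(len(batch), "rows")
```

Unlike a single fetch of the whole result, the cursor stays open between batches, so whatever consumes the batches runs in the middle of the query. That is the situation in which a misbehaving callback can cause trouble.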
 
> Of course, those APIs I'm asking for might not be very useful for
> existing APIs or other stuff that is done in the svn library. Those
> might only be useful for some svn clients. But I hope that's not a
> blocker for implementing those.

I hope that we'll get a good set of APIs for 1.7 that will
satisfy all clients out there, including TortoiseSVN.
What these APIs will look like isn't set in stone yet.

> I also thought of just querying the SQLite db myself directly, but then
> I don't like to do something that's not really allowed.
> However: I did a quick test with the Check-for-modifications dialog
> in TSVN. It has a feature where you can enable showing all
> properties. To do that, a separate thread is started which lists all
> properties of all items in the working copy. On one of my working
> copies, this takes about 50 seconds. A simple SQLite query on
> the NODES table took on average 1260 ms. Parsing the data and
> preparing it for use in the UI took another 3.5 seconds. Now *that*
> is a speed improvement I really like.

How is your query any different from the new proplist API and
implementation I added in r1039808? I think that provides what you need
(for the proplist case). It opens the db, runs a query on it and streams
the results to the client via a callback. Very low overhead.

Stefan

Re: request for new API function

Posted by Stefan Küng <to...@gmail.com>.
On 05.02.2011 13:56, Stefan Sperling wrote:

> I think we should go in this direction.
> In fact, I think we should simply change the existing APIs to use
> the fastest possible way of getting at information.

Well, currently there is no API that does what I suggested (basically 
return all results of a db query without even touching any files in the 
WC or having to do this for every file/folder separately).

> Most code we have still crawls the working copy, and that is an
> artifact of how the 1.6.x working copy was structured.
> We're now at single DB, but we're not yet using the single DB to
> its full potential. We're currently treating it more or less like
> a key/value store for information about paths.
>
> Have you seen r1039808 and the resulting "Sqlite and callbacks"
> thread on dev@? That thread describes some of the issues we're facing
> with the interaction of callbacks in our APIs and sqlite queries.
>
> There were two approaches discussed in that thread. I am currently
> experimenting with the "queries per-directory" approach (see r1051452
> and r1066541). I'm expecting this to be too slow, but I'm doing it
> anyway for two reasons. One is that we'll have real data to look at.
> The other is that we might need code that does per-directory queries
> anyway to satisfy backwards compatibility constraints (see the
> "sqlite and callbacks" thread for details).
>
> I think we will eventually need to query the database like people would
> normally query a database, letting sqlite do most of the work of pulling data
> out of the db. However, we need to agree on how to deal with the
> implications this has for the existing APIs.

I've read up on that thread. It seems the problem you're facing comes 
from the fact that you need to stay compatible with pre 1.7 APIs and 
clients, and the fact that you can't force clients to behave, only to 
ask them to behave and then hope for the best.

However what I'm asking for here are *new* APIs which do something no 
existing API currently does. So staying compatible wouldn't be a 
problem. And if you're worried about clients not behaving properly, why 
not get rid of the callback completely and just return all information 
at once in one big chunk of memory.
Talking about UI clients, this won't be a problem because they usually 
have to store all information they receive in a callback anyway so they 
have it ready to show in the UI. So for them, the memory use wouldn't be 
bigger at all.

Of course, those APIs I'm asking for might not be very useful for 
existing APIs or other stuff that is done in the svn library. Those 
might only be useful for some svn clients. But I hope that's not a 
blocker for implementing those.

I also thought of just querying the SQLite db myself directly, but then I 
don't like to do something that's not really allowed.
However: I did a quick test with the Check-for-modifications dialog in 
TSVN. It has a feature where you can enable showing all properties. To 
do that, a separate thread is started which lists all properties of all 
items in the working copy. On one of my working copies, this takes about 
50 seconds. A simple SQLite query on the NODES table took on average 
1260 ms. Parsing the data and preparing it for use in the UI took 
another 3.5 seconds. Now *that* is a speed improvement I really like.


Stefan


Re: request for new API function

Posted by Stefan Sperling <st...@elego.de>.
On Sat, Feb 05, 2011 at 01:56:41PM +0100, Stefan Sperling wrote:
> There were two approaches discussed in that thread. I am currently
> experimenting with the "queries per-directory" approach (see r1051452
> and r1066541).

Sorry, I meant r1050650, not r1051452.

Re: request for new API function

Posted by Stefan Sperling <st...@elego.de>.
On Sat, Feb 05, 2011 at 10:28:20AM +0100, Stefan Küng wrote:
> Hi,
> 
> To find all files and folders that have a specific property set I
> need to crawl the whole working copy and fetch the properties of
> each and every item, then scan the returned property list for that
> property.
> But WC-NG uses an SQLite db so this task should be much faster with
> a lot less disk access.
> 
> I'd like to ask for a new API which would allow me to search the
> database for properties much faster. Something like
> 
> svn_wc_propsearch()
> 
> with parameters:
> 
> * string to search for
> * value indicating how the search is done (equal, contains, begins
> with, ... basically all the possible match functions SQLite offers)
> * bool indicating whether the search is done on the NODES table or
> the ACTUAL_NODE table
> * callback function which receives the search results
> 
> The callback function would receive not just the properties and
> paths of the items the found properties belong to, but also any
> information that the table row contains (because when information
> is available for free, an API should return it to avoid forcing
> clients to call yet another API to get that info - if clients don't
> need the info they can just ignore it).
> 
> And while we're at it: new APIs to search for other columns in the
> NODES and ACTUAL_NODE table would also be nice and useful:
> * search for 'changed_revision' and/or 'changed_date': allows showing
> a view with all files that haven't changed for a long time, or
> files that have changed recently, or ...
> * search for 'changed_author': allows showing a quick view of who
> changed what last, gathering statistics, ...
> * search for 'depth' to quickly determine whether the working copy
> is full or may be missing something
> * search for 'file_external'
> 
> With such new APIs, clients could use the advantages of WC-NG too.
> 
> A lot of ideas I had couldn't be done before because it would have
> been just too slow. But now with WC-NG and the database it would be
> fast enough.
> 
> Thoughts?

I think we should go in this direction.
In fact, I think we should simply change the existing APIs to use
the fastest possible way of getting at information.

Most code we have still crawls the working copy, and that is an
artifact of how the 1.6.x working copy was structured.
We're now at single DB, but we're not yet using the single DB to
its full potential. We're currently treating it more or less like
a key/value store for information about paths.

Have you seen r1039808 and the resulting "Sqlite and callbacks"
thread on dev@? That thread describes some of the issues we're facing
with the interaction of callbacks in our APIs and sqlite queries.

There were two approaches discussed in that thread. I am currently
experimenting with the "queries per-directory" approach (see r1051452
and r1066541). I'm expecting this to be too slow, but I'm doing it
anyway for two reasons. One is that we'll have real data to look at.
The other is that we might need code that does per-directory queries
anyway to satisfy backwards compatibility constraints (see the
"sqlite and callbacks" thread for details).

I think we will eventually need to query the database like people would
normally query a database, letting sqlite do most of the work of pulling data
out of the db. However, we need to agree on how to deal with the
implications this has for the existing APIs.

Stefan