Posted to dev@subversion.apache.org by Marc Strapetz <ma...@syntevo.com> on 2014/02/14 09:25:57 UTC

RFE: API for an efficient retrieval of server-side mergeinfo data

For SmartSVN we are optionally displaying merge arrows in the Revision
Graph. Here is a sample image showing what this looks like:

http://imgur.com/MzrLq00

From the JavaHL sources I understand that there is currently only one
method to retrieve server-side mergeinfo and this one works on a single
revision only:

Map<String, Mergeinfo> getMergeinfo(Iterable<String> paths,
                                    long revision,
                                    Mergeinfo.Inheritance inherit,
                                    boolean includeDescendants)

This makes the Merge Arrow feature practically unusable for larger graphs.

To improve performance, in earlier versions we were using a client-side
mergeinfo cache (similar to the main log cache, which TSVN uses as
well). However, populating this cache (i.e. querying for mergeinfo for
*every* revision of the repository) often resulted in bringing the
entire Apache server down, especially if many users were building their
log cache at the same time.

To address these problems, it would be great to have a more powerful
API that allows retrieving all mergeinfo either for a *revision
range* or for a *set of revisions*.

Querying a set of revisions would be more flexible and would allow
merge arrows to be generated on the fly. On the other hand, to relieve
the server, it's desirable to cache retrieved mergeinfo on the client
side anyway, hence a range query would be fine as well.

-Marc

Re: RFE: API for an efficient retrieval of server-side mergeinfo data

Posted by Marc Strapetz <ma...@syntevo.com>.
On 14.02.2014 11:38, Julian Foad wrote:
> Marc Strapetz wrote:
>> For SmartSVN we are optionally displaying merge arrows in the Revision
>> Graph. Here is a sample image, how this looks like:
>>
>> http://imgur.com/MzrLq00
>>
>> From the JavaHL sources I understand that there is currently only one
>> method to retrieve server-side mergeinfo and this one works on a single
>> revision only:
>>
>> Map<String, Mergeinfo> getMergeinfo(Iterable<String> paths,
>>                                     long revision,
>>                                     Mergeinfo.Inheritance inherit,
>>                                     boolean includeDescendants)
> 
> Right. This is a wrapper around the core library function svn_ra_get_mergeinfo().
> 
>> This makes the Merge Arrow feature practically unusable for larger graphs.
>>
>> To improve performance, in earlier versions we were using a client-side
>> mergeinfo cache (similar as the main log-cache, which TSVN is using as
>> well). However, populating this cache (i.e. querying for mergeinfo for
>> *every* revision of the repository) often resulted in bringing the
>> entire Apache server down, especially if many users were building their
>> log cache at the same time.
>>
>> To address these problems, it would be great to have a more powerful
>> API, which allows either to retrieve all mergeinfo for a *revision
>> range* or for a *set of revisions*.
> 
> The request for a more powerful API certainly makes sense, but what form of API?
> 
> In the Subversion project source code:
> 
>   # How many lines/bytes of mergeinfo in trunk, right now?
>   $ svn pg -R svn:mergeinfo | wc -lc
>     245   24063
> 
>   # How many branches and tags?
>   $ svn ls ^/subversion/tags/ ^/subversion/branches/ | wc -l
>   288
> 
>   # Approx. total lines/bytes mergeinfo per revision?
>   $ echo $((245 * 289)) $((24063 * 289))
>   70805 6954207
> 
> So in each revision  there are roughly 70,000 lines of mergeinfo, occupying 7 MB in plain text representation.
> 
> The mergeinfo properties change whenever a merge is done. All other commits leave all the mergeinfo unchanged. So mergeinfo is unchanged in, what, 99% of revisions?
> 
> It doesn't seem logical to simply request all the mergeinfo for each revision in turn, and return it all in raw form.
> 
> Can we think of a better way to design the API so that it returns the interesting data without all the redundancy? Basically I think we want to describe changes to mergeinfo, rather than raw mergeinfo.

True, on the client side we are actually interested in the diff anyway.
So some kind of callback:

interface MergeInfoDiffCallback {
  void mergeInfoDiff(int revision, Mergeinfo added, Mergeinfo removed);
}

would be convenient. This would work for revision ranges as well as a
set of revisions.
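
To illustrate what such a callback would carry, here is a minimal sketch
of computing the added and removed parts between two mergeinfo
snapshots. It assumes a simplified model (merge source path mapped to a
set of merged revision numbers) rather than JavaHL's Mergeinfo class;
the class and method names are hypothetical.

```java
import java.util.Map;
import java.util.Set;
import java.util.SortedSet;
import java.util.TreeMap;
import java.util.TreeSet;

// Hypothetical, simplified model: mergeinfo as "source path -> merged revisions".
// This is NOT JavaHL's Mergeinfo class.
final class MergeinfoDiffSketch {
    /** Returns the (path, revision) entries present in 'left' but missing from 'right'. */
    static Map<String, SortedSet<Long>> minus(Map<String, ? extends Set<Long>> left,
                                              Map<String, ? extends Set<Long>> right) {
        Map<String, SortedSet<Long>> result = new TreeMap<>();
        for (Map.Entry<String, ? extends Set<Long>> e : left.entrySet()) {
            SortedSet<Long> revs = new TreeSet<>(e.getValue());
            Set<Long> other = right.get(e.getKey());
            if (other != null) {
                revs.removeAll(other);
            }
            if (!revs.isEmpty()) {
                result.put(e.getKey(), revs);
            }
        }
        return result;
    }
}
```

With this model, "added" would be minus(after, before) and "removed"
would be minus(before, after) for the mergeinfo of one path in two
consecutive revisions.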

-Marc

Re: RFE: API for an efficient retrieval of server-side mergeinfo data

Posted by Julian Foad <ju...@btopenworld.com>.
Branko Čibej wrote:

> On 14.02.2014 11:38, Julian Foad wrote:

>> Can we think of a better way to design the API so that it returns the
>> interesting data without all the redundancy? Basically I think we want
>> to describe changes to mergeinfo, rather than raw mergeinfo.
> 
> I wonder, Julian, could something like this be useful for merge in general?
> 
> We know that clients can cache most of the mergeinfo in the
> repository, if they want to; I just don't have any feeling for how
> much sense it would make to maintain such a cache, and if it can be
> made smart enough to speed up merging significantly.


I wasn't sure how much mergeinfo we fetch in a typical merge so I tried some merges with current svn branches. They all fetched mergeinfo either two or three times, all at the head revision, and the time taken to fetch it was not a substantial portion of the overall merge time. So I think the answer is we wouldn't currently benefit from this within the scope of one merge. (A persistent cache on the client machine is a different matter.)


- Julian


Re: RFE: API for an efficient retrieval of server-side mergeinfo data

Posted by Branko Čibej <br...@wandisco.com>.
On 14.02.2014 11:38, Julian Foad wrote:
> Marc Strapetz wrote:
>> For SmartSVN we are optionally displaying merge arrows in the Revision
>> Graph. Here is a sample image, how this looks like:
>>
>> http://imgur.com/MzrLq00
>>
>> From the JavaHL sources I understand that there is currently only one
>> method to retrieve server-side mergeinfo and this one works on a single
>> revision only:
>> Map<String, Mergeinfo> getMergeinfo(Iterable<String> paths,
>>                                     long revision,
>>                                     Mergeinfo.Inheritance inherit,
>>                                     boolean includeDescendants)
> Right. This is a wrapper around the core library function svn_ra_get_mergeinfo().
>
>> This makes the Merge Arrow feature practically unusable for larger graphs.
>>
>> To improve performance, in earlier versions we were using a client-side
>> mergeinfo cache (similar as the main log-cache, which TSVN is using as
>> well). However, populating this cache (i.e. querying for mergeinfo for
>> *every* revision of the repository) often resulted in bringing the
>> entire Apache server down, especially if many users were building their
>> log cache at the same time.
>>
>> To address these problems, it would be great to have a more powerful
>> API, which allows either to retrieve all mergeinfo for a *revision
>> range* or for a *set of revisions*.
> The request for a more powerful API certainly makes sense, but what form of API?
>
> In the Subversion project source code:
>
>   # How many lines/bytes of mergeinfo in trunk, right now?
>   $ svn pg -R svn:mergeinfo | wc -lc
>     245   24063
>
>   # How many branches and tags?
>   $ svn ls ^/subversion/tags/ ^/subversion/branches/ | wc -l
>   288
>
>   # Approx. total lines/bytes mergeinfo per revision?
>   $ echo $((245 * 289)) $((24063 * 289))
>   70805 6954207
>
> So in each revision  there are roughly 70,000 lines of mergeinfo, occupying 7 MB in plain text representation.
>
> The mergeinfo properties change whenever a merge is done. All other commits leave all the mergeinfo unchanged. So mergeinfo is unchanged in, what, 99% of revisions?
>
> It doesn't seem logical to simply request all the mergeinfo for each revision in turn, and return it all in raw form.
>
> Can we think of a better way to design the API so that it returns the interesting data without all the redundancy? Basically I think we want to describe changes to mergeinfo, rather than raw mergeinfo.

I wonder, Julian, could something like this be useful for improving
merge in general?

We know that clients can cache most of the mergeinfo in the repository,
if they want to; I just don't have any feeling for how much sense it
would make to maintain such a cache, and if it can be made smart enough
to speed up merging significantly.

-- Brane


-- 
Branko Čibej | Director of Subversion
WANdisco // Non-Stop Data
e. brane@wandisco.com

Re: RFE: API for an efficient retrieval of server-side mergeinfo data

Posted by Branko Čibej <br...@wandisco.com>.
On 21.02.2014 15:50, Doug Robinson wrote:
> Julian:
>
> Given the required RA protocol changes, when could this change be
> shipped?  What version of SVN?

We treat a protocol extension the same way as an API extension: new
protocol-level features can only appear in minor version releases (e.g.,
1.9.0 or 1.10.0), and they must be implemented in such a way that they
do not affect older clients.

-- Brane



Re: RFE: API for an efficient retrieval of server-side mergeinfo data

Posted by Julian Foad <ju...@btopenworld.com>.
Doug Robinson wrote:

> Julian:
> 
> Given the required RA protocol changes, when could this change be
> shipped?  What version of SVN?


Hi Doug. A change like that could be shipped in a 1.x.0 version.

- Julian


> Julian Foad wrote:
>> Marc Strapetz wrote:
>>> Julian Foad wrote:
>>>> It looks like we have an agreement in principle. Would you like to file an
>>>> enhancement issue?
>>>
>>> Great. I've filed an issue now:
>>>
>>> http://subversion.tigris.org/issues/show_bug.cgi?id=4469

[...]

>> I talked with Brane about this and we discussed how it might make more
>> sense to do a higher level API. [...]

Re: RFE: API for an efficient retrieval of server-side mergeinfo data

Posted by Doug Robinson <do...@wandisco.com>.
Julian:

Given the required RA protocol changes, when could this change be shipped?
What version of SVN?

Thank you.

Doug


On Wed, Feb 19, 2014 at 10:06 AM, Julian Foad <ju...@btopenworld.com> wrote:

> Marc Strapetz wrote:
> > Julian Foad wrote:
> >> It looks like we have an agreement in principle. Would you like to file an
> >> enhancement issue?
> >
> > Great. I've filed an issue now:
> >
> > http://subversion.tigris.org/issues/show_bug.cgi?id=4469
> >
> > Would you please review the various attributes (Subcomponent, ...)?
>
> That's great, thanks. I added a reference to this email thread, added
> myself to the CC list, and tweaked the type from 'feature' to 'enhancement'
> (just my personal interpretation) and schedule from '---' to 'unscheduled'
> (which just indicates I've thought about it and am stating that it's not
> currently tied to any particular release, it doesn't mean it has to stay
> that way).
>
> I talked with Brane about this and we discussed how it might make more
> sense to do a higher level API. Instead of asking "what is the absolute
> difference in the mergeinfo representations?" it could ask "What merges and
> other interesting events have occurred in the lifetime of this path?".
> There are a couple of reasons.
>
> The API as sketched so far is pretty straightforward, but even so the
> effort needed to implement it is not trivial. It requires RA protocol
> changes as well as all the layers of API change. The mergeinfo
> representation is subject to change. It feels like a backward step to
> invest effort in adding more support that is tied specifically to the
> current format.
>
> SmartSVN and other front ends like to be able to draw a merge graph. Even
> the 'svn mergeinfo' command-line command now draws a little ASCII-art graph
> showing limited information about the most recent merge. At present they
> all have to interpret mergeinfo themselves, at a pretty low level, and the
> interpretation is subtle and poorly understood. (I don't understand the
> edge cases related to adds and deletes properly, and I've been working with
> it for years.)
>
> So it seems like a good idea to encapsulate the interpretation of
> mergeinfo a bit more, and expose data in a form that is geared specifically
> towards explaining the history in the way that users can understand it.
> Maybe think of it as an extended 'log' operation, adding a small number of
> new notification types such as:
>
>   * there is a full merge into here, bringing in all the new changes
>       from PATH up to REV;
>   * there is a partial merge to here, bringing in some changes
>       from PATH between REV1 and REV2;
>
> What do you think of that sort of interface? Does your code already
> calculate something like that?
>
> - Julian
>
>


-- 
Douglas B. Robinson | *Senior Product Manager*

WANdisco // *Non-Stop Data*

t. 925-396-1125
e. doug.robinson@wandisco.com



Re: RFE: API for an efficient retrieval of server-side mergeinfo data

Posted by Marc Strapetz <ma...@syntevo.com>.
On 19.02.2014 16:06, Julian Foad wrote:
> Marc Strapetz wrote:
>> Julian Foad wrote:
>>> It looks like we have an agreement in principle. Would you like
>>> to file an enhancement issue?
>> 
>> Great. I've filed an issue now:
>> 
>> http://subversion.tigris.org/issues/show_bug.cgi?id=4469
>> 
>> Would you please review the various attributes (Subcomponent,
>> ...)?
> 
> [...]
> 
> SmartSVN and other front ends like to be able to draw a merge graph.
> Even the 'svn mergeinfo' command-line command now draws a little
> ASCII-art graph showing limited information about the most recent
> merge. At present they all have to interpret mergeinfo themselves, at
> a pretty low level, and the interpretation is subtle and poorly
> understood. (I don't understand the edge cases related to adds and
> deletes properly, and I've been working with it for years.)
> So it seems like a good idea to encapsulate the interpretation of
> mergeinfo a bit more, and expose data in a form that is geared
> specifically towards explaining the history in the way that users can
> understand it. Maybe think of it as an extended 'log' operation,
> adding a small number of new notification types such as:
> 
> * there is a full merge into here, bringing in all the new changes
>   from PATH up to REV;
> * there is a partial merge to here, bringing in some changes
>   from PATH between REV1 and REV2;
> 
> What do you think of that sort of interface?

That definitely sounds good. One note: the extended log information
should be easy to retrieve and cache for the entire repository, and it
must be rich enough that information for a specific path can easily be
extracted.

Examples:

- allow including/excluding subtree merges for merge arrows

- allow merge arrow display for sub-directories and individual files

Ultimately, once the extended log information for all revisions has
been received, one should be able to recreate raw svn:mergeinfo for all
paths of all revisions. I think this will guarantee that we won't miss
any possible use case when defining the protocol and data structures.
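
A minimal sketch of that round-trip property, assuming the
extended log information boils down to per-revision added/removed
mergeinfo for each target path, and using a simplified model (source
path mapped to merged revision numbers). All names here are
hypothetical; nothing like this exists in JavaHL today.

```java
import java.util.Collections;
import java.util.Map;
import java.util.Set;
import java.util.SortedSet;
import java.util.TreeMap;
import java.util.TreeSet;

// Replays mergeinfo diffs in ascending revision order to rebuild the
// mergeinfo of each target path. Simplified model, not JavaHL's Mergeinfo.
final class MergeinfoReplay {
    // target path -> (merge source path -> merged revisions)
    private final Map<String, Map<String, SortedSet<Long>>> state = new TreeMap<>();

    /** Applies one revision's diff for a single target path. */
    void apply(String targetPath,
               Map<String, Set<Long>> added,
               Map<String, Set<Long>> removed) {
        Map<String, SortedSet<Long>> info =
                state.computeIfAbsent(targetPath, k -> new TreeMap<>());
        for (Map.Entry<String, Set<Long>> e : added.entrySet()) {
            info.computeIfAbsent(e.getKey(), k -> new TreeSet<>()).addAll(e.getValue());
        }
        for (Map.Entry<String, Set<Long>> e : removed.entrySet()) {
            SortedSet<Long> revs = info.get(e.getKey());
            if (revs == null) {
                continue;
            }
            revs.removeAll(e.getValue());
            if (revs.isEmpty()) {
                info.remove(e.getKey());
            }
        }
        if (info.isEmpty()) {
            state.remove(targetPath);
        }
    }

    /** Returns the reconstructed mergeinfo of a path after the diffs applied so far. */
    Map<String, SortedSet<Long>> mergeinfoOf(String targetPath) {
        Map<String, SortedSet<Long>> info = state.get(targetPath);
        if (info == null) {
            return Collections.emptyMap();
        }
        return Collections.unmodifiableMap(info);
    }
}
```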

> Does your code already calculate something like that?

Yes, and I recall having a hard time when writing this code :)

-Marc

Re: RFE: API for an efficient retrieval of server-side mergeinfo data

Posted by Julian Foad <ju...@btopenworld.com>.
Marc Strapetz wrote:
> Julian Foad wrote:
>> It looks like we have an agreement in principle. Would you like to file an 
>> enhancement issue?
> 
> Great. I've filed an issue now:
> 
> http://subversion.tigris.org/issues/show_bug.cgi?id=4469
> 
> Would you please review the various attributes (Subcomponent, ...)?

That's great, thanks. I added a reference to this email thread, added myself to the CC list, and tweaked the type from 'feature' to 'enhancement' (just my personal interpretation) and the schedule from '---' to 'unscheduled' (which just indicates I've thought about it and am stating that it's not currently tied to any particular release; it doesn't mean it has to stay that way).

I talked with Brane about this and we discussed how it might make more sense to do a higher level API. Instead of asking "what is the absolute difference in the mergeinfo representations?" it could ask "What merges and other interesting events have occurred in the lifetime of this path?". There are a couple of reasons.

The API as sketched so far is pretty straightforward, but even so the effort needed to implement it is not trivial. It requires RA protocol changes as well as all the layers of API change. The mergeinfo representation is subject to change. It feels like a backward step to invest effort in adding more support that is tied specifically to the current format.

SmartSVN and other front ends like to be able to draw a merge graph. Even the 'svn mergeinfo' command-line command now draws a little ASCII-art graph showing limited information about the most recent merge. At present they all have to interpret mergeinfo themselves, at a pretty low level, and the interpretation is subtle and poorly understood. (I don't understand the edge cases related to adds and deletes properly, and I've been working with it for years.)

So it seems like a good idea to encapsulate the interpretation of mergeinfo a bit more, and expose data in a form that is geared specifically towards explaining the history in the way that users can understand it. Maybe think of it as an extended 'log' operation, adding a small number of new notification types such as:

  * there is a full merge into here, bringing in all the new changes
      from PATH up to REV;
  * there is a partial merge to here, bringing in some changes
      from PATH between REV1 and REV2;
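
As a rough illustration only, the two notification types above might be
modelled along these lines. The class, field, and method names are
hypothetical and not part of any existing Subversion or JavaHL API.

```java
// Hypothetical event type for an "extended log" style interface.
final class MergeEvent {
    enum Kind { FULL, PARTIAL }

    final Kind kind;
    final String sourcePath;
    final long rev1; // first revision; only meaningful for PARTIAL
    final long rev2; // REV for a full merge, REV2 for a partial merge

    private MergeEvent(Kind kind, String sourcePath, long rev1, long rev2) {
        this.kind = kind;
        this.sourcePath = sourcePath;
        this.rev1 = rev1;
        this.rev2 = rev2;
    }

    /** A full merge bringing in all new changes from sourcePath up to rev. */
    static MergeEvent full(String sourcePath, long rev) {
        return new MergeEvent(Kind.FULL, sourcePath, rev, rev);
    }

    /** A partial merge bringing in some changes from sourcePath between rev1 and rev2. */
    static MergeEvent partial(String sourcePath, long rev1, long rev2) {
        if (rev1 > rev2) {
            throw new IllegalArgumentException("rev1 must not exceed rev2");
        }
        return new MergeEvent(Kind.PARTIAL, sourcePath, rev1, rev2);
    }

    String describe() {
        if (kind == Kind.FULL) {
            return "full merge, bringing in all new changes from "
                    + sourcePath + " up to r" + rev2;
        }
        return "partial merge, bringing in some changes from "
                + sourcePath + " between r" + rev1 + " and r" + rev2;
    }
}
```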

What do you think of that sort of interface? Does your code already calculate something like that?

- Julian


Re: RFE: API for an efficient retrieval of server-side mergeinfo data

Posted by Marc Strapetz <ma...@syntevo.com>.
On 18.02.2014 15:26, Julian Foad wrote:
> Marc Strapetz wrote:
>> On 17.02.2014 18:36, Julian Foad wrote:
>>>   Marc Strapetz wrote:
>>>>   Hence an API like the following should work well for us:
>>>>
>>>>   interface MergeinfoDiffCallback {
>>>>    void mergeinfoDiff(int revision,
>>>>                       Map<String, Mergeinfo> pathToAddedMergeinfo,
>>>>                       Map<String, Mergeinfo> pathToRemovedMergeinfo);
>>>>   }
>>>>
>>>>   void getMergeinfoDiff(String rootPath,
>>>>                        long fromRev, long toRev,
>>>>                        MergeinfoDiffCallback callback)
>>>>                        throws ClientException;
>>>>
>>>>   This should give us all mergeinfo which affects any path at or below
>>>>   rootPath.
> [...]
>>> let's use the simpler version that's sufficient for your use case.
>>
>> That will be fine.
> [...]
>> From cache perspective it's easier to build the cache starting at r0:
>> [...] Anyway, I agree that receiving mergeinfo for more recent
>> revisions first is reasonable as well. Hence if you say the effort is
>> the same, then we could allow both: fromRev <= toRev, in which case we
>> will received mergeinfo in ascending order and fromRev > toRev in which
>> case it will be descending order?
> 
> Could do. It seems like a relatively minor decision.
> 
>>>> [...] important that ranges for which no mergeinfo diff is present
>>>>   will be processed quickly on the server-side, otherwise we could run
>>>>   into some kind of endless loop, if the cache building process is
>>>>   shutdown and resumed frequently.
>>>
>>>   [...] There is a client-side work-around: request ranges of say a thousand
>>> revisions at a time, and then you can easily keep track of how many of these
>>> requests have been completed.
>>
>> OK, that will work.
> 
> It looks like we have an agreement in principle. Would you like to file an enhancement issue?

Great. I've filed an issue now:

http://subversion.tigris.org/issues/show_bug.cgi?id=4469

Would you please review the various attributes (Subcomponent, ...)?

-Marc



Re: RFE: API for an efficient retrieval of server-side mergeinfo data

Posted by Julian Foad <ju...@btopenworld.com>.
Marc Strapetz wrote:
> On 17.02.2014 18:36, Julian Foad wrote:
>>  Marc Strapetz wrote:
>>>  Hence an API like the following should work well for us:
>>> 
>>>  interface MergeinfoDiffCallback {
>>>    void mergeinfoDiff(int revision,
>>>                       Map<String, Mergeinfo> pathToAddedMergeinfo,
>>>                       Map<String, Mergeinfo> pathToRemovedMergeinfo);
>>>  }
>>> 
>>>  void getMergeinfoDiff(String rootPath,
>>>                        long fromRev, long toRev,
>>>                        MergeinfoDiffCallback callback)
>>>                        throws ClientException;
>>> 
>>>  This should give us all mergeinfo which affects any path at or below
>>>  rootPath.
[...]
>> let's use the simpler version that's sufficient for your use case.
> 
> That will be fine.
[...]
> From cache perspective it's easier to build the cache starting at r0:
> [...] Anyway, I agree that receiving mergeinfo for more recent
> revisions first is reasonable as well. Hence if you say the effort is
> the same, then we could allow both: fromRev <= toRev, in which case we
> will received mergeinfo in ascending order and fromRev > toRev in which
> case it will be descending order?

Could do. It seems like a relatively minor decision.

>>> [...] important that ranges for which no mergeinfo diff is present
>>>  will be processed quickly on the server-side, otherwise we could run
>>>  into some kind of endless loop, if the cache building process is
>>>  shutdown and resumed frequently.
>> 
>>  [...] There is a client-side work-around: request ranges of say a thousand
>> revisions at a time, and then you can easily keep track of how many of these
>> requests have been completed.
> 
> OK, that will work.

It looks like we have an agreement in principle. Would you like to file an enhancement issue?

http://subversion.tigris.org/issues/

When you are logged in, that page includes links for filing a new issue. Please note that filing an issue doesn't affect whether or when the work will be done, but it's useful as a central place to refer to the task.

Do you have the resources to work on implementing this or are you looking for a volunteer?

- Julian

Re: RFE: API for an efficient retrieval of server-side mergeinfo data

Posted by Marc Strapetz <ma...@syntevo.com>.
On 17.02.2014 18:36, Julian Foad wrote:
> Marc Strapetz wrote:
> 
>>> ... I'll dig into the cache code ...
>>
>> I did that now and the storage is quite simple: we have a main file
>> which contains the diff (added, removed) for every path in every
>> revision and a revision-based index file with constant record length (to
>> quickly locate entries in the main file).
>>
>> This storage allows to efficiently query for the mergeinfo diff for a
>> path in a certain revision. That's sufficient to build the merge arrows.
>> Assembling the complete mergeinfo for a certain revision is hard with
>> this cache, but actually not necessary for our use case.
>>
>> Hence an API like the following should work well for us:
>>
>> interface MergeinfoDiffCallback {
>>   void mergeinfoDiff(int revision,
>>                      Map<String, Mergeinfo> pathToAddedMergeinfo,
>>                      Map<String, Mergeinfo> pathToRemovedMergeinfo);
>> }
>>
>> void getMergeinfoDiff(String rootPath,
>>                       long fromRev, long toRev,
>>                       MergeinfoDiffCallback callback)
>>                       throws ClientException;
>>
>> This should give us all mergeinfo which affects any path at or below
>> rootPath.
>>
>> When disregarding our particular use case, a more consistent API could be:
>>
>> void getMergeinfoDiff(Iterable<String> paths,
>>                       long fromRev, long toRev,
>>                       Mergeinfo.Inheritance inherit,
>>                       boolean includeDescendants,    
>>                       MergeinfoDiffCallback callback)
>>                       throws ClientException;
> 
> I want to discourage callers from knowing or caring how the mergeinfo is stored, so I want to leave out the 'inherit' parameter.
> 
> I also think it makes sense not to offer the options of ignoring descendants (that is, subtree mergeinfo), or specifying multiple paths. After all, this is not a low level API to be used for implementing the mergeinfo subsystem, it's a high level query.
> 
> So let's use the simpler version that's sufficient for your use case.

That will be fine.

>> The mergeinfo diff should be received starting at fromRev and ending at
>> toRev. No callback is expected if there is no mergeinfo diff for a
>> certain revision. Depending on the server-side storage, we may require
>> to always have fromRev >= toRev or always fromRev <= toRev. If it
>> doesn't matter, better have always fromRev <= toRev (for reasons given
>> below).
> 
> The same procedure could work either forwards or backwards, it doesn't really matter as long as you know which way it is going. Often it is useful to know about the more recent changes first, and have the option to look back right to revision 0 if necessary.

From a cache perspective it's easier to build the cache starting at r0:
the cache files will then contain information for older revisions at
lower positions. This allows us to truncate the files easily at a certain
revision and rebuild them from there. That's something we do if a log
message is modified from within the GUI (it might not play a role for
mergeinfo, though). Anyway, I agree that receiving mergeinfo for more
recent revisions first is reasonable as well. Hence, if you say the
effort is the same, we could allow both: fromRev <= toRev, in which case
we will receive mergeinfo in ascending order, and fromRev > toRev, in
which case it will be in descending order?
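
A small sketch of that convention, assuming the direction is derived
purely from the order of the two arguments. The class and method names
are hypothetical; the real server-side walk would of course not
materialise every revision number like this.

```java
import java.util.stream.LongStream;

final class RevisionOrder {
    /**
     * Returns the revisions of the inclusive range in the order in which
     * callbacks would be delivered: ascending if fromRev <= toRev,
     * descending otherwise.
     */
    static long[] revisionsInDeliveryOrder(long fromRev, long toRev) {
        if (fromRev <= toRev) {
            return LongStream.rangeClosed(fromRev, toRev).toArray();
        }
        // Walk toRev..fromRev and mirror each value to get fromRev down to toRev.
        return LongStream.rangeClosed(toRev, fromRev)
                .map(r -> fromRev - (r - toRev))
                .toArray();
    }
}
```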

>> Regarding the usage, let's assume always fromRev <= toRev, then we will
>> invoke
>>
>> getMergeinfoDiff(cacheRoot, 0, head, callback)
>>
>> This should start returning mergeinfo diff immediately, starting at
>> revision 0, so we quickly make at least a bit of progress. Now, if the
>> cache building process is shutdown and restarted later, it will resume
>> with the latest known revision:
>>
>> getMergeinfoDiff(cacheRoot, latestKnownRevision, head, callback)
>>
>> This procedure will be performed until we have caught up with head.
>> Note, that the latestKnownRevision is the last revision for which we
>> have received a callback. Depending on the server-side storage, this may
>> be different from the current revision which the server is currently
>> processing at the time the cache building process is shutdown. Hence it
>> will be important that ranges for which no mergeinfo diff is present
>> will be processed quickly on the server-side, otherwise we could run
>> into some kind of endless loop, if the cache building process is
>> shutdown and resumed frequently.
> 
> Yes -- if the server takes a long time to work its way through a large range of (say a million) revisions where there are no mergeinfo changes, there is no graceful way to stop the procedure part way through, and no way to discover how far it has searched when you kill it. Maybe that is not important. There is a client-side work-around: request ranges of say a thousand revisions at a time, and then you can easily keep track of how many of these requests have been completed.

OK, that will work.
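
For the record, a minimal sketch of how we would apply that work-around
on the client side. RangeFetcher is a hypothetical stand-in for the
proposed getMergeinfoDiff call (which does not exist yet), and the
chunk size of a thousand is just the figure mentioned above.

```java
import java.util.ArrayList;
import java.util.List;

final class ChunkedRanges {
    /** Hypothetical stand-in for one getMergeinfoDiff(root, fromRev, toRev, callback) request. */
    interface RangeFetcher {
        void fetch(long fromRev, long toRev);
    }

    /** Splits the inclusive range [fromRev, toRev] into chunks of at most chunkSize revisions. */
    static List<long[]> split(long fromRev, long toRev, long chunkSize) {
        if (chunkSize <= 0) {
            throw new IllegalArgumentException("chunkSize must be positive");
        }
        List<long[]> chunks = new ArrayList<>();
        for (long start = fromRev; start <= toRev; start += chunkSize) {
            long end = Math.min(start + chunkSize - 1, toRev);
            chunks.add(new long[] {start, end});
        }
        return chunks;
    }

    /**
     * Fetches chunk by chunk and returns the last revision known to be fully
     * processed. A real client would persist that value after every chunk,
     * so a restart resumes at lastCompleted + 1 even if the chunk produced
     * no callbacks.
     */
    static long fetchInChunks(RangeFetcher fetcher, long resumeFrom, long head, long chunkSize) {
        long lastCompleted = resumeFrom - 1;
        for (long[] chunk : split(resumeFrom, head, chunkSize)) {
            fetcher.fetch(chunk[0], chunk[1]);
            lastCompleted = chunk[1];
        }
        return lastCompleted;
    }
}
```

This way progress is tracked per completed request rather than per
received callback, so long stretches without mergeinfo changes no
longer cause repeated work after a restart.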

-Marc

Re: RFE: API for an efficient retrieval of server-side mergeinfo data

Posted by Branko Čibej <br...@wandisco.com>.
On 17.02.2014 22:25, Julian Foad wrote:
> I took a stab at writing the JavaHL boiler-plate code for this, as attached, though I'm unfamiliar with JavaHL. It seems to require modifying 5 java files and creating 3 new ones. Is that right, JavaHL experts? It seems a lot.

It's about right. Welcome to Java and JNI.

If this were a real attempt, we'd want to use the new jniwrapper for the
native code; see, for example, NativeStream.hpp/.cpp.

-- Brane


Re: RFE: API for an efficient retrieval of server-side mergeinfo data

Posted by Julian Foad <ju...@btopenworld.com>.
I took a stab at writing the JavaHL boiler-plate code for this, as attached, though I'm unfamiliar with JavaHL. It seems to require modifying 5 java files and creating 3 new ones. Is that right, JavaHL experts? It seems a lot.

The implementation in the core library is empty, as yet, in the attached patch.

- Julian


>>  interface MergeinfoDiffCallback {
>>    void mergeinfoDiff(int revision,
>>                       Map<String, Mergeinfo> pathToAddedMergeinfo,
>>                       Map<String, Mergeinfo> pathToRemovedMergeinfo);
>>  }
>> 
>>  void getMergeinfoDiff(String rootPath,
>>                        long fromRev, long toRev,
>>                        MergeinfoDiffCallback callback)
>>                        throws ClientException;

Re: RFE: API for an efficient retrieval of server-side mergeinfo data

Posted by Julian Foad <ju...@btopenworld.com>.
Marc Strapetz wrote:

>> ... I'll dig into the cache code ...
> 
> I did that now and the storage is quite simple: we have a main file
> which contains the diff (added, removed) for every path in every
> revision and a revision-based index file with constant record length (to
> quickly locate entries in the main file).
> 
> This storage allows to efficiently query for the mergeinfo diff for a
> path in a certain revision. That's sufficient to build the merge arrows.
> Assembling the complete mergeinfo for a certain revision is hard with
> this cache, but actually not necessary for our use case.
> 
> Hence an API like the following should work well for us:
> 
> interface MergeinfoDiffCallback {
>   void mergeinfoDiff(int revision,
>                      Map<String, Mergeinfo> pathToAddedMergeinfo,
>                      Map<String, Mergeinfo> pathToRemovedMergeinfo);
> }
> 
> void getMergeinfoDiff(String rootPath,
>                       long fromRev, long toRev,
>                       MergeinfoDiffCallback callback)
>                       throws ClientException;
> 
> This should give us all mergeinfo which affects any path at or below
> rootPath.
> 
> When disregarding our particular use case, a more consistent API could be:
> 
> void getMergeinfoDiff(Iterable<String> paths,
>                       long fromRev, long toRev,
>                       Mergeinfo.Inheritance inherit,
>                       boolean includeDescendants,    
>                       MergeinfoDiffCallback callback)
>                       throws ClientException;

I want to discourage callers from knowing or caring how the mergeinfo is stored, so I want to leave out the 'inherit' parameter.

I also think it makes sense not to offer the options of ignoring descendants (that is, subtree mergeinfo), or specifying multiple paths. After all, this is not a low level API to be used for implementing the mergeinfo subsystem, it's a high level query.

So let's use the simpler version that's sufficient for your use case.


> The mergeinfo diff should be received starting at fromRev and ending at
> toRev. No callback is expected if there is no mergeinfo diff for a
> certain revision. Depending on the server-side storage, we may have to
> require either fromRev >= toRev always or fromRev <= toRev always. If it
> doesn't matter, it is better to always have fromRev <= toRev (for reasons
> given below).

The same procedure could work either forwards or backwards; it doesn't really matter as long as you know which way it is going. Often it is useful to know about the more recent changes first, and have the option to look back right to revision 0 if necessary.

> Regarding the usage, let's assume always fromRev <= toRev, then we will
> invoke
> 
> getMergeinfoDiff(cacheRoot, 0, head, callback)
> 
> This should start returning mergeinfo diff immediately, starting at
> revision 0, so we quickly make at least a bit of progress. Now, if the
> cache building process is shut down and restarted later, it will resume
> with the latest known revision:
> 
> getMergeinfoDiff(cacheRoot, latestKnownRevision, head, callback)
> 
> This procedure will be performed until we have caught up with head.
> Note that latestKnownRevision is the last revision for which we
> have received a callback. Depending on the server-side storage, this may
> be different from the revision which the server is
> processing at the time the cache building process is shut down. Hence it
> will be important that ranges for which no mergeinfo diff is present
> are processed quickly on the server side, otherwise we could run
> into some kind of endless loop if the cache building process is
> shut down and resumed frequently.

Yes -- if the server takes a long time to work its way through a large range of (say a million) revisions where there are no mergeinfo changes, there is no graceful way to stop the procedure part way through, and no way to discover how far it has searched when you kill it. Maybe that is not important. There is a client-side work-around: request ranges of say a thousand revisions at a time, and then you can easily keep track of how many of these requests have been completed.
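
The client-side work-around mentioned above could be sketched roughly as follows. This is only an illustration: RangeFetcher stands in for the proposed getMergeinfoDiff() call (which does not exist yet), ProgressStore is a hypothetical hook for persisting progress, and the chunk size is arbitrary.

```java
public class ChunkedMergeinfoFetch {
    /** Hypothetical stand-in for the proposed getMergeinfoDiff(); not an existing JavaHL API. */
    public interface RangeFetcher {
        void fetch(long fromRev, long toRev);
    }

    /** Hypothetical hook that persists how far the client has got. */
    public interface ProgressStore {
        void save(long lastCompletedRev);
    }

    /**
     * Requests [fromRev, toRev] in chunks of at most chunkSize revisions
     * (assumes fromRev <= toRev). Progress is saved after every completed
     * chunk, so a restart loses at most one chunk of work even if no
     * callback was ever received for that range.
     */
    public static long fetchInChunks(long fromRev, long toRev, long chunkSize,
                                     RangeFetcher fetcher, ProgressStore progress) {
        if (chunkSize <= 0) {
            throw new IllegalArgumentException("chunkSize must be positive");
        }
        long lastCompleted = fromRev - 1;
        for (long start = fromRev; start <= toRev; start += chunkSize) {
            long end = Math.min(start + chunkSize - 1, toRev);
            fetcher.fetch(start, end);   // one bounded request to the server
            lastCompleted = end;
            progress.save(lastCompleted);
        }
        return lastCompleted;
    }
}
```

With this approach the resume point is the last completed chunk rather than the last revision that happened to carry a mergeinfo change.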

OK, that sounds good enough.

- Julian

Re: RFE: API for an efficient retrieval of server-side mergeinfo data

Posted by Marc Strapetz <ma...@syntevo.com>.
On 14.02.2014 14:18, Marc Strapetz wrote:
>>> Can we think of a better way to design the API so that it returns the 
>>> interesting data without all the redundancy? Basically I think we want to 
>>> describe changes to mergeinfo, rather than raw mergeinfo.
>>
>> Marc,
>>
>> Perhaps a better way to ask the question is: Can I encourage you to write the API that you want? You already designed a cache for the data. What is the shape of the data in your cache, and can the API get the data you want in the form you want it, directly? We'd be glad to help implement it. Even if you start with an API which simply iterates over a range of revisions, at least that would allow for the possibility of improving the efficiency internally at various layers.
> 
> Looks like our emails have crossed :) I'll dig into the cache code and
> will try to come back with a more detailed API suggestion soon.

I did that now and the storage is quite simple: we have a main file
which contains the diff (added, removed) for every path in every
revision and a revision-based index file with constant record length (to
quickly locate entries in the main file).

This storage allows us to efficiently query for the mergeinfo diff for a
path in a certain revision. That's sufficient to build the merge arrows.
Assembling the complete mergeinfo for a certain revision is hard with
this cache, but actually not necessary for our use case.
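
To illustrate the idea of a revision-based index with constant record length, here is a minimal in-memory sketch. The record layout (an 8-byte offset plus a 4-byte length per revision) and the textual payload format are assumptions made purely for this example; they are not the actual SmartSVN cache format.

```java
import java.io.ByteArrayOutputStream;
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

public class RevisionIndexSketch {
    /** Assumed record layout: long offset into the main data + int length. */
    static final int RECORD_SIZE = Long.BYTES + Integer.BYTES;

    private final ByteBuffer index;  // one fixed-size record per revision
    private final byte[] main;       // concatenated diff entries

    RevisionIndexSketch(byte[] indexBytes, byte[] mainBytes) {
        this.index = ByteBuffer.wrap(indexBytes);
        this.main = mainBytes;
    }

    /** Builds index and main data; a null entry means "no mergeinfo diff in this revision". */
    public static RevisionIndexSketch build(String[] entriesByRevision) {
        ByteBuffer idx = ByteBuffer.allocate(entriesByRevision.length * RECORD_SIZE);
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        for (String entry : entriesByRevision) {
            byte[] bytes = entry == null ? new byte[0] : entry.getBytes(StandardCharsets.UTF_8);
            idx.putLong(out.size());   // where this revision's entry starts
            idx.putInt(bytes.length);  // zero length marks "no diff"
            out.write(bytes, 0, bytes.length);
        }
        return new RevisionIndexSketch(idx.array(), out.toByteArray());
    }

    /** Constant-time lookup: the record for revision r sits at r * RECORD_SIZE. */
    public String entryFor(long revision) {
        if (revision < 0) {
            return null;
        }
        long pos = revision * RECORD_SIZE;
        if (pos + RECORD_SIZE > index.capacity()) {
            return null;  // revision not cached yet
        }
        long offset = index.getLong((int) pos);
        int length = index.getInt((int) pos + Long.BYTES);
        if (length == 0) {
            return null;
        }
        return new String(main, (int) offset, length, StandardCharsets.UTF_8);
    }
}
```

Because every record has the same size, locating a revision needs no search, which matches the "quickly locate entries in the main file" property described above.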

Hence an API like the following should work well for us:

interface MergeinfoDiffCallback {
  void mergeinfoDiff(int revision,
                     Map<String, Mergeinfo> pathToAddedMergeinfo,
                     Map<String, Mergeinfo> pathToRemovedMergeinfo);
}

void getMergeinfoDiff(String rootPath,
                      long fromRev, long toRev,
                      MergeinfoDiffCallback callback)
                      throws ClientException;

This should give us all mergeinfo which affects any path at or below
rootPath.

When disregarding our particular use case, a more consistent API could be:

void getMergeinfoDiff(Iterable<String> paths,
                      long fromRev, long toRev,
                      Mergeinfo.Inheritance inherit,
                      boolean includeDescendants,
                      MergeinfoDiffCallback callback)
                      throws ClientException;

The mergeinfo diff should be received starting at fromRev and ending at
toRev. No callback is expected if there is no mergeinfo diff for a
certain revision. Depending on the server-side storage, we may have to
require either fromRev >= toRev always or fromRev <= toRev always. If it
doesn't matter, it is better to always have fromRev <= toRev (for reasons
given below).

Regarding the usage, let's assume always fromRev <= toRev, then we will
invoke

getMergeinfoDiff(cacheRoot, 0, head, callback)

This should start returning mergeinfo diff immediately, starting at
revision 0, so we quickly make at least a bit of progress. Now, if the
cache building process is shut down and restarted later, it will resume
with the latest known revision:

getMergeinfoDiff(cacheRoot, latestKnownRevision, head, callback)

This procedure will be performed until we have caught up with head.
Note that latestKnownRevision is the last revision for which we
have received a callback. Depending on the server-side storage, this may
be different from the revision which the server is
processing at the time the cache building process is shut down. Hence it
will be important that ranges for which no mergeinfo diff is present
are processed quickly on the server side, otherwise we could run
into some kind of endless loop if the cache building process is
shut down and resumed frequently.
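
The resume procedure could be sketched like this. It is a simplified illustration only: DiffCallback reduces the proposed MergeinfoDiffCallback to a single String payload, DiffSource is a hypothetical stand-in for the proposed getMergeinfoDiff(), and the sketch assumes that the requested range is inclusive at both ends.

```java
import java.util.Map;
import java.util.TreeMap;

public class ResumableCacheSketch {
    /** Simplified stand-in for the proposed MergeinfoDiffCallback. */
    public interface DiffCallback {
        void mergeinfoDiff(long revision, String diff);
    }

    /** Hypothetical stand-in for the proposed getMergeinfoDiff(). */
    public interface DiffSource {
        void getMergeinfoDiff(long fromRev, long toRev, DiffCallback callback);
    }

    // In-memory replacement for the on-disk cache, keyed by revision.
    private final TreeMap<Long, String> cache = new TreeMap<>();

    /** Last revision for which a callback was received, or 0 for an empty cache. */
    public long latestKnownRevision() {
        return cache.isEmpty() ? 0 : cache.lastKey();
    }

    /**
     * Runs (or resumes) one catch-up pass towards head. Requesting
     * latestKnownRevision again is harmless here, because storing the
     * same revision twice simply overwrites the existing entry.
     */
    public void catchUp(DiffSource source, long head) {
        source.getMergeinfoDiff(latestKnownRevision(), head,
                (revision, diff) -> cache.put(revision, diff));
    }

    public Map<Long, String> cached() {
        return cache;
    }
}
```

Note that nothing is recorded for revisions without a mergeinfo diff, which is exactly why a long run of such revisions has to be restarted from the beginning after an interruption.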

-Marc

Re: RFE: API for an efficient retrieval of server-side mergeinfo data

Posted by Julian Foad <ju...@btopenworld.com>.
Marc Strapetz wrote:

>>> Can we think of a better way to design the API so that it returns the
>>> interesting data without all the redundancy? Basically I think we want
>>> to describe changes to mergeinfo, rather than raw mergeinfo.
>>
>> Marc,
>>
>> Perhaps a better way to ask the question is: Can I encourage you to write
>> the API that you want? You already designed a cache for the data. What is the
>> shape of the data in your cache, and can the API get the data you want in the
>> form you want it, directly? We'd be glad to help implement it. Even if you
>> start with an API which simply iterates over a range of revisions, at least
>> that would allow for the possibility of improving the efficiency internally
>> at various layers.
> 
> Looks like our emails have crossed :) I'll dig into the cache code and
> will try to come back with a more detailed API suggestion soon.

Excellent! Thanks.

- Julian

Re: RFE: API for an efficient retrieval of server-side mergeinfo data

Posted by Marc Strapetz <ma...@syntevo.com>.
>> Can we think of a better way to design the API so that it returns the 
>> interesting data without all the redundancy? Basically I think we want to 
>> describe changes to mergeinfo, rather than raw mergeinfo.
> 
> Marc,
> 
> Perhaps a better way to ask the question is: Can I encourage you to write the API that you want? You already designed a cache for the data. What is the shape of the data in your cache, and can the API get the data you want in the form you want it, directly? We'd be glad to help implement it. Even if you start with an API which simply iterates over a range of revisions, at least that would allow for the possibility of improving the efficiency internally at various layers.

Looks like our emails have crossed :) I'll dig into the cache code and
will try to come back with a more detailed API suggestion soon.

-Marc


On 14.02.2014 14:09, Julian Foad wrote:
> I (Julian Foad) wrote:
> 
>> Can we think of a better way to design the API so that it returns the 
>> interesting data without all the redundancy? Basically I think we want to 
>> describe changes to mergeinfo, rather than raw mergeinfo.
> 
> Marc,
> 
> Perhaps a better way to ask the question is: Can I encourage you to write the API that you want? You already designed a cache for the data. What is the shape of the data in your cache, and can the API get the data you want in the form you want it, directly? We'd be glad to help implement it. Even if you start with an API which simply iterates over a range of revisions, at least that would allow for the possibility of improving the efficiency internally at various layers.
> 
> - Julian
> 

Re: RFE: API for an efficient retrieval of server-side mergeinfo data

Posted by Julian Foad <ju...@btopenworld.com>.
I (Julian Foad) wrote:

> Can we think of a better way to design the API so that it returns the 
> interesting data without all the redundancy? Basically I think we want to 
> describe changes to mergeinfo, rather than raw mergeinfo.

Marc,

Perhaps a better way to ask the question is: Can I encourage you to write the API that you want? You already designed a cache for the data. What is the shape of the data in your cache, and can the API get the data you want in the form you want it, directly? We'd be glad to help implement it. Even if you start with an API which simply iterates over a range of revisions, at least that would allow for the possibility of improving the efficiency internally at various layers.

- Julian

Re: RFE: API for an efficient retrieval of server-side mergeinfo data

Posted by Julian Foad <ju...@btopenworld.com>.
Marc Strapetz wrote:
> For SmartSVN we are optionally displaying merge arrows in the Revision
> Graph. Here is a sample image, how this looks like:
> 
> http://imgur.com/MzrLq00
> 
> From the JavaHL sources I understand that there is currently only one
> method to retrieve server-side mergeinfo and this one works on a single
> revision only:
> 
> Map<String, Mergeinfo> getMergeinfo(Iterable<String> paths,
>                                     long revision,
>                                     Mergeinfo.Inheritance inherit,
>                                     boolean includeDescendants)

Right. This is a wrapper around the core library function svn_ra_get_mergeinfo().

> This makes the Merge Arrow feature practically unusable for larger graphs.
> 
> To improve performance, in earlier versions we were using a client-side
> mergeinfo cache (similar as the main log-cache, which TSVN is using as
> well). However, populating this cache (i.e. querying for mergeinfo for
> *every* revision of the repository) often resulted in bringing the
> entire Apache server down, especially if many users were building their
> log cache at the same time.
> 
> To address these problems, it would be great to have a more powerful
> API, which allows either to retrieve all mergeinfo for a *revision
> range* or for a *set of revisions*.

The request for a more powerful API certainly makes sense, but what form of API?

In the Subversion project source code:

  # How many lines/bytes of mergeinfo in trunk, right now?
  $ svn pg -R svn:mergeinfo | wc -lc
    245   24063

  # How many branches and tags?
  $ svn ls ^/subversion/tags/ ^/subversion/branches/ | wc -l
  288

  # Approx. total lines/bytes mergeinfo per revision (288 branches/tags + trunk = 289)?
  $ echo $((245 * 289)) $((24063 * 289))
  70805 6954207

So in each revision there are roughly 70,000 lines of mergeinfo, occupying 7 MB in plain text representation.

The mergeinfo properties change whenever a merge is done. All other commits leave all the mergeinfo unchanged. So mergeinfo is unchanged in, what, 99% of revisions?

It doesn't seem logical to simply request all the mergeinfo for each revision in turn, and return it all in raw form.

Can we think of a better way to design the API so that it returns the interesting data without all the redundancy? Basically I think we want to describe changes to mergeinfo, rather than raw mergeinfo.
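
As a rough illustration of "changes to mergeinfo, rather than raw mergeinfo", the following sketch computes what was added and removed between two snapshots. It assumes a deliberately simplified model in which mergeinfo is a map from merge source path to a set of merged revision numbers; real Subversion mergeinfo uses revision ranges and inheritance, which this example ignores.

```java
import java.util.Collections;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

public class MergeinfoDiffSketch {
    /** Revisions (per merge source path) that are in 'after' but not in 'before'. */
    public static Map<String, Set<Long>> added(Map<String, Set<Long>> before,
                                               Map<String, Set<Long>> after) {
        Map<String, Set<Long>> result = new TreeMap<>();
        for (Map.Entry<String, Set<Long>> entry : after.entrySet()) {
            Set<Long> delta = new TreeSet<>(entry.getValue());
            delta.removeAll(before.getOrDefault(entry.getKey(), Collections.emptySet()));
            if (!delta.isEmpty()) {
                result.put(entry.getKey(), delta);  // only report paths that changed
            }
        }
        return result;
    }

    /** Revisions (per merge source path) that were in 'before' but are gone in 'after'. */
    public static Map<String, Set<Long>> removed(Map<String, Set<Long>> before,
                                                 Map<String, Set<Long>> after) {
        return added(after, before);
    }
}
```

For a revision that does not touch svn:mergeinfo, both results are empty, so nothing would need to be transmitted for it.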

- Julian



> Querying a set of revisions would be more flexible and would allow to
> generate merge arrows on the fly. On the other hand, to alleviate the
> server, it's desirable to cache retrieved mergeinfo on the client-side
> anyway, hence a range query would be fine as well.
> 
> -Marc
>