You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@subversion.apache.org by Greg Hudson <gh...@MIT.EDU> on 2004/05/25 19:56:31 UTC

Re: Speeding up blame (fwd)

On Tue, 2004-05-25 at 16:05, Peter N. Lundblad wrote:
> OK. This isn't my current prototype, but still we have this little
> handler. It will be used in a repository function as well as an RA
> function. So I'm not sure in which "namespace" it should live (svn_ra or
> svn_repos or just svn_, as the log receiver does). Also does it belong in
> svntypes.h. I don't think client code should depend on svn_repos.h and
> vice versa.

Sounds like svn_types.h and the svn_ namespace are appropriate.


---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@subversion.tigris.org
For additional commands, e-mail: dev-help@subversion.tigris.org

Re: Speeding up blame

Posted by "Peter N. Lundblad" <pe...@famlundblad.se>.
On Fri, 28 May 2004, Mark Benedetto King wrote:

> > > Right now, svn_client_blame() uses svn_diff_file_diff(),
> > > but there is also a generalized interface that could be applied
> > > to streams.  Using that interface, it should be possible to
> > > avoid constructing any of the fulltexts at all, since the
> > > stream of revision N+1 can be computed from the stream of
> > > revision N and the delta from N to N+1.
> > >
> > How do you apply this to a forward-only svn_stream_t? It needs to be able
> > to compare arbitrary tokens, it seems.
>
> Judging from the svn_diff_fns_t interface, the stream needs only to be
> restartable, not random-accessible.
>
What about token_compare? IN svn_diff_file.c, it seeks and reads the token
back if it isn't available in memory. Still, if it only needs to be
restartable, how do you rewind a svn_stream_t? Then, you need to keep the
whole contents.

> >
> >
> > In ra_local, we don't need deltification at all.  We just use
> > svn_fs_file_contents for each interesting revision.
> >
>
> I'm not sure what this will gain us; the deltas will still need
> to be combined somewhere.
>
If I understand the FS code correctly, to compute deltas between arbitrary
revisions, it gets a stream for each contents. Each of these streams will
combine deltas to produce fulltexts. Then, a new delta will be computed
from those fulltexts. (Someone will probably correct me her...)

So, when we use file://, we would:
- combine deltas to reproduce each fulltext
- compute a new delta of the fulltexts for consecutive revisions
- Take the previous fulltext (saved in a temporary file) and that delta
and calculate the next fulltext.

If we expose apr_file_ts in the RA function, we will instead:
- Combine deltas to produce each fulltext once, writing it to a temporary
file.

IN the network-base layers, we would still take the longer route, because
we optimize on network I/O, not CPU cycles.

So, this is so the ra_local will have less to do. If it is considered
ugly, I can expose deltas in the RA layer instead, letting the blame
command handle the temp files as it does today.

Regards,
//Peter

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@subversion.tigris.org
For additional commands, e-mail: dev-help@subversion.tigris.org

Re: Speeding up blame

Posted by Mark Benedetto King <mb...@lowlatency.com>.
On Fri, May 28, 2004 at 09:16:51PM +0200, Peter N. Lundblad wrote:
> > It would be neat if the fact that the WC adm area contains the
> > pristine BASE rev could be brought into play (perhaps the RA layer
> > would retrieve the delta from BASE->start rather than the fulltext
> > of start), but the wins in terms of network i/o are probably not
> > worth the additional implementation complexity.
> >
> I think this would be an optimization that is worth it when blaming a
> small revision range near BASE. But then you have little data transfered
> anyway... Don't think it is worth the complexity.
> 

I agree.  

> > One last blue sky idea:
> >
> > Right now, svn_client_blame() uses svn_diff_file_diff(),
> > but there is also a generalized interface that could be applied
> > to streams.  Using that interface, it should be possible to
> > avoid constructing any of the fulltexts at all, since the
> > stream of revision N+1 can be computed from the stream of
> > revision N and the delta from N to N+1.
> >
> How do you apply this to a forward-only svn_stream_t? It needs to be able
> to compare arbitrary tokens, it seems.

Judging from the svn_diff_fns_t interface, the stream needs only to be
restartable, not random-accessible.

We already have a delta combiner...

> 
> 
> In ra_local, we don't need deltification at all.  We just use
> svn_fs_file_contents for each interesting revision.
> 

I'm not sure what this will gain us; the deltas will still need
to be combined somewhere.

--ben


---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@subversion.tigris.org
For additional commands, e-mail: dev-help@subversion.tigris.org

Re: Speeding up blame

Posted by "Peter N. Lundblad" <pe...@famlundblad.se>.
On Thu, 27 May 2004, Mark Benedetto King wrote:

> On Thu, May 27, 2004 at 07:59:26PM +0200, Peter N. Lundblad wrote:
> > >
> > > I don't think that the RA implementation can reconstruct the fulltext
> > > from the deltas (at least not without some WC callbacks; if that's
> > > what you're proposing, then I think we're on roughly the same page).
> > >
> > You're right, ofcourse. It would have to keep track of the last fulltext.
>
> Well, not completely right.  The RA layer could get the fulltext of the
> start revision, and then deltas for the revisions thereafter.
>
What I was refering to was that the RA layer on the client needs to keep
track of the last fulltext in a temporary file for the previous revision
each time.

> It would be neat if the fact that the WC adm area contains the
> pristine BASE rev could be brought into play (perhaps the RA layer
> would retrieve the delta from BASE->start rather than the fulltext
> of start), but the wins in terms of network i/o are probably not
> worth the additional implementation complexity.
>
I think this would be an optimization that is worth it when blaming a
small revision range near BASE. But then you have little data transfered
anyway... Don't think it is worth the complexity.

> One last blue sky idea:
>
> Right now, svn_client_blame() uses svn_diff_file_diff(),
> but there is also a generalized interface that could be applied
> to streams.  Using that interface, it should be possible to
> avoid constructing any of the fulltexts at all, since the
> stream of revision N+1 can be computed from the stream of
> revision N and the delta from N to N+1.
>
How do you apply this to a forward-only svn_stream_t? It needs to be able
to compare arbitrary tokens, it seems.

Below is a little write-up of my current design plans for this little
improvement. Don't know if it is worth committing into notes temporarily.
Anyway, it is good to write ones thoughts down...



Blame Optimization Plan
=======================

This is a temporary document explaining the plans for optimizing the SVN
blame functionality.  It will be removed when this is implemented.

The Problem
-----------

Today, blame:
a) Transfers each relevant revision of a file in fulltext,
b) Needs a network round-trip for each revision of a file that is examined.

a) is a mostly a problem over slow links and b) is a problem over a WAN.

The Solution
------------

The proposed solution is to implement a new RA layer function:

svn_error_t *
get_file_revs(void *session_baton, const char *path, svn_revnum_t start,
  svn_revnum_t end, svn_ra_file_rev_handler_t handler, void *handler_baton,
  apr_pool_t *pool);
in the RA vtable.

This will call handler for a range of revisions of the file specified
by path and end.  It will start at the youngest revision at or before start,
where the file was changed if such a revision exists. Else, it will
use the first revision of the file in the repository.  It will then
call handler for each revision where the file was changed until end is
reached. handler will never be called for a revision larger then end.

NOTE: It is necessary to deliver a revision before or at start, so
that the blame implementation can differientiate betwwen changes made
in the first revision at or after start and the contents before
start.  The client can't just fetch the contents of the file in start
- 1 since that may be an unrelated object with the same path.  So we
choose between this and having to use a log call to determinte the
history before start.

svn_ra_file_rev_handler_t is a callback defined as follows:
typedef svn_error_t *
  (*svn_ra_file_rev_handler_t) (void *baton,
    const char *path, svn_revnum_t revision, const apr_hash_t *rev_props,
    apr_file_t *contents, const apr_hash_t *props,
    apr_pool_t *pool);

The file contents will be available in contents.  During a call to the
handler, contents from the last call will still be valid.  Also,
contents from the last call will live in the pool given to get_revs,
so it will be available after the call.  [Are these semantics too
weird for a general-purpose RA layer function?  Are they too blame-oriented?]

NOTE: The caller of get_file_revs will get fulltext rather than
deltas.  Deltas will, however, be used over network RA access
methods.  I don't see any need for giving the client text deltas
instead of fulltexts.  The blame command will recreate the files and
diff them anyway.  If a client needs deltas, it can easily create them
itself.

NOTE: We give the callback apr_file_ts instead of svn_streams so it
can read the files more than once.  The RA layer will need to create
temporary files anyway to be able to apply text deltas for the
following revisions.


Implementation
--------------

On the server side, get_file_revs will be implemented in terms of
svn_fs_node_history and a series of calls to
svn_fs_get_file_delta_stream .  The deltas will be sent over the wire
and turned into fulltexts on the client.  I've been considering a
svn_repos function for this, but since it will be duplicated in
ra_local anyway (see below), it might not be forth the little saving
in code.  [Is this a good design decision?]

In ra_local, we don't need deltification at all.  We just use
svn_fs_file_contents for each interesting revision.

On the client, and in ra_local, we will use open_tmp_file to ask the
client to create temporary files for us.

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@subversion.tigris.org
For additional commands, e-mail: dev-help@subversion.tigris.org

Re: Speeding up blame (fwd)

Posted by Mark Benedetto King <mb...@lowlatency.com>.
On Thu, May 27, 2004 at 07:59:26PM +0200, Peter N. Lundblad wrote:
> >
> > I don't think that the RA implementation can reconstruct the fulltext
> > from the deltas (at least not without some WC callbacks; if that's
> > what you're proposing, then I think we're on roughly the same page).
> >
> You're right, ofcourse. It would have to keep track of the last fulltext.
> What type of WC callbacks are you envisioning?
> 
> Regards,
> //Peter
> 

Well, not completely right.  The RA layer could get the fulltext of the
start revision, and then deltas for the revisions thereafter.

It would be neat if the fact that the WC adm area contains the
pristine BASE rev could be brought into play (perhaps the RA layer
would retrieve the delta from BASE->start rather than the fulltext
of start), but the wins in terms of network i/o are probably not
worth the additional implementation complexity.

AFATG, it would be neat if svn_client_cat() did the same thing.

One last blue sky idea:

Right now, svn_client_blame() uses svn_diff_file_diff(),
but there is also a generalized interface that could be applied
to streams.  Using that interface, it should be possible to
avoid constructing any of the fulltexts at all, since the
stream of revision N+1 can be computed from the stream of
revision N and the delta from N to N+1.


--ben


---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@subversion.tigris.org
For additional commands, e-mail: dev-help@subversion.tigris.org

Re: Speeding up blame (fwd)

Posted by "Peter N. Lundblad" <pe...@famlundblad.se>.
>
> I don't think that the RA implementation can reconstruct the fulltext
> from the deltas (at least not without some WC callbacks; if that's
> what you're proposing, then I think we're on roughly the same page).
>
You're right, ofcourse. It would have to keep track of the last fulltext.
What type of WC callbacks are you envisioning?

Regards,
//Peter

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@subversion.tigris.org
For additional commands, e-mail: dev-help@subversion.tigris.org

Re: Speeding up blame (fwd)

Posted by Mark Benedetto King <mb...@lowlatency.com>.
On Wed, May 26, 2004 at 07:14:44PM +0200, Peter N. Lundblad wrote:
> On Wed, 26 May 2004, Mark Benedetto King wrote:
> 
> > On Wed, May 26, 2004 at 09:29:43AM +0200, Peter N. Lundblad wrote:
> > > On Wed, 26 May 2004, Greg Hudson wrote:
> > >
> > > Another approach that springs to my mind is not to expose the deltas in
> > > the RA interface, so we would move my previous typedef to svn_repos
> > > (renaming it) and in svn_ra.h add
> > > typedef svn_error_t *(*svn_ra_file_rev_handler_t)
> > >     (void *baton,
> > >      const char *path,
> > >      svn_revnum_t revnum,
> > >      apr_hash_t *rev_props,
> > >      svn_stream_t **contents,
> > >      apr_hash_t *props,
> > >      apr_pool_t *pool);
> > >
> > > Then it could be implemented using the repository function for network RA
> > > implementations, but for ra_local, we don't need to calculate the deltas.
> > > I don't know if this is faster, but in any case, the client only cares
> > > about the whole file, so why give it a delta?
> > >
> >
> > If we're going through all the trouble of building a more efficient
> > blame interface, we might as go all the way and:
> >
> > 1.) save the server from the hassle of re-combining all the deltas
> >     to compute the fulltext of every rev.
> >
> Today, it has to recreate each fulltext, but does it have to do that in my
> proposed svn_repos interface (not that above, but in an earlier post),
> where we ask for deltas for each revision in increasing order? I'm not
> familira with the FS internals, so I don't know. Could you explain your
> above comment in more concrete terms? :-)
> 
> > 2.) eventually give the client an opportunity to do something
> >     sneaky w.r.t. deltas + svn_diff().  If we always send fulltexts,
> >     the change information is lost and must be recomputed.
> >
> As you see above, I propose to send deltas over the wire. That was the
> original motivation for me to start this. My question here
> was about libsvn_ra. Should the RA implementation (on the client side)
> reconstruct the fulltext, or will it be useful for libsvn_client to have
> the delta instead?
> 

I don't think that the RA implementation can reconstruct the fulltext
from the deltas (at least not without some WC callbacks; if that's
what you're proposing, then I think we're on roughly the same page).

--ben



---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@subversion.tigris.org
For additional commands, e-mail: dev-help@subversion.tigris.org

Re: Speeding up blame (fwd)

Posted by "Peter N. Lundblad" <pe...@famlundblad.se>.
On Wed, 26 May 2004, Mark Benedetto King wrote:

> On Wed, May 26, 2004 at 09:29:43AM +0200, Peter N. Lundblad wrote:
> > On Wed, 26 May 2004, Greg Hudson wrote:
> >
> > Another approach that springs to my mind is not to expose the deltas in
> > the RA interface, so we would move my previous typedef to svn_repos
> > (renaming it) and in svn_ra.h add
> > typedef svn_error_t *(*svn_ra_file_rev_handler_t)
> >     (void *baton,
> >      const char *path,
> >      svn_revnum_t revnum,
> >      apr_hash_t *rev_props,
> >      svn_stream_t **contents,
> >      apr_hash_t *props,
> >      apr_pool_t *pool);
> >
> > Then it could be implemented using the repository function for network RA
> > implementations, but for ra_local, we don't need to calculate the deltas.
> > I don't know if this is faster, but in any case, the client only cares
> > about the whole file, so why give it a delta?
> >
>
> If we're going through all the trouble of building a more efficient
> blame interface, we might as go all the way and:
>
> 1.) save the server from the hassle of re-combining all the deltas
>     to compute the fulltext of every rev.
>
Today, it has to recreate each fulltext, but does it have to do that in my
proposed svn_repos interface (not that above, but in an earlier post),
where we ask for deltas for each revision in increasing order? I'm not
familira with the FS internals, so I don't know. Could you explain your
above comment in more concrete terms? :-)

> 2.) eventually give the client an opportunity to do something
>     sneaky w.r.t. deltas + svn_diff().  If we always send fulltexts,
>     the change information is lost and must be recomputed.
>
As you see above, I propose to send deltas over the wire. That was the
original motivation for me to start this. My question here
was about libsvn_ra. Should the RA implementation (on the client side)
reconstruct the fulltext, or will it be useful for libsvn_client to have
the delta instead?

Regards,
//Peter

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@subversion.tigris.org
For additional commands, e-mail: dev-help@subversion.tigris.org

Re: Speeding up blame (fwd)

Posted by Mark Benedetto King <mb...@lowlatency.com>.
On Wed, May 26, 2004 at 09:29:43AM +0200, Peter N. Lundblad wrote:
> On Wed, 26 May 2004, Greg Hudson wrote:
> 
> > On Tue, 2004-05-25 at 16:44, Peter N. Lundblad wrote:
> > > Not really. As you can see below, I need svn_txdelta_window_handler_t,
> > > which is in svn_delta.h, which include svn_types. Hmmm...
> >
> > Another option is to use an iterative iterface (a la svn_fs_history)
> > rather than a callback interface in libsvn_repos.  Then only svn_ra
> > needs the callback type.  Contrived?  Possibly.  But an iterative
> > interface is often easier to use than a callback interface anyway, so it
> > makes some sense to provide one when we can.  And there's precedent for
> > do_log vs. svn_fs_history.
> >
> Yes, that would be possible.
> 
> Another approach that springs to my mind is not to expose the deltas in
> the RA interface, so we would move my previous typedef to svn_repos
> (renaming it) and in svn_ra.h add
> typedef svn_error_t *(*svn_ra_file_rev_handler_t)
>     (void *baton,
>      const char *path,
>      svn_revnum_t revnum,
>      apr_hash_t *rev_props,
>      svn_stream_t **contents,
>      apr_hash_t *props,
>      apr_pool_t *pool);
> 
> Then it could be implemented using the repository function for network RA
> implementations, but for ra_local, we don't need to calculate the deltas.
> I don't know if this is faster, but in any case, the client only cares
> about the whole file, so why give it a delta?
> 

If we're going through all the trouble of building a more efficient
blame interface, we might as go all the way and:

1.) save the server from the hassle of re-combining all the deltas
    to compute the fulltext of every rev.

2.) eventually give the client an opportunity to do something
    sneaky w.r.t. deltas + svn_diff().  If we always send fulltexts,
    the change information is lost and must be recomputed.

--ben




---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@subversion.tigris.org
For additional commands, e-mail: dev-help@subversion.tigris.org

Re: Speeding up blame (fwd)

Posted by "Peter N. Lundblad" <pe...@famlundblad.se>.
On Wed, 26 May 2004, Greg Hudson wrote:

> On Tue, 2004-05-25 at 16:44, Peter N. Lundblad wrote:
> > Not really. As you can see below, I need svn_txdelta_window_handler_t,
> > which is in svn_delta.h, which include svn_types. Hmmm...
>
> Another option is to use an iterative iterface (a la svn_fs_history)
> rather than a callback interface in libsvn_repos.  Then only svn_ra
> needs the callback type.  Contrived?  Possibly.  But an iterative
> interface is often easier to use than a callback interface anyway, so it
> makes some sense to provide one when we can.  And there's precedent for
> do_log vs. svn_fs_history.
>
Yes, that would be possible.

Another approach that springs to my mind is not to expose the deltas in
the RA interface, so we would move my previous typedef to svn_repos
(renaming it) and in svn_ra.h add
typedef svn_error_t *(*svn_ra_file_rev_handler_t)
    (void *baton,
     const char *path,
     svn_revnum_t revnum,
     apr_hash_t *rev_props,
     svn_stream_t **contents,
     apr_hash_t *props,
     apr_pool_t *pool);

Then it could be implemented using the repository function for network RA
implementations, but for ra_local, we don't need to calculate the deltas.
I don't know if this is faster, but in any case, the client only cares
about the whole file, so why give it a delta?

Regards,
//Peter

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@subversion.tigris.org
For additional commands, e-mail: dev-help@subversion.tigris.org

Re: Speeding up blame (fwd)

Posted by Greg Hudson <gh...@MIT.EDU>.
On Tue, 2004-05-25 at 16:44, Peter N. Lundblad wrote:
> Not really. As you can see below, I need svn_txdelta_window_handler_t,
> which is in svn_delta.h, which include svn_types. Hmmm...

Well, one option is to stick the type in svn_delta.h, then, although
that's a little odd.

Another option is to use an iterative iterface (a la svn_fs_history)
rather than a callback interface in libsvn_repos.  Then only svn_ra
needs the callback type.  Contrived?  Possibly.  But an iterative
interface is often easier to use than a callback interface anyway, so it
makes some sense to provide one when we can.  And there's precedent for
do_log vs. svn_fs_history.


---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@subversion.tigris.org
For additional commands, e-mail: dev-help@subversion.tigris.org

Re: Speeding up blame (fwd)

Posted by "Peter N. Lundblad" <pe...@famlundblad.se>.
On Tue, 25 May 2004, Greg Hudson wrote:

> On Tue, 2004-05-25 at 16:05, Peter N. Lundblad wrote:
> > OK. This isn't my current prototype, but still we have this little
> > handler. It will be used in a repository function as well as an RA
> > function. So I'm not sure in which "namespace" it should live (svn_ra or
> > svn_repos or just svn_, as the log receiver does). Also does it belong in
> > svntypes.h. I don't think client code should depend on svn_repos.h and
> > vice versa.
>
> Sounds like svn_types.h and the svn_ namespace are appropriate.
>
Not really. As you can see below, I need svn_txdelta_window_handler_t,
which is in svn_delta.h, which include svn_types. Hmmm...

//Peter

Index: svn_types.h
===================================================================
--- svn_types.h	(revision 9884)
+++ svn_types.h	(arbetskopia)
@@ -314,6 +314,27 @@
     void *baton);


+/** Callback used to fetch all revisions in a range of a file.  It is called
+ * for each interesting @a revision of the file, named @a in that revision.
+ * @a rev_props is the revision properties for this revision.  If
+ * @a delta_handler is non-NULL on return, the delta for the file between
+ * the previous interesting revisin and this one is fed into that window,
+ * which is also passed delta_baton.  The property differences are provided in
+ * prop_changes, in the same format as returned by @c svn_prop_diffs.  @a pool
+ * might be used for allocations during the call, but objects allocated
+ * in it may not live between calls.  @a baton is the usual closure.
+ */
+typedef svn_error_t *(*svn_file_rev_handler_t)
+     (void *baton,
+      const char *path,
+      svn_revnum_t revision,
+      const apr_hash_t *rev_props,
+      svn_txdelta_window_handler_t *delta_handler,
+      void **delta_baton,
+      const apr_array_header_t *prop_changes,
+      apr_pool_t *pool);
+
+
 /** The maximum amount we (ideally) hold in memory at a time when
  * processing a stream of data.
  *

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@subversion.tigris.org
For additional commands, e-mail: dev-help@subversion.tigris.org