Posted to dev@subversion.apache.org by Johan Corveleyn <jc...@gmail.com> on 2010/08/11 02:07:49 UTC

Re: Looking to improve performance of svn annotate

Hi all,

Other priorities have unfortunately kept me from focusing on the
project of speeding up blame. But recently I've spent some time
thinking about it, reading the other mail threads, studying the code
and profiling a little bit. I hope I can still do something useful for
blame, whether it be sooner or later.

I will first give some updates, info and discussion, and end up with a
couple of questions. Any input is greatly appreciated. A lot of this
stuff is still very fuzzy to me :-).

First, some profiling info (after sprinkling some printf's and some
time-counters into blame.c):

- Most of the client-side blame-time (~90%) is spent on producing the
diffs (more specifically, the call to svn_diff_file_diff_2 in the
add_file_blame function in blame.c).

- Applying the deltas to produce full-texts in the first place accounts for ~5%.

- Time spent on producing the blame info (inserting/deleting blame
chunks) is negligible ( < 0.5%).
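
(For reference, the time-counters were nothing more sophisticated than
the pattern below -- a minimal standalone sketch using APR's
apr_time_now(), not the actual printf sprinkling; it just times the
call that turned out to dominate:)

  #include <stdio.h>
  #include <apr_general.h>
  #include <apr_time.h>
  #include "svn_error.h"
  #include "svn_pools.h"
  #include "svn_diff.h"

  /* Time one svn_diff_file_diff_2() call -- the call that dominates
     client-side blame -- on the two files given on the command line. */
  int main(int argc, const char *argv[])
  {
    apr_pool_t *pool;
    svn_diff_t *diff;
    apr_time_t start;

    if (argc != 3)
      return 1;

    apr_initialize();
    pool = svn_pool_create(NULL);

    start = apr_time_now();
    svn_error_clear(svn_diff_file_diff_2(&diff, argv[1], argv[2],
                                         svn_diff_file_options_create(pool),
                                         pool));
    printf("svn_diff_file_diff_2 took %" APR_TIME_T_FMT " usecs\n",
           apr_time_now() - start);

    svn_pool_destroy(pool);
    apr_terminate();
    return 0;
  }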


Coming back with this information to the threads that Mark Phippard suggested:

On Mon, Mar 22, 2010 at 11:08 PM, Mark Phippard <ma...@gmail.com> wrote:
> If you haven't, you should review some of the old threads on speeding
> up blame.  Dan Berlin made a lot of improvements in the SVN 1.2
> timeframe and learned a lot about what does NOT speed it up.
>
> http://svn.haxx.se/dev/archive-2005-02/0275.shtml

This thread mainly ended up focusing on the "delta combiner", with the
conclusion of replacing vdelta with xdelta. I think this was a very
good optimization of which we now reap the benefits. If I understand
my profiling numbers correctly, the delta combining stuff is no longer
the biggest bottleneck (at least not on the client-side).

There is, however, another interesting point/question from this thread; see below.

> You should really search around for all of the threads he commented on
> to get the complete picture.  I think he also provided ideas on what
> more could potentially be done in the future.  Such as this one that I
> do not recall was ever done.
>
> http://svn.haxx.se/dev/archive-2007-09/0459.shtml
>
> Maybe the thread will explain why.

The thread didn't really explain why it wasn't implemented, but the
poster said he had little time to work on it further. The suggestions
were improvements in the processing of the blame info (optimizations
in iterating through the linked list of blame chunks, and discussion
about the most optimal data structure to represent these blame
chunks). The thread sort of ended with the question whether this would
be a significant optimization:
On Tue, Jan 11, 2005, Mark Benedetto King wrote:
> I'd also be interested in profiling output from your 1750-rev file to
> see how much time is spent in the blame_* functions themselves vs
> the svn_diff_* functions, so that we can at least give a nod towards
> Amdahl's Law before we run off optimizing the wrong thing.

From my profiling data, it seems optimizing those blame_* functions
would indeed not be significant. Most of the time is spent in the
svn_diff_* functions.


Which brings us back to the idea of the binary blame, and Julian's
description of the algorithm. Now, some questions:

1) Dan Berlin's thread prompted an interesting response from Branko
Čibej (http://svn.haxx.se/dev/archive-2005-02/0270.shtml):
> If by "diff format" you mean the binary delta, then... There was a time
> when I thought this would be possible. Now I'm not so sure. The trouble
> is that the vdelta doesn't generate an edit stream, it generates a
> compressed block-copy stream. Which means that you can never be 100% sure,
> just from looking at the delta, which bytes in the target are really new
> and which are just (offset) copies from the source. The only blocks you
> can really be sure about are those that are represented by new data in
> the delta (either NEW blocks or target copies that take data from NEW
> blocks). This problem is made worse by our use of skip deltas in (at
> least) the BDB back-end.
>
> I agree that it would be nice if the server could generate some sort of
> byte-range oriented info, but I don't think it can be done just by
> looking at the deltas. It's sad, I know...

What? So this idea was considered before, but rejected? Is this
argument still valid (e.g. with xdelta instead of vdelta)? Is there no way
around that, does it mean this simply cannot work? Maybe Branko can
comment on that? An example or some more detailed explanation would be
very welcome.

2) A concrete problem with the algorithm as described by Julian would
be the delete of bytes within a line. E.g. suppose one deletes a
single character in r60 ("oops, a boolean test was inverted, let's
delete that stray '!'"). That deleted character wouldn't be
represented in the binary blame structure (it's simply cut out), and
it's obviously not present in the final text. But the line on which it
occurred was now last edited in r60.

Current blame (diff-based) has no problem with this situation. It
correctly shows r60 as the last revision for that line. But the binary
algorithm loses information about the deleted bytes. That's a problem.
I haven't really worked out how to handle that. Anybody have any
ideas?

Cheers,
-- 
Johan

Re: Looking to improve performance of svn annotate

Posted by Greg Hudson <gh...@MIT.EDU>.
On Tue, 2010-08-17 at 09:26 -0400, Johan Corveleyn wrote:
> Greg, could you explain a bit more what you mean by
> "edit-stream-style binary diffs", vs. the binary deltas we have now?
> Could you perhaps give an example similar to Julian's? Wouldn't you
> have the same problem with pieces of the source text being copied
> out-of-order (100 bytes from the end/middle of the source being copied
> to the beginning of the target, followed by the rest of the source)?

Let's take a look at the differences between a line-based edit stream
diff (such as what you'd see in the output of diff -e) and a binary
delta as we have in Subversion.

The most obvious difference is that the former traffics in lines, rather
than arbitrary byte ranges, but the actual differences are much deeper.
A line-based diff can be modeled with the following "instructions":

  * Copy the next N lines of source to target.
  * Skip the next N lines of source.
  * Copy the next N lines of new data to target.

After applying a diff like this, you can easily divide the target lines
into two categories: those which originated from the source, and those
which originated from the diff.  The division may not accurately
represent the intent of the change (there's the classic problem of the
mis-attributed close brace, for instance; see
http://bramcohen.livejournal.com/73318.html), but it's typically pretty
close.
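
In C terms, that whole instruction set fits in a couple of
declarations -- illustrative names only, not from any real diff
library:

  /* A line-based edit stream: three instructions, strictly sequential
     access into the source. */
  enum line_op
  {
    LINE_COPY_SOURCE,   /* copy the next N lines of source to target */
    LINE_SKIP_SOURCE,   /* skip the next N lines of source */
    LINE_COPY_NEW       /* copy the next N lines of new data to target */
  };

  struct line_instr
  {
    enum line_op op;
    int nlines;
  };

  /* Replaying such a stream labels every target line unambiguously:
     LINE_COPY_SOURCE lines originated in the source, LINE_COPY_NEW
     lines originated in the diff. */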

Subversion binary deltas have a more flexible instruction set, more akin
to what you'd find in a compression algorithm.  The source and target
are chopped up into windows, and for each window you have:

  * Copy N bytes from offset O in the source window to target.
  * Copy N bytes from offset O in the target window to target.
  * Copy the next N bytes of new data to target.

There is no easy way to divide the target into source bytes and diff
bytes.  Certainly, you can tag which bytes were copied from the source
window, but that's meaningless.  Bytes which came from the source window
may have been rearranged by the diff; bytes which came from new data may
only have done so because of windowing.
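
(This flexible instruction set is essentially what svn_txdelta_op_t
encodes; quoting roughly from svn_delta.h, abridged from memory:)

  enum svn_delta_action
  {
    svn_txdelta_source,   /* copy from the source window */
    svn_txdelta_target,   /* copy from the already-produced target window */
    svn_txdelta_new       /* copy from the window's new data */
  };

  typedef struct svn_txdelta_op_t
  {
    enum svn_delta_action action_code;
    apr_size_t offset;    /* where in the relevant buffer to copy from */
    apr_size_t length;    /* how many bytes to copy */
  } svn_txdelta_op_t;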

The optimization idea is to create a new kind of diff (or more likely,
research an existing algorithm) which obeys the rules of the line-based
edit stream--no windowing, sequential access only into the source
stream--but traffics in bytes instead of lines.  With such a diff in
hand, we can divide the target bytes into source-origin and diff-origin,
and then, after splitting the target into lines, determine which lines
are "tainted" by diff-origin bytes and therefore should be viewed as
originating in the diff.
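
Here is a sketch of that last step, under an invented encoding of such
a byte-wise edit stream (the instruction set and the hard-coded stream
are purely illustrative; the input is Julian's r1/r2 example from
elsewhere in this thread):

  #include <stdio.h>
  #include <string.h>

  enum byte_op { BYTE_COPY, BYTE_SKIP, BYTE_NEW };

  struct byte_instr
  {
    enum byte_op op;
    size_t len;
    const char *data;        /* only used by BYTE_NEW */
  };

  /* Apply a byte-wise edit stream to SOURCE, tagging every target byte
     as source-origin (0) or diff-origin (1), then report which target
     lines are tainted by diff-origin bytes. */
  static void
  blame_lines(const char *source, const struct byte_instr *instr, int n)
  {
    char target[256], origin[256];
    size_t spos = 0, tpos = 0, j;
    int i, line = 1, tainted = 0;

    for (i = 0; i < n; i++)
      {
        switch (instr[i].op)
          {
          case BYTE_COPY:    /* sequential copy from the source */
            memcpy(target + tpos, source + spos, instr[i].len);
            memset(origin + tpos, 0, instr[i].len);
            spos += instr[i].len;
            tpos += instr[i].len;
            break;
          case BYTE_SKIP:    /* deletion: just advance in the source */
            spos += instr[i].len;
            break;
          case BYTE_NEW:     /* new data carried in the diff itself */
            memcpy(target + tpos, instr[i].data, instr[i].len);
            memset(origin + tpos, 1, instr[i].len);
            tpos += instr[i].len;
            break;
          }
      }

    for (j = 0; j < tpos; j++)
      {
        tainted |= origin[j];
        if (target[j] == '\n')
          {
            printf("line %d: %s-origin\n", line++,
                   tainted ? "diff" : "source");
            tainted = 0;
          }
      }
  }

  int main(void)
  {
    /* "ONE\nTWO\nTHE REST\n" -> "TWO\nTHREE\nTWO\n", sequentially. */
    const struct byte_instr instr[] = {
      { BYTE_SKIP, 4, NULL },     /* delete "ONE\n"                   */
      { BYTE_COPY, 6, NULL },     /* "TWO\nTH"                        */
      { BYTE_NEW,  2, "RE" },     /* the genuinely new "RE"           */
      { BYTE_COPY, 1, NULL },     /* "E"                              */
      { BYTE_NEW,  5, "\nTWO\n" } /* out-of-order source text must be
                                     encoded as new data here         */
    };
    blame_lines("ONE\nTWO\nTHE REST\n", instr, 5);
    return 0;
  }

Line 2 ("THREE\n") and line 3 (the repeated "TWO\n") come out
diff-origin, which matches what a line-based diff would report.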


Re: Looking to improve performance of svn annotate

Posted by Johan Corveleyn <jc...@gmail.com>.
On Fri, Aug 20, 2010 at 9:11 PM, Johan Corveleyn <jc...@gmail.com> wrote:
> On Fri, Aug 20, 2010 at 1:40 PM, Johan Corveleyn <jc...@gmail.com> wrote:
>> After eliminating antivirus, and running with a release build instead
>> of a debug build, svn diff is just about on par with GNU diff. So this
>> eliminates the option of optimizing diff ...
>
> Unless ...
>
> For every diff during blame calculation, tokens (lines) are extracted
> and processed each time for the "original" and the "modified". This
> takes up most of the time of svn diff. However, the file that is
> playing the "modified" role in one step, will be the "original" in the
> next step of blame processing. So in theory we already have those
> tokens from the previous step, and don't have to extract/compare/...
> them again.
>
> If one could find a way to recycle the tokens from the "modified" of
> the previous diff into the tokens of the "original" of the next diff,
> that would probably make the diff part of the blame operation almost
> twice as fast. And since diffing still accounts for ~90% of blame time
> on the client, that should speed it up considerably.
>
> Sounds like a plan?
>
> I'll try to write some sort of POC for this idea soon, unless someone
> tells me it's a stupid idea :-).

Hm, after several attempts to get this working, it seems to be a dead
end. Whatever I do, I can't seem to avoid lifetime issues, dangling
pointers, tokens referring to files which are no longer there, ...

I first tried just to pass through the position_list[1] (position_list
from the modified file) to the next diff run as the new
position_list[0]. I did this by returning the position_list[1] in an
output parameter, up to the call in blame.c!add_file_blame, and then
reusing that as input parameter for the next diff call (by stashing it
away in the file_rev_baton, until the next run). That works, but it
doesn't help, because the real power of the algorithm is in the tree,
in which all the tokens from both original and modified are sorted and
made unique (i.e. each unique token is represented only once in the
tree). After the tree is correctly set up, the rest of the algo is
very fast, because it only needs to compare token references to find
the longest common subsequence.

At the end of each run the tree is filled with a combination of the
tokens of original and modified, so I can't just reuse/recycle that,
because that would lead to a major memory leak (building up the tree
with all the tokens from all revisions of the entire blame operation).
And refilling a clean tree with the tokens already present in the
passed-on position_list also doesn't work, because the token content
really needs to still be accessible (either in memory, or readable
from the original file), so that token_compare can work when adding
the new tokens from the new modified file ...

Anyway, I don't know if this still makes sense. I'm just noting it
down to order my thoughts, and maybe help someone else thinking about
this. I'm gonna let it rest now, because I seem to have hit a brick
wall with this.

Will focus my efforts on other approaches to speed up blame, such as
the fundamental changes of binary blaming, avoiding diffs, ... but I'm
not sure if I'll be able to (have the time to) climb the learning
curve for that.

Cheers,
-- 
Johan

Re: Looking to improve performance of svn annotate

Posted by Johan Corveleyn <jc...@gmail.com>.
On Fri, Aug 20, 2010 at 1:40 PM, Johan Corveleyn <jc...@gmail.com> wrote:
> After eliminating antivirus, and running with a release build instead
> of a debug build, svn diff is just about on par with GNU diff. So this
> eliminates the option of optimizing diff ...

Unless ...

For every diff during blame calculation, tokens (lines) are extracted
and processed each time for the "original" and the "modified". This
takes up most of the time of svn diff. However, the file that is
playing the "modified" role in one step, will be the "original" in the
next step of blame processing. So in theory we already have those
tokens from the previous step, and don't have to extract/compare/...
them again.

If one could find a way to recycle the tokens from the "modified" of
the previous diff into the tokens of the "original" of the next diff,
that would probably make the diff part of the blame operation almost
twice as fast. And since diffing still accounts for ~90% of blame time
on the client, that should speed it up considerably.

Sounds like a plan?

I'll try to write some sort of POC for this idea soon, unless someone
tells me it's a stupid idea :-).
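
To make the idea a bit more concrete, the kind of API change I have in
mind would look something like this (purely hypothetical -- svn_diff
has no such function today, and all the names below are invented):

  /* Hypothetical variant of svn_diff_file_diff_2() that hands the
     "modified" side's token data back to the caller, so the next diff
     in the blame loop can seed its "original" side with it.
     (Would live in svn_diff.h; uses the existing svn_diff types.) */
  typedef struct diff_token_cache_t diff_token_cache_t;   /* opaque */

  svn_error_t *
  svn_diff_file_diff_reusing_tokens(svn_diff_t **diff,
                                    diff_token_cache_t **modified_cache,
                                    diff_token_cache_t *original_cache,
                                    const char *original,
                                    const char *modified,
                                    const svn_diff_file_options_t *options,
                                    apr_pool_t *pool);

  /* blame.c!add_file_blame would stash *modified_cache in the
     file_rev_baton after each iteration and pass it back in as
     original_cache (NULL on the first iteration) for the next one. */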

Cheers,
-- 
Johan

Re: Looking to improve performance of svn annotate

Posted by Johan Corveleyn <jc...@gmail.com>.
On Tue, Aug 17, 2010 at 3:26 PM, Johan Corveleyn <jc...@gmail.com> wrote:
> Another thing that occurred to me: since most of the time in the current
> blame implementation is spent on "diff" (svn_diff_file_diff_2), maybe
> a quick win could be to simply (?) optimize the diff code? Or write a
> specialized faster version for blame.
>
> In my tests with a 1.5 MB file (61000 lines), svn diffing it takes
> about 500 ms on my machine. GNU diff is much faster (300 ms for the
> first run, 72 ms on following runs). This seems to indicate that there
> is much room for optimization of svn diff. Or is there something extra
> that svn diff does, necessary in the svn context?
>
> I have looked a little bit at the svn diff code, and saw that most of
> the time is spent in the while loop inside svn_diff__get_tokens in
> token.c, presumably extracting the tokens (lines) from the file(s).
> Haven't looked any further/deeper. Anybody have any brilliant
> ideas/suggestions? Or is this a bad idea, not worthy of further
> exploration :-) ?

As noted in http://svn.haxx.se/dev/archive-2010-08/0517.shtml, my
performance measurements of svn diff were quite inaccurate, to say the
least.

After eliminating antivirus, and running with a release build instead
of a debug build, svn diff is just about on par with GNU diff. So this
eliminates the option of optimizing diff ...

Cheers,
-- 
Johan

Re: Looking to improve performance of svn annotate

Posted by Johan Corveleyn <jc...@gmail.com>.
On Thu, Aug 12, 2010 at 5:30 PM, Greg Hudson <gh...@mit.edu> wrote:
> On Thu, 2010-08-12 at 10:57 -0400, Julian Foad wrote:
>> I'm wary of embedding any client functionality in the server, but I
>> guess it's worth considering if it would be that useful.  If so, let's
>> take great care to ensure it's only lightly coupled to the core server
>> logic.
>
> Again, it's possible that binary diffs between sequential revisions
> could be used for blame purposes (not the binary deltas we have now, but
> edit-stream-style binary diffs), which would decouple the
> line-processing logic from the server.
>
> (But again, I haven't thought through the problem in enough detail to be
> certain.)

If such edit-stream-style binary diffs could do the job, and they are
"fast enough" (I'm guessing that line-based vs. binary wouldn't make
that much of a difference for the eventual blame processing), it seems
like a good compromise: we get the performance benefits of
blame-oriented deltas (supposedly fast and easy to calculate blame
info from), possibly cached on the server, while still not introducing
unnecessary coupling of the server to line-processing logic.

Greg, could you explain a bit more what you mean by
"edit-stream-style binary diffs", vs. the binary deltas we have now?
Could you perhaps give an example similar to Julian's? Wouldn't you
have the same problem with pieces of the source text being copied
out-of-order (100 bytes from the end/middle of the source being copied
to the beginning of the target, followed by the rest of the source)?
Wouldn't you also have to do the work of discovering the largest
contiguous block of source text as "the main stream", and thus determine
that those first 100 bytes are to be interpreted as new bytes, etc.?

Caching this stuff on the server would of course be ideal. Whether it
be "post-commit" or on-demand (first guy requesting the blame takes
the hit), both approaches seem good to me. Working on that would be
severely out of my league though :-). At least for now.


Another thing that occurred to me: since most of the time in the current
blame implementation is spent on "diff" (svn_diff_file_diff_2), maybe
a quick win could be to simply (?) optimize the diff code? Or write a
specialized faster version for blame.

In my tests with a 1.5 MB file (61000 lines), svn diffing it takes
about 500 ms on my machine. GNU diff is much faster (300 ms for the
first run, 72 ms on following runs). This seems to indicate that there
is much room for optimization of svn diff. Or is there something extra
that svn diff does, necessary in the svn context?

I have looked a little bit at the svn diff code, and saw that most of
the time is spent in the while loop inside svn_diff__get_tokens in
token.c, presumably extracting the tokens (lines) from the file(s).
Haven't looked any further/deeper. Anybody have any brilliant
ideas/suggestions? Or is this a bad idea, not worthy of further
exploration :-) ?

BTW, I also tested with Stefan Fuhrmann's performance branch@r985697,
just for kicks (had some trouble building it on Windows, but
eventually managed to get an svn.exe out of it). The timing of svn
diff of such a large file was about the same, so that didn't help. But
maybe the branch isn't ready for prime time just yet ...

Cheers,
-- 
Johan

Re: Looking to improve performance of svn annotate

Posted by Greg Hudson <gh...@MIT.EDU>.
On Thu, 2010-08-12 at 10:57 -0400, Julian Foad wrote:
> I'm wary of embedding any client functionality in the server, but I
> guess it's worth considering if it would be that useful.  If so, let's
> take great care to ensure it's only lightly coupled to the core server
> logic.

Again, it's possible that binary diffs between sequential revisions
could be used for blame purposes (not the binary deltas we have now, but
edit-stream-style binary diffs), which would decouple the
line-processing logic from the server.

(But again, I haven't thought through the problem in enough detail to be
certain.)


Re: Looking to improve performance of svn annotate

Posted by Julian Foad <ju...@wandisco.com>.
On Thu, 2010-08-12, C. Michael Pilato wrote:
> In times past, I've wondered if the server couldn't just store line-delta
> information -- as a comparison between each FS DAG node and its immediate
> predecessor -- similar to the way that CVS does (and in addition to the
> stuff it already stores, of course).  The line-delta info could be populated
> post-commit just as the BDB backend did deltafication, or perhaps also on
> demand (to rebuild this information for older servers) via some 'svnadmin'
> command.

The server could usefully calculate and store it - but on demand and
cached, not at commit time - see below.

>   But it shouldn't ever change once calculated, right?

Well ... that depends.  The line-ending style the client wants to use
can vary over time: in order to cope with the cases where it was not
specified correctly in earlier revisions, the client may want to use the
EOL style specified by HEAD (or the latest version being blamed).  It
may be possible to calculate and store a generic data set that will be
useful whatever EOL style the client eventually decides it should be
using.

> My only concern is in dealing with the definition of a "line".  The FS layer
> today is happily file-content agnostic.  All EOL translation stuffs are
> considered voodoo of interest only to the client layers.  We could, of
> course, choose to make the FS layer pay attention to the likes of the
> svn:eol-style property, but that feels mucho icky.
> 
> Thoughts?

If the decision to calculate and store linewise info for a given file is
made at commit time by the server, it would probably want to consider
svn:mime-type or the likes, to avoid doing it unnecessarily on all
files.  Clients might then never request blame info on a majority of
files.  Conversely, a client might request blame info for a file on
which the server thought the data would not be needed (perhaps because
it was marked 'application/octet-stream' for example).

I would think that having the server calculate the info on demand and
store it in a limited-lifetime cache is more sensible, if we take a
server-assisted approach at all.  A cache would handle the cases above
better, and could also be made to handle requests with varying
definitions of EOL style if that is necessary.

I'm wary of embedding any client functionality in the server, but I
guess it's worth considering if it would be that useful.  If so, let's
take great care to ensure it's only lightly coupled to the core server
logic.

- Julian


Re: Looking to improve performance of svn annotate

Posted by "C. Michael Pilato" <cm...@collab.net>.
In times past, I've wondered if the server couldn't just store line-delta
information -- as a comparison between each FS DAG node and its immediate
predecessor -- similar to the way that CVS does (and in addition to the
stuff it already stores, of course).  The line-delta info could be populated
post-commit just as the BDB backend did deltafication, or perhaps also on
demand (to rebuild this information for older servers) via some 'svnadmin'
command.  But it shouldn't ever change once calculated, right?

My only concern is in dealing with the definition of a "line".  The FS layer
today is happily file-content agnostic.  All EOL translation stuffs are
considered voodoo of interest only to the client layers.  We could, of
course, choose to make the FS layer pay attention to the likes of the
svn:eol-style property, but that feels mucho icky.

Thoughts?

-- 
C. Michael Pilato <cm...@collab.net>
CollabNet   <>   www.collab.net   <>   Distributed Development On Demand


Re: Looking to improve performance of svn annotate

Posted by Greg Hudson <gh...@MIT.EDU>.
On Wed, 2010-08-11 at 19:14 -0400, Johan Corveleyn wrote:
> I naively thought that the server, upon being called get_file_revs2,
> would just supply the deltas which it has already stored in the
> repository. I.e. that the deltas are just the native format in which
> the stuff is kept in the back-end FS, and the server wasn't doing much
> else but iterate through the relevant files, and extract the relevant
> bits.

The server doesn't have deltas between each revision and the next (or
previous).  Instead, it has "skip-deltas" which may point backward a
large number of revisions.  This allows any revision of a file to be
reconstructed in O(log(n)) delta applications, where n is the number of
file revisions, but it makes "what the server has lying around" even
less useful for blame output.
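
(A quick illustration of the O(log(n)) claim, assuming the classic
skip-delta scheme where revision N is deltified against revision N
with its lowest set bit cleared; the chain length is then the number
of one-bits in N:)

  #include <stdio.h>

  /* Delta applications needed to reconstruct file revision N under a
     clear-lowest-bit skip-delta scheme: one hop per set bit in N. */
  static int
  delta_chain_length(unsigned int n)
  {
    int hops = 0;
    while (n > 0)
      {
        n &= n - 1;   /* next delta base: N, lowest set bit cleared */
        hops++;
      }
    return hops;
  }

  int main(void)
  {
    /* e.g. revision 1000 of a file: 6 applications, not 1000. */
    printf("rev 1000 needs %d delta applications\n",
           delta_chain_length(1000));
    return 0;
  }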

It's probably best to think of the FS as a black box which can produce
any version of a file in reasonable time.  If you look at
svn_repos_get_file_revs2 in libsvn_repos/rev_hunt.c, you'll see the code
which produces deltas to send to the client, using
svn_fs_get_file_delta_stream.

The required code changes for this kind of optimization would be fairly
deep, I think.  You'd have to invent a new type of "diffy" delta
algorithm (either line-based or binary, but either way producing an edit
stream rather than acting like a compression algorithm), and then
parameterize a bunch of functions which produce deltas, and then have
the server-side code produce diffy deltas, and then have the client code
recognize when it's getting diffy deltas and behave more efficiently.

If the new diffy-delta algorithm isn't format-compatible with the
current encoding, you'd also need some protocol negotiation.


Re: Looking to improve performance of svn annotate

Posted by Johan Corveleyn <jc...@gmail.com>.
Thanks for the input/explanations, Julian and Greg. Some reactions below ...

On Wed, Aug 11, 2010 at 12:04 PM, Julian Foad <ju...@wandisco.com> wrote:
[...]
> I hadn't really appreciated this difficulty.  By noticing the copy
> source positions when interpreting the delta, we might be able to do
> better:
>
>  cp from r1[4:10]  "TWO\nTH"  => OK, this is r1 source text
>  cp from r1[12:14] "RE"       => "we skipped/deleted 2 source bytes"
>  cp from r1[2:8]   "E\nTWO\n" => "this is out of order, so is an add"
>
> which we could store in our intermediate data structure as
>
>  r1  "TWO\nTH"
>  r2  ""           # recording a deletion at this position
>  r1  "RE"
>  r2  "E\nTWO\n"   # recorded as an addition
>
> That's a little better, but it relies on having some robust way to
> decide which copies are "out of order" and which copies are "getting
> back into the main sequential flow of the source", taking into account
> that copy source ranges can come from behind and ahead of the "current"
> position and can overlap with other copies.  And of course taking into
> account that copies of long chunks of the source text should take
> priority over short bits when it comes to deciding which revision the
> text belongs to.
>
> I'm not sure if there's a feasible solution.

How does diff handle that? E.g. if the target starts with some copy of
the end of the source, followed by the main part of the source from the
beginning; or with just a small piece of the source that's "out of order",
followed by a bigger chunk of source that's more in the "main
sequential flow". Maybe the problems are similar?

Maybe I should study the diff code a bit ...

On Wed, Aug 11, 2010 at 4:39 PM, Greg Hudson <gh...@mit.edu> wrote:
> In the process of a blame operation, the server is already performing
> binary deltas (which aren't directly usable for blame) between each pair
> of revs.  It wouldn't necessarily be any more work for the server to
> perform line-based diffs instead, although that would be the sort of
> text processing we don't ordinarily put into the server code.
>
> Alternatively, the server could perform a more diff-like binary delta,
> where the only instructions are "copy the next N bytes from the input"
> or "insert these new bytes."  Such a binary delta could possibly be
> transformed into line-based diffs on the client, although I'm not
> completely sure of that.

I haven't studied the server part of the operation that much yet. But
this seems like it could be an interesting (additional) approach. Any
pointers to where in the source code I should start looking, i.e. where
this stuff is happening?

I naively thought that the server, upon being called get_file_revs2,
would just supply the deltas which it has already stored in the
repository. I.e. that the deltas are just the native format in which
the stuff is kept in the back-end FS, and the server wasn't doing much
else but iterate through the relevant files, and extract the relevant
bits.

But it seems that's not the case, and the server is calculating (some
of) these binary deltas on the fly? In that case, yes, that seems
like another good place to optimize, or to do some
pre-processing work to make it easier for the client to calculate the
blame.


Cheers,
-- 
Johan

Re: Looking to improve performance of svn annotate

Posted by Greg Hudson <gh...@MIT.EDU>.
On Wed, 2010-08-11 at 06:04 -0400, Julian Foad wrote:
> Ah.  An interesting point which I hadn't fully digested.  I think the
> gist of Branko's point is that the delta format doesn't distinguish
> between

In the process of a blame operation, the server is already performing
binary deltas (which aren't directly usable for blame) between each pair
of revs.  It wouldn't necessarily be any more work for the server to
perform line-based diffs instead, although that would be the sort of
text processing we don't ordinarily put into the server code.

Alternatively, the server could perform a more diff-like binary delta,
where the only instructions are "copy the next N bytes from the input"
or "insert these new bytes."  Such a binary delta could possibly be
transformed into line-based diffs on the client, although I'm not
completely sure of that.


Re: Looking to improve performance of svn annotate

Posted by Julian Foad <ju...@wandisco.com>.
On Wed, 2010-08-11, Johan Corveleyn wrote:
> Hi all,
> 
> Other priorities have unfortunately kept me from focusing on the
> project of speeding up blame. But recently I've spent some time
> thinking about it, reading the other mail threads, studying the code
> and profiling a little bit. I hope I can still do something useful for
> blame, whether it be sooner or later.
> 
> I will first give some updates, info and discussion, and end up with a
> couple of questions. Any input is greatly appreciated. A lot of this
> stuff is still very fuzzy to me :-).
> 
> First, some profiling info (after sprinkling some printf's and some
> time-counters into blame.c):
> 
> - Most of the client-side blame-time (~90%) is spent on producing the
> diffs (more specifically, the call to svn_diff_file_diff_2 in the
> add_file_blame function in blame.c).
> 
> - Applying the deltas to produce full-texts in the first place accounts for ~5%.
> 
> - Time spent on producing the blame info (inserting/deleting blame
> chunks) is negligible ( < 0.5%).
> 
> 
> Coming back with this information to the threads that Mark Phippard suggested:
> 
> On Mon, Mar 22, 2010 at 11:08 PM, Mark Phippard <ma...@gmail.com> wrote:
> > If you haven't, you should review some of the old threads on speeding
> > up blame.  Dan Berlin made a lot of improvements in the SVN 1.2
> > timeframe and learned a lot about what does NOT speed it up.
> >
> > http://svn.haxx.se/dev/archive-2005-02/0275.shtml
> 
> This thread mainly ended up focusing on the "delta combiner", with the
> conclusion of replacing vdelta with xdelta. I think this was a very
> good optimization of which we now reap the benefits. If I understand
> my profiling numbers correctly, the delta combining stuff is no longer
> the biggest bottleneck (at least not on the client-side).
> 
> There is, however, another interesting point/question from this thread; see below.
> 
> > You should really search around for all of the threads he commented on
> > to get the complete picture.  I think he also provided ideas on what
> > more could potentially be done in the future.  Such as this one that I
> > do not recall was ever done.
> >
> > http://svn.haxx.se/dev/archive-2007-09/0459.shtml
> >
> > Maybe the thread will explain why.
> 
> The thread didn't really explain why it wasn't implemented, but the
> poster said he had little time to work on it further. The suggestions
> were improvements in the processing of the blame info (optimizations
> in iterating through the linked list of blame chunks, and discussion
> about the most optimal data structure to represent these blame
> chunks). The thread sort of ended with the question whether this would
> be a significant optimization:
> On Tue, Jan 11, 2005, Mark Benedetto King wrote:
> > I'd also be interested in profiling output from your 1750-rev file to
> > see how much time is spent in the blame_* functions themselves vs
> > the svn_diff_* functions, so that we can at least give a nod towards
> > Amdahl's Law before we run off optimizing the wrong thing.
> 
> From my profiling data, it seems optimizing those blame_* functions
> would indeed not be significant. Most of the time is spent in the
> svn_diff_* functions.
> 
> 
> Which brings us back to the idea of the binary blame, and Julian's
> description of the algorithm. Now, some questions:
> 
> 1) Dan Berlin's thread prompted an interesting response from Branko
> Čibej (http://svn.haxx.se/dev/archive-2005-02/0270.shtml):
> > If by "diff format" you mean the binary delta, then... There was a time
> > when I thought this would be possible. Now I'm not so sure. The trouble
> > is that the vdelta doesn't generate an edit stream, it generates a
> > compressed block-copy stream. Which means that you can never be 100% sure,
> > just from looking at the delta, which bytes in the target are really new
> > and which are just (offset) copies from the source. The only blocks you
> > can really be sure about are those that are represented by new data in
> > the delta (either NEW blocks or target copies that take data from NEW
> > blocks). This problem is made worse by our use of skip deltas in (at
> > least) the BDB back-end.
> >
> > I agree that it would be nice if the server could generate some sort of
> > byte-range oriented info, but I don't think it can be done just by
> > looking at the deltas. It's sad, I know...
> 
> What? So this idea was considered before, but rejected? Is this
> argument still valid (e.g. with xdelta instead of vdelta)? Is there no way
> around that, does it mean this simply cannot work? Maybe Branko can
> comment on that? An example or some more detailed explanation would be
> very welcome.

Ah.  An interesting point which I hadn't fully digested.  I think the
gist of Branko's point is that the delta format doesn't distinguish
between

  "copy some old text because it's still here in the corresponding
location in the new version"

and

  "copy some old text, out of order, to generate text that we would
consider to be 'new' at this position in the new version".

We want to consider each piece of old text as being a candidate for at
most one piece of the target text, but the delta format doesn't give us
many clues about that.

For example, let's say the repo content of the file in two revisions is:

  r1: "ONE\nTWO\nTHE REST\n"

  r2: "TWO\nTHREE\nTWO\n"

And let's say we're using this binary method with forward deltas, for
ease of thinking about it.  (The principle would be the same with
reverse deltas.)  The delta r1->r2 might look like (and this is just one
of many possible encodings):

  cp from r1: "TWO\nTH"
  new in r2:  "RE"
  cp from r1: "E\nTWO\n"

which we can simply split into lines and get the answer we want:

  "TWO\n"   @ max(1)     => r1
  "THREE\n" @ max(1,2,1) => r2
  "TWO\n"   @ max(1)     => r1

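Mechanically, that "split into lines and take the max" step could look
like this (illustrative only; the revision-tagged ranges are hard-coded
from the encoding above):

  #include <stdio.h>
  #include <string.h>

  /* A contiguous byte range of the r2 full-text, tagged with the
     revision that produced it. */
  struct range { const char *text; int rev; };

  int main(void)
  {
    const struct range ranges[] = {
      { "TWO\nTH",  1 },   /* cp from r1 */
      { "RE",       2 },   /* new in r2  */
      { "E\nTWO\n", 1 }    /* cp from r1 */
    };
    int line_rev = 0;      /* max revision seen on the current line */
    size_t i, j;

    for (i = 0; i < sizeof(ranges) / sizeof(ranges[0]); i++)
      for (j = 0; j < strlen(ranges[i].text); j++)
        {
          if (ranges[i].rev > line_rev)
            line_rev = ranges[i].rev;
          if (ranges[i].text[j] == '\n')
            {
              printf("line blamed to r%d\n", line_rev);
              line_rev = 0;
            }
        }
    return 0;
  }
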
But the delta might instead look like:

  cp from r1: "TWO\nTH"
  cp from r1: "RE"
  cp from r1: "E\nTWO\n"

With a naïve interpretation, ignoring the source positions of these
copies, this would look like none of the lines are new or modified in
r2, which is clearly not an acceptable answer.

I hadn't really appreciated this difficulty.  By noticing the copy
source positions when interpreting the delta, we might be able to do
better:

  cp from r1[4:10]  "TWO\nTH"  => OK, this is r1 source text
  cp from r1[12:14] "RE"       => "we skipped/deleted 2 source bytes"
  cp from r1[2:8]   "E\nTWO\n" => "this is out of order, so is an add"

which we could store in our intermediate data structure as

  r1  "TWO\nTH"
  r2  ""           # recording a deletion at this position
  r1  "RE"
  r2  "E\nTWO\n"   # recorded as an addition

That's a little better, but it relies on having some robust way to
decide which copies are "out of order" and which copies are "getting
back into the main sequential flow of the source", taking into account
that copy source ranges can come from behind and ahead of the "current"
position and can overlap with other copies.  And of course taking into
account that copies of long chunks of the source text should take
priority over short bits when it comes to deciding which revision the
text belongs to.

I'm not sure if there's a feasible solution.


> 2) A concrete problem with the algorithm as described by Julian would
> be the delete of bytes within a line. E.g. suppose one deletes a
> single character in r60 ("oops, a boolean test was inverted, let's
> delete that stray '!'"). That deleted character wouldn't be
> represented in the binary blame structure (it's simply cut out), and
> it's obviously not present in the final text. But the line on which it
> occurred was now last edited in r60.
> 
> Current blame (diff-based) has no problem with this situation. It
> correctly shows r60 as the last revision for that line. But the binary
> algorithm loses information about the deleted bytes. That's a problem.
> I haven't really worked out how to handle that. Anybody have any
> ideas?

The deleted bytes issue is easy to handle.  The algorithm uses a data
structure to remember which revision is last known to have introduced
each contiguous range of bytes.  For each deletion, the structure must
contain a zero-length range carrying the revision in which something
that previously existed in that gap was deleted, as shown in my last
example above.
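
In other words, something like this (a minimal sketch with invented
names, using a plain linked list; the r12 revision number is made up
for illustration):

  /* Intermediate blame structure: byte ranges of the current
     full-text, each tagged with the revision last known to have
     touched it.  A zero-length range records a deletion. */
  struct blame_range
  {
    apr_size_t len;              /* 0 == something was deleted here */
    svn_revnum_t rev;            /* revision responsible for this range */
    struct blame_range *next;
  };

  /* Johan's '!' example: ... "if (" [r12] -> len==0 [r60] -> "flag)"
     [r12] ...  When the ranges are later folded into lines, the
     zero-length r60 entry taints its line, so the whole line blames
     to r60, matching what diff-based blame reports today. */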

- Julian