Posted to dev@subversion.apache.org by Johan Corveleyn <jc...@gmail.com> on 2010/03/21 21:08:44 UTC

Looking to improve performance of svn annotate

Hi all,

I'm very much looking forward to the performance (and other)
improvements wc-ng will bring us. But there's more to svn performance
than wc stuff :-). In our company, "svn annotate/blame/praise" is a
known offender (people who attended subconf 2009 may remember me as
the guy who complained about the annotate of a 2 MB file with 6000 revs
taking over 4 hours).

So, in the name of "make it faster, then make it faster again", and
feeling that maybe it's time to scratch one of my own itches, I'd like
to take a stab at this myself. I have no idea whether I'm up to this
task (I am a total newbie at svn development), but I can try. Until
now I have just analyzed some parts of the source code. I have set up a
build environment on my machine (on Windows), and have been able to
run it from inside a debugger (Visual C 2008 Express Edition), setting
breakpoints, seeing what happens, ...

I am specifically looking at the suggestion that Julian made in the
following comment in the issue tracker
(http://subversion.tigris.org/issues/show_bug.cgi?id=3074#desc5):

[[[
My idea for client-side improvement:

The client, while receiving binary diffs for successive revisions, could
calculate a byte-wise annotation of the file. (It probably would not have to
reconstruct full texts to do that.) Then, at the end, it could split the
byte-wise annotation at the file's line boundaries and report the highest
revision number within each line.
]]]
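Julian's idea can be sketched in a few lines of Python. The diff-op format used here ("copy"/"new" tuples) is a hypothetical simplification for illustration, not Subversion's actual delta encoding:

```python
# Sketch of byte-wise annotation: carry a per-byte revision tag through
# each revision's diff, then split at line boundaries at the end.
# Op format (hypothetical):
#   ("copy", offset, length) -- bytes taken from the previous revision
#   ("new", length)          -- bytes introduced by this revision

def apply_rev_diff(annotation, ops, rev):
    """Carry per-byte revision tags through one revision's diff."""
    result = []
    for op in ops:
        if op[0] == "copy":
            _, off, n = op
            result.extend(annotation[off:off + n])  # keep old byte tags
        else:
            _, n = op
            result.extend([rev] * n)                # new bytes get this rev
    return result

def blame_lines(final_text, annotation):
    """Split the byte-wise annotation at line boundaries; each line is
    attributed to the highest revision touching any of its bytes."""
    blamed, start = [], 0
    for line in final_text.splitlines(keepends=True):
        end = start + len(line)
        blamed.append((max(annotation[start:end]), line.rstrip("\n")))
        start = end
    return blamed

# r1 adds two lines; r2 rewrites the tail of the second line.
r1_text = "alpha\nbeta\n"
ann = apply_rev_diff([], [("new", len(r1_text))], 1)
r2_text = "alpha\nbets!\n"
# keep "alpha\nbet" (9 bytes), add "s!\n" (3 bytes)
ann = apply_rev_diff(ann, [("copy", 0, 9), ("new", 3)], 2)
print(blame_lines(r2_text, ann))  # [(1, 'alpha'), (2, 'bets!')]
```

Note that the annotation never needs the older full texts, only the final text for the line split at the end, which is where the hoped-for saving lies.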

I might also take a look at Bert's suggestion in
http://subversion.tigris.org/issues/show_bug.cgi?id=3074#desc7
(reversing the algorithm).

If anyone could help me get started with this (more specifics about
the above idea(s), or other ideas that could give a significant boost;
any pointers, parts of the code where I could get started, ...), that
would be very much appreciated.

Cheers,
Johan

Re: Looking to improve performance of svn annotate

Posted by Johan Corveleyn <jc...@gmail.com>.
On Mon, Mar 22, 2010 at 11:51 AM, Julian Foad <ju...@wandisco.com> wrote:
> On Sun, 2010-03-21, Johan Corveleyn wrote:
[ ... ]
>> (http://subversion.tigris.org/issues/show_bug.cgi?id=3074#desc5):
>>
>> [[[
>> My idea for client-side improvement:
>>
>> The client, while receiving binary diffs for successive revisions, could
>> calculate a byte-wise annotation of the file. (It probably would not have to
>> reconstruct full texts to do that.) Then, at the end, it could split the
>> byte-wise annotation at the file's line boundaries and report the highest
>> revision number within each line.
>> ]]]
>
> Hi Johan.
>
> Please accept my sincere apologies for not having replied to your
> private email about this.  I remember discussing this with you but have
> not thought more about it recently.

No problem (I wasn't sure you'd pick up my private email, so mailing
the list was plan B anyway :-)). It's probably better this way, all
the info directly on the list. Thanks a bunch for your explanation of
the algorithm. It makes a lot of sense. Now I just need to implement
it :-).

Small disclaimer: I hope I can get this rolling, but progress might be
slow at times (and my email responses as well). That's because I'll
have to work on this after I've put my kids to bed, and before my
brain starts to shut down :-) ...

[ ... snipped the actual algorithm description ... first have to chew
a little bit on it ... ]

>> I might also take a look at Bert's suggestion in
>> http://subversion.tigris.org/issues/show_bug.cgi?id=3074#desc7
>> (reversing the algorithm).
>>
>> If anyone could help me get started with this (more specifics about
>> the above idea(s), or other ideas that could give a significant boost;
>> any pointers, parts of the code where I could get started, ...), that
>> would be very much appreciated.
>
> The important first step is to do some profiling to find out how much
> time is being spent -
>
>  - on the server, calculating full texts from the stored diffs?
>
>  - transmitting data over the network?
>
>  - on the client, producing diffs?
>
>  - on the client, producing blame info from diffs?
>
> I've just re-read your message from last year
> <http://subversion.tigris.org/ds/viewMessage.do?dsForumId=1065&dsMessageId=2079688> in which you report it's mainly client-side CPU bound.  That's good to know.  (That thread is linked from the issue.)  So next, you would be well advised to profile the execution of the client code and see whether the textual diffs or some other part of the processing are taking the most time.

I'll certainly do some more profiling. I'm not too sure anymore about
my findings from last year ("mainly client-side cpu bound"). The
client is certainly a big bottleneck (client cpu is working its ass
off most of the time), but I think there is more server-boundedness
than I thought at first.

> The clear advantage of the binary diff algorithm is that on the client
> it neither calculates diffs nor even looks at the older revisions of the
> file's text, so that might be a big win if it works at all.
>
> As regards reversing the current algorithm, first you might want to find
> out what is the oldest revision number that shows up in the blame of
> your typical file, and compare that with all the changes in the file's
> history.  Is it only reporting the most recent 1% of the file's changes,
> in which case the reversed algorithm could achieve a 100-fold speed-up,
> or 20%, or 80%?

Thinking more about it, I don't think that reversing the algorithm and
stopping early will buy us much. I believe that in most cases
(certainly if you ignore whitespace changes, which always made the
most sense to me) you'll often have to go back to the earliest
revisions, even if it's just for that one single header line which has
never changed since day 1.

In the case of our XML file, I'm sure that the "xml declaration" (the
line with "<?xml version="1.0"?>"), and other things like the root tag
have not changed since the first addition of the file. I think the
same may be true for traditional source files (for Java classes, the
"public class Bla" line might still be the original one (if the class
wasn't renamed); then again, if someone adds extends or implements it
might not; but the last "}"-line probably will).

Anyway, I'll first try to dig into the binary blaming. If I get that
far, I'll see if it's still worth it to go for other optimizations ...

Johan

Re: Looking to improve performance of svn annotate

Posted by Philipp Marek <ph...@emerion.com>.
Hello Julian,

On Montag, 22. März 2010, Julian Foad wrote:
> Hi Philipp.  What do you mean exactly?  I wonder if you misunderstood
> when I said, "we read in the blamed file's text (just that one revision
> of it)".  I meant just the working revision of the file (or whichever
> revision the blame command specified as the operative version).  Not
> every revision.
thank you for your clarification.

No, that was a misunderstanding on my side - I wrongly read that after getting 
the byte ranges you're going to the server again to fetch the corresponding 
file data.

This is of course bogus, because we should already have the final text.


The idea that should still work is the incremental byte-range building - for 
an often-changed file most of the revisions could be skipped, even if the user 
didn't specify a revision range.


Thanks again!


Regards,

Phil

Re: Looking to improve performance of svn annotate

Posted by Julian Foad <ju...@wandisco.com>.
On Mon, 2010-03-22 at 12:29 +0100, Philipp Marek wrote:
> Hello Julian,
> Hello Johan,
> 
> On Montag, 22. März 2010, Julian Foad wrote:
> ...
> > My basic idea is an algorithm on the client that takes, as its input, a
> > sequence of binary diffs, one for each revision in which the file
> > changed, representing that incremental change.
> ...
> > The algorithm applies one of these diffs at a time to an in-memory data
> > structure that represents which revision is associated with each byte in
> > the file.
> ...
> > Then, at the end, we read in the blamed file's text (just that one
> > revision of it), match up the byte ranges to the lines in the file, and
> > for each line we display the highest recorded revision number that falls
> > within that line.
> please forgive me for intruding with barely any knowledge to share, but that 
> sounds awful because of the round-trips (and latencies) involved.

Hi Philipp.  What do you mean exactly?  I wonder if you misunderstood
when I said, "we read in the blamed file's text (just that one revision
of it)".  I meant just the working revision of the file (or whichever
revision the blame command specified as the operative version).  Not
every revision.

Or, if your concern is the requesting of the successive diffs for every
revision in which the file changed, well, that's the same number of
round trips as what the current code does.  (It doesn't request every
revision number in the range, only the revision numbers where the file
content changed.)  Reducing the number of round-trips would be an
additional improvement for other users, but Johan's problem is
client-side processing, not the network.

[...]
> > As regards reversing the current algorithm, first you might want to find
> > out what is the oldest revision number that shows up in the blame of
> > your typical file, and compare that with all the changes in the file's
> > history.  Is it only reporting the most recent 1% of the file's changes,
> > in which case the reversed algorithm could achieve a 100-fold speed-up,
> > or 20%, or 80%?
> Well, how about using an iterative approach? Just do the algorithm you 
> described for the revisions [N-100:N]; if there's any line that's still at 
> N-100 fetch the blame information for [N-200:N-100], and substitute the 
> 'unknown' lines; repeat as necessary.

It sounds like you are assuming the blame is being calculated on the
server, but at present the client calculates it and so the client can
stop at whichever revision fills in the last gap.

If we were to pipeline the client-server traffic, then we might request
diffs in batches like you suggest.

- Julian


> Sounds simpler than a full reverse run, and might be efficient enough - only a 
> few revisions more than absolutely necessary are fetched and analyzed.



Re: Looking to improve performance of svn annotate

Posted by Greg Hudson <gh...@MIT.EDU>.
On Tue, 2010-08-17 at 09:26 -0400, Johan Corveleyn wrote:
> Greg, could you explain a bit more what you mean by
> "edit-stream-style binary diffs", vs. the binary deltas we have now?
> Could you perhaps give an example similar to Julian's? Wouldn't you
> have the same problem with pieces of the source text being copied
> out-of-order (100 bytes from the end/middle of the source being copied
> to the beginning of the target, followed by the rest of the source)?

Let's take a look at the differences between a line-based edit stream
diff (such as what you'd see in the output of diff -e) and a binary
delta as we have in Subversion.

The most obvious difference is that the former traffics in lines, rather
than arbitrary byte ranges, but the actual differences are much deeper.
A line-based diff can be modeled with the following "instructions":

  * Copy the next N lines of source to target.
  * Skip the next N lines of source.
  * Copy the next N lines of new data to target.

After applying a diff like this, you can easily divide the target lines
into two categories: those which originated from the source, and those
which originated from the diff.  The division may not accurately
represent the intent of the change (there's the classic problem of the
mis-attributed close brace, for instance; see
http://bramcohen.livejournal.com/73318.html), but it's typically pretty
close.

Subversion binary deltas have a more flexible instruction set, more akin
to what you'd find in a compression algorithm.  The source and target
are chopped up into windows, and for each window you have:

  * Copy N bytes from offset O in the source window to target.
  * Copy N bytes from offset O in the target window to target.
  * Copy the next N bytes of new data to target.

There is no easy way to divide the target into source bytes and diff
bytes.  Certainly, you can tag which bytes were copied from the source
window, but that's meaningless.  Bytes which came from the source window
may have been rearranged by the diff; bytes which came from new data may
only have done so because of windowing.

The optimization idea is to create a new kind of diff (or more likely,
research an existing algorithm) which obeys the rules of the line-based
edit stream--no windowing, sequential access only into the source
stream--but traffics in bytes instead of lines.  With such a diff in
hand, we can divide the target bytes into source-origin and diff-origin,
and then, after splitting the target into lines, determine which lines
are "tainted" by diff-origin bytes and therefore should be viewed as
originating in the diff.
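A sketch of what such an edit-stream diff would let the client do, again with a hypothetical op format (sequential source access only, the three instructions above, bytes instead of lines):

```python
# Apply an edit-stream byte diff while tagging each target byte's
# origin, then mark lines "tainted" if any of their bytes came from
# the diff.  The op format is hypothetical.
def apply_edit_stream(source, ops):
    target, origin, pos = bytearray(), [], 0
    for op in ops:
        if op[0] == "copy":            # ("copy", n): next n source bytes
            n = op[1]
            target += source[pos:pos + n]
            origin += ["src"] * n
            pos += n
        elif op[0] == "skip":          # ("skip", n): drop n source bytes
            pos += op[1]
        else:                          # ("new", data): bytes from diff
            target += op[1]
            origin += ["diff"] * len(op[1])
    return bytes(target), origin

def tainted_lines(target, origin):
    """A line originates in the diff if any of its bytes do."""
    lines, start = [], 0
    for line in target.decode().splitlines(keepends=True):
        end = start + len(line)
        lines.append(("diff" in origin[start:end], line.rstrip("\n")))
        start = end
    return lines

src = b"one\ntwo\nthree\n"
tgt, org = apply_edit_stream(
    src, [("copy", 4), ("skip", 4), ("new", b"2\n"), ("copy", 6)])
print(tainted_lines(tgt, org))  # [(False, 'one'), (True, '2'), (False, 'three')]
```

Because the source position only moves forward, the source-origin / diff-origin split is well defined, which is exactly what the windowed delta above cannot give us.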


Re: Looking to improve performance of svn annotate

Posted by Johan Corveleyn <jc...@gmail.com>.
On Fri, Aug 20, 2010 at 9:11 PM, Johan Corveleyn <jc...@gmail.com> wrote:
> On Fri, Aug 20, 2010 at 1:40 PM, Johan Corveleyn <jc...@gmail.com> wrote:
>> After eliminating antivirus, and running with a release build instead
>> of a debug build, svn diff is just about on par with GNU diff. So this
>> eliminates the option of optimizing diff ...
>
> Unless ...
>
> For every diff during blame calculation, tokens (lines) are extracted
> and processed each time for the "original" and the "modified". This
> takes up most of the time of svn diff. However, the file that is
> playing the "modified" role in one step, will be the "original" in the
> next step of blame processing. So in theory we already have those
> tokens from the previous step, and don't have to extract/compare/...
> them again.
>
> If one could find a way to recycle the tokens from the "modified" of
> the previous diff into the tokens of the "original" of the next diff,
> that would probably make the diff part of the blame operation almost
> twice as fast. And since diffing still accounts for ~90% of blame time
> on the client, that should speed it up considerably.
>
> Sounds like a plan?
>
> I'll try to write some sort of POC for this idea soon, unless someone
> tells me it's a stupid idea :-).

Hm, after several attempts to get this working, it seems to be a dead
end. Whatever I do, I can't seem to avoid lifetime issues, dangling
pointers, tokens referring to files which are no longer there, ...

I first tried just to pass through the position_list[1] (position_list
from the modified file) to the next diff run as the new
position_list[0]. I did this by returning the position_list[1] in an
output parameter, up to the call in blame.c!add_file_blame, and then
reusing that as input parameter for the next diff call (by stashing it
away in the file_rev_baton, until the next run). That works, but it
doesn't help, because the real power of the algorithm is in the tree,
in which all the tokens from both original and modified are sorted and
made unique (i.e. each unique token is represented only once in the
tree). After the tree is correctly set up, the rest of the algo is
very fast, because it only needs to compare token references to find
the longest common subsequence.
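The uniquing idea can be sketched like this (svn_diff keeps its unique tokens in a tree keyed on token hashes; a plain dict stands in for that here):

```python
# Intern each distinct line once, so the LCS phase can compare small
# integer ids instead of line contents.
def intern_tokens(lines, table):
    """Map each line to a stable id, adding unseen lines to the table."""
    return [table.setdefault(line, len(table)) for line in lines]

table = {}
orig = intern_tokens(["a\n", "b\n", "c\n"], table)
mod  = intern_tokens(["a\n", "x\n", "c\n"], table)
print(orig, mod)  # [0, 1, 2] [0, 3, 2]
# Equal lines share an id, so the diff core compares ints: orig[i] == mod[j].
```

Reusing such a table across all the diffs of a blame run is exactly the leak described above: it keeps growing with every distinct line ever seen, and the line contents must stay reachable for later comparisons.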

At the end of each run the tree is filled with a combination of the
tokens of original and modified, so I can't just reuse/recycle that,
because that would lead to a major memory leak (building up the tree
with all the tokens from all revisions of the entire blame operation).
And refilling a clean tree with the tokens already present in the
passed-on-position_list also doesn't work, because the token content
really needs to still be accessible (either in memory, or readable
from the original file), so that token_compare can work when adding
the new tokens from the new modified file ...

Anyway, I don't know if this still makes sense. I'm just noting it
down to order my thoughts, and maybe help someone else thinking about
this. I'm gonna let it rest now, because I seem to have hit a brick
wall with this.

Will focus my efforts on other approaches to speed up blame, such as
the fundamental changes of binary blaming, avoiding diffs, ... but I'm
not sure if I'll be able to (have the time to) climb the learning
curve for that.

Cheers,
-- 
Johan

Re: Looking to improve performance of svn annotate

Posted by Johan Corveleyn <jc...@gmail.com>.
On Fri, Aug 20, 2010 at 1:40 PM, Johan Corveleyn <jc...@gmail.com> wrote:
> After eliminating antivirus, and running with a release build instead
> of a debug build, svn diff is just about on par with GNU diff. So this
> eliminates the option of optimizing diff ...

Unless ...

For every diff during blame calculation, tokens (lines) are extracted
and processed each time for the "original" and the "modified". This
takes up most of the time of svn diff. However, the file that is
playing the "modified" role in one step, will be the "original" in the
next step of blame processing. So in theory we already have those
tokens from the previous step, and don't have to extract/compare/...
them again.

If one could find a way to recycle the tokens from the "modified" of
the previous diff into the tokens of the "original" of the next diff,
that would probably make the diff part of the blame operation almost
twice as fast. And since diffing still accounts for ~90% of blame time
on the client, that should speed it up considerably.

Sounds like a plan?

I'll try to write some sort of POC for this idea soon, unless someone
tells me it's a stupid idea :-).

Cheers,
-- 
Johan

Re: Looking to improve performance of svn annotate

Posted by Johan Corveleyn <jc...@gmail.com>.
On Tue, Aug 17, 2010 at 3:26 PM, Johan Corveleyn <jc...@gmail.com> wrote:
> Another thing that occurred to me: since most time of the current
> blame implementation is spent on "diff" (svn_diff_file_diff_2), maybe
> a quick win could be to simply (?) optimize the diff code? Or write a
> specialized faster version for blame.
>
> On my tests with a 1.5 MB file (61,000 lines), svn diffing it takes
> about 500 ms on my machine. GNU diff is much faster (300 ms for the
> first run, 72 ms on following runs). This seems to indicate that there
> is much room for optimization of svn diff. Or is there something extra
> that svn diff does, necessary in the svn context?
>
> I have looked a little bit at the svn diff code, and saw that most of
> the time is spent in the while loop inside svn_diff__get_tokens in
> token.c, presumably extracting the tokens (lines) from the file(s).
> Haven't looked any further/deeper. Anybody have any brilliant
> ideas/suggestions? Or is this a bad idea, not worthy of further
> exploration :-) ?

As noted in http://svn.haxx.se/dev/archive-2010-08/0517.shtml, my
performance measurements of svn diff were quite inaccurate, to say the
least.

After eliminating antivirus, and running with a release build instead
of a debug build, svn diff is just about on par with GNU diff. So this
eliminates the option of optimizing diff ...

Cheers,
-- 
Johan

Re: Looking to improve performance of svn annotate

Posted by Johan Corveleyn <jc...@gmail.com>.
On Thu, Aug 12, 2010 at 5:30 PM, Greg Hudson <gh...@mit.edu> wrote:
> On Thu, 2010-08-12 at 10:57 -0400, Julian Foad wrote:
>> I'm wary of embedding any client functionality in the server, but I
>> guess it's worth considering if it would be that useful.  If so, let's
>> take great care to ensure it's only lightly coupled to the core server
>> logic.
>
> Again, it's possible that binary diffs between sequential revisions
> could be used for blame purposes (not the binary deltas we have now, but
> edit-stream-style binary diffs), which would decouple the
> line-processing logic from the server.
>
> (But again, I haven't thought through the problem in enough detail to be
> certain.)

If such edit-stream-style binary diffs could do the job, and they are
"fast enough" (I'm guessing that line-based vs. binary wouldn't make
that much of a difference for the eventual blame processing), it seems
like a good compromise: we get the performance benefits of
blame-oriented deltas (supposedly fast and easy to calculate blame
info from), possibly cached on the server, while still not introducing
unnecessary coupling of the server to line-processing logic.

Greg, could you explain a bit more what you mean by
"edit-stream-style binary diffs", vs. the binary deltas we have now?
Could you perhaps give an example similar to Julian's? Wouldn't you
have the same problem with pieces of the source text being copied
out-of-order (100 bytes from the end/middle of the source being copied
to the beginning of the target, followed by the rest of the source)?
Wouldn't you also have to do the work of discovering the largest
contiguous block of source text as "the main stream", so determine
that those first 100 bytes are to be interpreted as new bytes, etc?

Caching this stuff on the server would of course be ideal. Whether it
be "post-commit" or on-demand (first guy requesting the blame takes
the hit), both approaches seem good to me. Working on that would be
severely out of my league though :-). At least for now.


Another thing that occurred to me: since most time of the current
blame implementation is spent on "diff" (svn_diff_file_diff_2), maybe
a quick win could be to simply (?) optimize the diff code? Or write a
specialized faster version for blame.

On my tests with a 1.5 MB file (61,000 lines), svn diffing it takes
about 500 ms on my machine. GNU diff is much faster (300 ms for the
first run, 72 ms on following runs). This seems to indicate that there
is much room for optimization of svn diff. Or is there something extra
that svn diff does, necessary in the svn context?

I have looked a little bit at the svn diff code, and saw that most of
the time is spent in the while loop inside svn_diff__get_tokens in
token.c, presumably extracting the tokens (lines) from the file(s).
Haven't looked any further/deeper. Anybody have any brilliant
ideas/suggestions? Or is this a bad idea, not worthy of further
exploration :-) ?

BTW, I also tested with Stefan Fuhrmann's performance branch@r985697,
just for kicks (had some trouble building it on Windows, but
eventually managed to get an svn.exe out of it). The timing of svn
diff of such a large file was about the same, so that didn't help. But
maybe the branch isn't ready for prime time just yet ...

Cheers,
-- 
Johan

Re: Looking to improve performance of svn annotate

Posted by Greg Hudson <gh...@MIT.EDU>.
On Thu, 2010-08-12 at 10:57 -0400, Julian Foad wrote:
> I'm wary of embedding any client functionality in the server, but I
> guess it's worth considering if it would be that useful.  If so, let's
> take great care to ensure it's only lightly coupled to the core server
> logic.

Again, it's possible that binary diffs between sequential revisions
could be used for blame purposes (not the binary deltas we have now, but
edit-stream-style binary diffs), which would decouple the
line-processing logic from the server.

(But again, I haven't thought through the problem in enough detail to be
certain.)


Re: Looking to improve performance of svn annotate

Posted by Julian Foad <ju...@wandisco.com>.
On Thu, 2010-08-12, C. Michael Pilato wrote:
> In times past, I've wondered if the server couldn't just store line-delta
> information -- as a comparison between each FS DAG node and its immediate
> predecessor -- similar to the way that CVS does (and in addition to the
> stuff it already stores, of course).  The line-delta info could be populated
> post-commit just as the BDB backend did deltafication, or perhaps also on
> demand (to rebuild this information for older servers) via some 'svnadmin'
> command.

The server could usefully calculate and store it - but on demand and
cached, not at commit time - see below.

>   But it shouldn't ever change once calculated, right?

Well ... that depends.  The line-ending style the client wants to use
can vary over time: in order to cope with the cases where it was not
specified correctly in earlier revisions, the client may want to use the
EOL style specified by HEAD (or the latest version being blamed).  It
may be possible to calculate and store a generic data set that will be
useful whatever EOL style the client eventually decides it should be
using.

> My only concern is in dealing with the definition of a "line".  The FS layer
> today is happily file-content agnostic.  All EOL translation stuffs are
> considered voodoo of interest only to the client layers.  We could, of
> course, choose to make the FS layer pay attention to the likes of the
> svn:eol-style property, but that feels mucho icky.
> 
> Thoughts?

If the decision to calculate and store linewise info for a given file is
made at commit time by the server, it would probably want to consider
svn:mime-type or the likes, to avoid doing it unnecessarily on all
files.  Clients might then never request blame info on a majority of
files.  Conversely, a client might request blame info for a file on
which the server thought the data would not be needed (perhaps because
it was marked 'application/octet-stream' for example).

I would think that having the server calculate the info on demand and
store it in a limited-lifetime cache is more sensible, if we take a
server-assisted approach at all.  A cache would handle the cases above
better, and could also be made to handle requests with varying
definitions of EOL style if that is necessary.

I'm wary of embedding any client functionality in the server, but I
guess it's worth considering if it would be that useful.  If so, let's
take great care to ensure it's only lightly coupled to the core server
logic.

- Julian


Re: Looking to improve performance of svn annotate

Posted by "C. Michael Pilato" <cm...@collab.net>.
In times past, I've wondered if the server couldn't just store line-delta
information -- as a comparison between each FS DAG node and its immediate
predecessor -- similar to the way that CVS does (and in addition to the
stuff it already stores, of course).  The line-delta info could be populated
post-commit just as the BDB backend did deltafication, or perhaps also on
demand (to rebuild this information for older servers) via some 'svnadmin'
command.  But it shouldn't ever change once calculated, right?

My only concern is in dealing with the definition of a "line".  The FS layer
today is happily file-content agnostic.  All EOL translation stuffs are
considered voodoo of interest only to the client layers.  We could, of
course, choose to make the FS layer pay attention to the likes of the
svn:eol-style property, but that feels mucho icky.

Thoughts?

-- 
C. Michael Pilato <cm...@collab.net>
CollabNet   <>   www.collab.net   <>   Distributed Development On Demand


Re: Looking to improve performance of svn annotate

Posted by Greg Hudson <gh...@MIT.EDU>.
On Wed, 2010-08-11 at 19:14 -0400, Johan Corveleyn wrote:
> I naively thought that the server, when get_file_revs2 is called,
> would just supply the deltas which it has already stored in the
> repository. I.e. that the deltas are just the native format in which
> the stuff is kept in the back-end FS, and the server wasn't doing much
> else but iterate through the relevant files, and extract the relevant
> bits.

The server doesn't have deltas between each revision and the next (or
previous).  Instead, it has "skip-deltas" which may point backward a
large number of revisions.  This allows any revision of a file to be
reconstructed in O(log(n)) delta applications, where n is the number of
file revisions, but it makes "what the server has lying around" even
less useful for blame output.
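A sketch of why reconstruction costs O(log(n)): with a skip-delta base rule like "delta against the index with its lowest set bit cleared" (illustrative only; the exact rule Subversion's FS backends use differs in detail), the chain back to index 0 has at most log2(n)+1 links:

```python
# Illustrative skip-delta scheme: the delta base for index n is
# n with its lowest set bit cleared, so chains are logarithmic.
def delta_chain(n):
    """Indices visited when reconstructing file revision index n."""
    chain = [n]
    while n > 0:
        n &= n - 1          # clear lowest set bit
        chain.append(n)
    return chain

print(delta_chain(100))  # [100, 96, 64, 0] -- 3 applications, not 100
```

The flip side, as Greg says, is that none of these stored deltas is a delta "between each revision and the next", so they tell the blame algorithm nothing directly.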

It's probably best to think of the FS as a black box which can produce
any version of a file in reasonable time.  If you look at
svn_repos_get_file_revs2 in libsvn_repos/rev_hunt.c, you'll see the code
which produces deltas to send to the client, using
svn_fs_get_file_delta_stream.

The required code changes for this kind of optimization would be fairly
deep, I think.  You'd have to invent a new type of "diffy" delta
algorithm (either line-based or binary, but either way producing an edit
stream rather than acting like a compression algorithm), and then
parameterize a bunch of functions which produce deltas, and then have
the server-side code produce diffy deltas, and then have the client code
recognize when it's getting diffy deltas and behave more efficiently.

If the new diffy-delta algorithm isn't format-compatible with the
current encoding, you'd also need some protocol negotiation.


Re: Looking to improve performance of svn annotate

Posted by Johan Corveleyn <jc...@gmail.com>.
Thanks for the input/explanations, Julian and Greg. Some reactions below ...

On Wed, Aug 11, 2010 at 12:04 PM, Julian Foad <ju...@wandisco.com> wrote:
[...]
> I hadn't really appreciated this difficulty.  By noticing the copy
> source positions when interpreting the delta, we might be able to do
> better:
>
>  cp from r1[4:10]  "TWO\nTH"  => OK, this is r1 source text
>  cp from r1[12:14] "RE"       => "we skipped/deleted 2 source bytes"
>  cp from r1[2:8]   "E\nTWO\n" => "this is out of order, so is an add"
>
> which we could store in our intermediate data structure as
>
>  r1  "TWO\nTH"
>  r2  ""           # recording a deletion at this position
>  r1  "RE"
>  r2  "E\nTWO\n"   # recorded as an addition
>
> That's a little better, but it relies on having some robust way to
> decide which copies are "out of order" and which copies are "getting
> back into the main sequential flow of the source", taking into account
> that copy source ranges can come from behind and ahead of the "current"
> position and can overlap with other copies.  And of course taking into
> account that copies of long chunks of the source text should take
> priority over short bits when it comes to deciding which revision the
> text belongs to.
>
> I'm not sure if there's a feasible solution.

How does diff handle that? Say the target starts with a copy of the end
of the source, followed by the main part of the source from the
beginning; or just a small piece of the source that's "out of order",
followed by a bigger chunk of source that's more in the "main
sequential flow". Maybe the problems are similar?

Maybe I should study the diff code a bit ...
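One hypothetical heuristic (purely a sketch, not anything in svn_diff): treat a copy as part of the main sequential flow only if its source offset is at or beyond the furthest source byte already consumed. It runs straight into the case above:

```python
# Greedy classification of copy ops, in target order.  copies is a
# list of (src_start, length) tuples; anything "out of order" is
# attributed to the new revision.
def classify_copies(copies):
    mark, out = 0, []
    for start, length in copies:
        if start >= mark:
            out.append(("old", start, length))   # main sequential flow
            mark = start + length
        else:
            out.append(("new", start, length))   # out of order: an add
    return out

# 100 bytes from the end of the source copied to the front, then the
# other 900 bytes of the source:
print(classify_copies([(900, 100), (0, 900)]))
# [('old', 900, 100), ('new', 0, 900)]
```

The greedy rule misattributes the big 900-byte chunk because the small leading copy grabbed the high-water mark first, which is why long copies would need priority over short ones, i.e. something LCS-shaped rather than a single greedy pass.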

On Wed, Aug 11, 2010 at 4:39 PM, Greg Hudson <gh...@mit.edu> wrote:
> In the process of a blame operation, the server is already performing
> binary deltas (which aren't directly usable for blame) between each pair
> of revs.  It wouldn't necessarily be any more work for the server to
> perform line-based diffs instead, although that would be the sort of
> text processing we don't ordinarily put into the server code.
>
> Alternatively, the server could perform a more diff-like binary delta,
> where the only instructions are "copy the next N bytes from the input"
> or "insert these new bytes."  Such a binary delta could possibly be
> transformed into line-based diffs on the client, although I'm not
> completely sure of that.

I haven't studied the server part of the operation that much yet. But
this seems like it could be an interesting (additional) approach. Any
pointers where I should start looking into the source code, where this
stuff is happening?

I naively thought that the server, when get_file_revs2 is called,
would just supply the deltas which it has already stored in the
repository. I.e. that the deltas are just the native format in which
the stuff is kept in the back-end FS, and the server wasn't doing much
else but iterate through the relevant files, and extract the relevant
bits.

But it seems that's not the case, and the server is calculating (some
of) these binary deltas on the fly? In that case, yes, that seems
like another good place to optimize, or to do some
pre-processing work to make it easier for the client to calculate the
blame.


Cheers,
-- 
Johan

Re: Looking to improve performance of svn annotate

Posted by Greg Hudson <gh...@MIT.EDU>.
On Wed, 2010-08-11 at 06:04 -0400, Julian Foad wrote:
> Ah.  An interesting point which I hadn't fully digested.  I think the
> gist of Branko's point is that the delta format doesn't distinguish
> between

In the process of a blame operation, the server is already performing
binary deltas (which aren't directly usable for blame) between each pair
of revs.  It wouldn't necessarily be any more work for the server to
perform line-based diffs instead, although that would be the sort of
text processing we don't ordinarily put into the server code.

Alternatively, the server could perform a more diff-like binary delta,
where the only instructions are "copy the next N bytes from the input"
or "insert these new bytes."  Such a binary delta could possibly be
transformed into line-based diffs on the client, although I'm not
completely sure of that.
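As a rough illustration of the client-side transformation Greg speculates about, one could walk such a two-instruction op stream, track the target offset, and flag every line that overlaps inserted bytes. This is only a sketch in Python (the real code would be C), and the op format is a deliberate simplification; real svndiff windows carry more instruction types:

```python
import bisect

def changed_lines(target_text, ops):
    """ops: ("copy", length) of source bytes, or ("insert", data).
    Returns the 0-based target lines that contain any inserted bytes."""
    starts, off = [0], 0                 # byte offset of each line start
    for ch in target_text:
        off += 1
        if ch == "\n":
            starts.append(off)

    def line_of(pos):
        # index of the last line whose start offset is <= pos
        return bisect.bisect_right(starts, pos) - 1

    changed, pos = set(), 0
    for op in ops:
        if op[0] == "insert":
            data = op[1]
            changed.update(range(line_of(pos),
                                 line_of(pos + len(data) - 1) + 1))
            pos += len(data)
        else:                            # ("copy", length)
            pos += op[1]
    return sorted(changed)
```

For a target "TWO\nTHREE\nTWO\n" built as copy 6, insert "RE", copy 6, this flags only line 1, the "THREE" line. Whether this is always enough to reconstruct a proper line-based diff (deletions leave no trace here) is exactly the open question Greg raises.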


Re: Looking to improve performance of svn annotate

Posted by Julian Foad <ju...@wandisco.com>.
On Wed, 2010-08-11, Johan Corveleyn wrote:
> Hi all,
> 
> Other priorities have unfortunately kept me from focusing on the
> project of speeding up blame. But recently I've spent some time
> thinking about it, reading the other mail threads, studying the code
> and profiling a little bit. I hope I can still do something useful for
> blame, whether it be sooner or later.
> 
> I will first give some updates, info and discussion, and end up with a
> couple of questions. Any input is greatly appreciated. A lot of this
> stuff is still very fuzzy to me :-).
> 
> First, some profiling info (After sprinkling some printf's and some
> time-counters into blame.c):
> 
> - Most of the client-side blame-time (~90%) is spent on producing the
> diffs (more specifically, the call to svn_diff_file_diff_2 in the
> add_file_blame function in blame.c).
> 
> - Applying the deltas to produce full-texts in the first place accounts for ~5%.
> 
> - Time spent on producing the blame info (inserting/deleting blame
> chunks) is negligible ( < 0.5%).
> 
> 
> Coming back with this information to the threads that Mark Phippard suggested:
> 
> On Mon, Mar 22, 2010 at 11:08 PM, Mark Phippard <ma...@gmail.com> wrote:
> > If you haven't, you should review some of the old threads on speeding
> > up blame.  Dan Berlin made a lot of improvements in the SVN 1.2
> > timeframe and learned a lot about what does NOT speed it up.
> >
> > http://svn.haxx.se/dev/archive-2005-02/0275.shtml
> 
> This thread mainly ended up focusing on the "delta combiner", with the
> conclusion of replacing vdelta with xdelta. I think this was a very
> good optimization of which we now reap the benefits. If I understand
> my profiling numbers correctly, the delta combining stuff is no longer
> the biggest bottleneck (at least not on the client-side).
> 
> There is however another interesting point/question from this thread, see below.
> 
> > You should really search around for all of the threads he commented on
> > to get the complete picture.  I think he also provided ideas on what
> > more could potentially be done in the future.  Such as this one that I
> > do not recall was ever done.
> >
> > http://svn.haxx.se/dev/archive-2007-09/0459.shtml
> >
> > Maybe the thread will explain why.
> 
> The thread didn't really explain why it wasn't implemented, but the
> poster said he had little time to work on it further. The suggestions
> were improvements in the processing of the blame info (optimizations
> in iterating through the linked list of blame chunks, and discussion
> about the most optimal data structure to represent these blame
> chunks). The thread sort of ended with the question whether this would
> be a significant optimization:
> On Tue, Jan 11, 2005, Mark Benedetto King wrote:
> > I'd also be interested in profiling output from your 1750-rev file to
> > see how much time is spent in the blame_* functions themselves vs
> > the svn_diff_* functions, so that we can at least give a nod towards
> > Amdahl's Law before we run off optimizing the wrong thing.
> 
From my profiling data, it seems optimizing those blame_* functions
> would indeed not be significant. Most of the time is spent in the
> svn_diff_* functions.
> 
> 
> Which brings us back to the idea of the binary blame, and Julian's
> description of the algorithm. Now, some questions:
> 
> 1) Dan Berlin's thread prompted an interesting response from Branko
> Čibej (http://svn.haxx.se/dev/archive-2005-02/0270.shtml):
> > If by "diff format" you mean the binary delta, then... There was a time
> > when I thought this would be possible. Now I'm not so sure. The trouble
> > is that the vdelta doesn't generate an edit stream, it generates a
> > compressed block-copy stream. Which means that can never be 100% sure,
> > just from looking at the delta, which bytes in the target are really new
> > and which are just (offset) copies from the source. The only blocks you
> > can really be sure about are those that are represented by new data in
> > the delta (either NEW blocks or target copies that take data from NEW
> > blocks). This problem is made worse by our use of skip deltas in (at
> > least) the BDB back-end.
> >
> > I agree that it would be nice if the server could generate some sort of
> > byte-range oriented info, but I don't think it can be done just by
> > looking at the deltas. It's sad, I know...
> 
> What? So this idea was considered before, but rejected? Is this
> argument still valid (e.g. with xdelta instead of vdelta)? Is there no
> way around that, does it mean this simply cannot work? Maybe Branko can
> comment on that? An example or some more detailed explanation would be
> very welcome.

Ah.  An interesting point which I hadn't fully digested.  I think the
gist of Branko's point is that the delta format doesn't distinguish
between

  "copy some old text because it's still here in the corresponding
location in the new version"

and

  "copy some old text, out of order, to generate text that we would
consider to be 'new' at this position in the new version".

We want to consider each piece of old text as being candidate for at
most one piece of the target text, but the delta format doesn't give us
many clues about that.

For example, let's say the repo content of the file in two revisions is:

  r1: "ONE\nTWO\nTHE REST\n"

  r2: "TWO\nTHREE\nTWO\n"

And let's say we're using this binary method with forward deltas, for
ease of thinking about it.  (The principle would be the same with
reverse deltas.)  The delta r1->r2 might look like (and this is just one
of many possible encodings):

  cp from r1: "TWO\nTH"
  new in r2:  "RE"
  cp from r1: "E\nTWO\n"

which we can simply split into lines and get the answer we want:

  "TWO\n"   @ max(1)     => r1
  "THREE\n" @ max(1,2,1) => r2
  "TWO\n"   @ max(1)     => r1

But the delta might instead look like:

  cp from r1: "TWO\nTH"
  cp from r1: "RE"
  cp from r1: "E\nTWO\n"

With a naïve interpretation, ignoring the source positions of these
copies, this would look like none of the lines are new or modified in
r2, which is clearly not an acceptable answer.

I hadn't really appreciated this difficulty.  By noticing the copy
source positions when interpreting the delta, we might be able to do
better:

  cp from r1[4:10]  "TWO\nTH"  => OK, this is r1 source text
  cp from r1[12:14] "RE"       => "we skipped/deleted 2 source bytes"
  cp from r1[2:8]   "E\nTWO\n" => "this is out of order, so is an add"

which we could store in our intermediate data structure as

  r1  "TWO\nTH"
  r2  ""           # recording a deletion at this position
  r1  "RE"
  r2  "E\nTWO\n"   # recorded as an addition

That's a little better, but it relies on having some robust way to
decide which copies are "out of order" and which copies are "getting
back into the main sequential flow of the source", taking into account
that copy source ranges can come from behind and ahead of the "current"
position and can overlap with other copies.  And of course taking into
account that copies of long chunks of the source text should take
priority over short bits when it comes to deciding which revision the
text belongs to.

I'm not sure if there's a feasible solution.
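A crude way to make that "in order vs. out of order" call is to track a cursor over the source: copies that advance the cursor keep their old attribution, copies that jump backwards are treated as new text. This Python sketch assumes a hypothetical op format carrying explicit source ranges, and it ignores the overlap and length-weighting problems just mentioned, which is exactly where it would fall down on real data:

```python
def classify_copies(ops, new_rev):
    """ops: ("copy", src_start, length) or ("insert", data).
    Returns ("old", src_start, length) for copies that follow the main
    sequential flow of the source, and ("new", new_rev, length) for
    out-of-order copies and genuinely inserted bytes."""
    cursor, out = 0, []
    for op in ops:
        if op[0] == "copy":
            _, start, length = op
            if start >= cursor:          # advancing: surviving old text
                out.append(("old", start, length))
                cursor = start + length
            else:                        # backward jump: treat as added
                out.append(("new", new_rev, length))
        else:                            # ("insert", data)
            out.append(("new", new_rev, len(op[1])))
    return out

# The second (all-copies) encoding of the r1 -> r2 delta above:
ops = [("copy", 4, 6),    # "TWO\nTH"
       ("copy", 12, 2),   # "RE"      (skips 2 source bytes)
       ("copy", 2, 6)]    # "E\nTWO\n" (out of order)
```

classify_copies(ops, 2) attributes "TWO\nTH" and "RE" to the old text and the trailing "E\nTWO\n" to r2, the same split as the intermediate structure above; the zero-length deletion record would be added wherever the cursor skips source bytes.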


> 2) A concrete problem with the algorithm as described by Julian would
> be the delete of bytes within a line. E.g. suppose one deletes a
> single character in r60 ("oops, a boolean test was inverted, let's
> delete that stray '!'"). That deleted character wouldn't be
> represented in the binary blame structure (it's simply cut out), and
> it's obviously not present in the final text. But the line on which it
> occurred was now last edited in r60.
> 
> Current blame (diff-based) has no problem with this situation. It
> correctly shows r60 as the last revision for that line. But the binary
> algorithm loses information about the deleted bytes. That's a problem.
> I haven't really worked out how to handle that. Anybody have any
> ideas?

The deleted bytes issue is easy to handle.  The algorithm uses a data
structure to remember which revision is last known to have introduced
each contiguous range of bytes.  For each deletion, the structure must
contain a zero-length range, with the revision number in which something
was deleted that previously existed in that gap, as shown in my last
example above.
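A tiny illustration of that zero-length-range trick, using a (rev, length) span model of my own (not Subversion code):

```python
# Spans are (rev, length) pairs; a zero-length span is legal and records
# "bytes were deleted here in this revision".

def line_max_rev(spans_per_line):
    """For each line, the highest revision among its spans, including
    zero-length deletion markers."""
    return [max(rev for rev, _ in spans) for spans in spans_per_line]

# A line untouched since r7, except that a stray '!' was deleted in r60:
line_spans = [(7, 4), (60, 0), (7, 13)]
```

line_max_rev([line_spans]) reports the line as r60, which is what the current diff-based blame shows too. The splitting step does need care to assign a zero-length span that falls exactly on a line boundary to the right line.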

- Julian


Re: Looking to improve performance of svn annotate

Posted by Johan Corveleyn <jc...@gmail.com>.
Hi all,

Other priorities have unfortunately kept me from focusing on the
project of speeding up blame. But recently I've spent some time
thinking about it, reading the other mail threads, studying the code
and profiling a little bit. I hope I can still do something useful for
blame, whether it be sooner or later.

I will first give some updates, info and discussion, and end up with a
couple of questions. Any input is greatly appreciated. A lot of this
stuff is still very fuzzy to me :-).

First, some profiling info (After sprinkling some printf's and some
time-counters into blame.c):

- Most of the client-side blame-time (~90%) is spent on producing the
diffs (more specifically, the call to svn_diff_file_diff_2 in the
add_file_blame function in blame.c).

- Applying the deltas to produce full-texts in the first place accounts for ~5%.

- Time spent on producing the blame info (inserting/deleting blame
chunks) is negligible ( < 0.5%).


Coming back with this information to the threads that Mark Phippard suggested:

On Mon, Mar 22, 2010 at 11:08 PM, Mark Phippard <ma...@gmail.com> wrote:
> If you haven't, you should review some of the old threads on speeding
> up blame.  Dan Berlin made a lot of improvements in the SVN 1.2
> timeframe and learned a lot about what does NOT speed it up.
>
> http://svn.haxx.se/dev/archive-2005-02/0275.shtml

This thread mainly ended up focusing on the "delta combiner", with the
conclusion of replacing vdelta with xdelta. I think this was a very
good optimization of which we now reap the benefits. If I understand
my profiling numbers correctly, the delta combining stuff is no longer
the biggest bottleneck (at least not on the client-side).

There is however another interesting point/question from this thread, see below.

> You should really search around for all of the threads he commented on
> to get the complete picture.  I think he also provided ideas on what
> more could potentially be done in the future.  Such as this one that I
> do not recall was ever done.
>
> http://svn.haxx.se/dev/archive-2007-09/0459.shtml
>
> Maybe the thread will explain why.

The thread didn't really explain why it wasn't implemented, but the
poster said he had little time to work on it further. The suggestions
were improvements in the processing of the blame info (optimizations
in iterating through the linked list of blame chunks, and discussion
about the most optimal data structure to represent these blame
chunks). The thread sort of ended with the question whether this would
be a significant optimization:
On Tue, Jan 11, 2005, Mark Benedetto King wrote:
> I'd also be interested in profiling output from your 1750-rev file to
> see how much time is spent in the blame_* functions themselves vs
> the svn_diff_* functions, so that we can at least give a nod towards
> Amdahl's Law before we run off optimizing the wrong thing.

From my profiling data, it seems optimizing those blame_* functions
would indeed not be significant. Most of the time is spent in the
svn_diff_* functions.


Which brings us back to the idea of the binary blame, and Julian's
description of the algorithm. Now, some questions:

1) Dan Berlin's thread prompted an interesting response from Branko
Čibej (http://svn.haxx.se/dev/archive-2005-02/0270.shtml):
> If by "diff format" you mean the binary delta, then... There was a time
> when I thought this would be possible. Now I'm not so sure. The trouble
> is that the vdelta doesn't generate an edit stream, it generates a
> compressed block-copy stream. Which means that can never be 100% sure,
> just from looking at the delta, which bytes in the target are really new
> and which are just (offset) copies from the source. The only blocks you
> can really be sure about are those that are represented by new data in
> the delta (either NEW blocks or target copies that take data from NEW
> blocks). This problem is made worse by our use of skip deltas in (at
> least) the BDB back-end.
>
> I agree that it would be nice if the server could generate some sort of
> byte-range oriented info, but I don't think it can be done just by
> looking at the deltas. It's sad, I know...

What? So this idea was considered before, but rejected? Is this
argument still valid (e.g. with xdelta instead of vdelta)? Is there no
way around that, does it mean this simply cannot work? Maybe Branko can
comment on that? An example or some more detailed explanation would be
very welcome.

2) A concrete problem with the algorithm as described by Julian would
be the delete of bytes within a line. E.g. suppose one deletes a
single character in r60 ("oops, a boolean test was inverted, let's
delete that stray '!'"). That deleted character wouldn't be
represented in the binary blame structure (it's simply cut out), and
it's obviously not present in the final text. But the line on which it
occurred was now last edited in r60.

Current blame (diff-based) has no problem with this situation. It
correctly shows r60 as the last revision for that line. But the binary
algorithm loses information about the deleted bytes. That's a problem.
I haven't really worked out how to handle that. Anybody have any
ideas?

Cheers,
-- 
Johan

Re: Looking to improve performance of svn annotate

Posted by Johan Corveleyn <jc...@gmail.com>.
On Mon, Mar 22, 2010 at 11:08 PM, Mark Phippard <ma...@gmail.com> wrote:
> On Mon, Mar 22, 2010 at 5:25 PM, Johan Corveleyn <jc...@gmail.com> wrote:
>
>> However, looking at Julian's binary blame algorithm, I can't help but
>> wonder why this binary structure couldn't be calculated on the server
>> just as well. This would save a lot of network roundtrips (5999 in my
>> case :-)). Like I said, network is not an issue in our setup, but I
>> appreciate that there are other environments out there.
>
> Calculating and doing work like this on the server would not scale well.

Oh, of course. I hadn't thought about that. That seems like a very good reason.

Thinking more about this, I'm not sure that it *has* to be like that.
As I commented in the other thread: I think it all depends on how
efficient the algorithm is. Anyway, it's way too early to tell. We'll
see ...

> If you haven't, you should review some of the old threads on speeding
> up blame.  Dan Berlin made a lot of improvements in the SVN 1.2
> timeframe and learned a lot about what does NOT speed it up.
>
> http://svn.haxx.se/dev/archive-2005-02/0275.shtml
>
> You should really search around for all of the threads he commented on
> to get the complete picture.  I think he also provided ideas on what
> more could potentially be done in the future.  Such as this one that I
> do not recall was ever done.
>
> http://svn.haxx.se/dev/archive-2007-09/0459.shtml
>
> Maybe the thread will explain why.

Thanks a lot for these pointers. I'll certainly dig in to them.

--
Johan

Re: Looking to improve performance of svn annotate

Posted by Mark Phippard <ma...@gmail.com>.
On Mon, Mar 22, 2010 at 5:25 PM, Johan Corveleyn <jc...@gmail.com> wrote:

> However, looking at Julian's binary blame algorithm, I can't help but
> wonder why this binary structure couldn't be calculated on the server
> just as well. This would save a lot of network roundtrips (5999 in my
> case :-)). Like I said, network is not an issue in our setup, but I
> appreciate that there are other environments out there.

Calculating and doing work like this on the server would not scale well.

If you haven't, you should review some of the old threads on speeding
up blame.  Dan Berlin made a lot of improvements in the SVN 1.2
timeframe and learned a lot about what does NOT speed it up.

http://svn.haxx.se/dev/archive-2005-02/0275.shtml

You should really search around for all of the threads he commented on
to get the complete picture.  I think he also provided ideas on what
more could potentially be done in the future.  Such as this one that I
do not recall was ever done.

http://svn.haxx.se/dev/archive-2007-09/0459.shtml

Maybe the thread will explain why.

-- 
Thanks

Mark Phippard
http://markphip.blogspot.com/

Re: Looking to improve performance of svn annotate

Posted by Johan Corveleyn <jc...@gmail.com>.
On Mon, Mar 22, 2010 at 12:29 PM, Philipp Marek
<ph...@emerion.com> wrote:
> Hello Julian,
> Hello Johan,

Hi Philipp,

Thanks for joining the discussion. All input is really welcome. Julian
has already responded to your suggestions, but it got me thinking ...
see below.

> On Montag, 22. März 2010, Julian Foad wrote:
> ...
>> My basic idea is an algorithm on the client that takes, as its input, a
>> sequence of binary diffs, one for each revision in which the file
>> changed, representing that incremental change.
> ...
>> The algorithm applies one of these diffs at a time to an in-memory data
>> structure that represents which revision is associated with each byte in
>> the file.
> ...
>> Then, at the end, we read in the blamed file's text (just that one
>> revision of it), match up the byte ranges to the lines in the file, and
>> for each line we display the highest recorded revision number that falls
>> within that line.
> please forgive me for intruding with barely some knowledge to share, but that
> sounds awful because of the round-trips (and latencies) involved.
>
> I'd suggest trying to keep the (current) version of the file line-wise in
> memory (stored as line-number, revision, start byte, length, and actual
> bytes); an incoming binary delta would just change the data and revision,
> until the deltas are exhausted.
>
> Then the blame output can be given directly, without any more round-trips.
>
> Of course, that means splitting xdelta operations into line-wise bits; but, as
> an improvement, you could also print the source line number of a change (by
> just "counting" the lines from top to bottom when applying a delta).

As Julian already mentioned in his further response, currently the
blame is calculated on the client side. And it's true that in my case,
network is not the issue (everything is on a LAN), but client-side
processing is.

However, looking at Julian's binary blame algorithm, I can't help but
wonder why this binary structure couldn't be calculated on the server
just as well. This would save a lot of network roundtrips (5999 in my
case :-)). Like I said, network is not an issue in our setup, but I
appreciate that there are other environments out there.

Or is there a reason this can't be calculated by the server (does it
perhaps not have enough information)? I have always found it quite
strange that the blame is calculated on the client side ...

In any case, I'll probably go for the binary blaming on the client
first (more consistent with how it currently works). Making it "server
calculated" (if that's at all possible) will probably be a much bigger
change with more impact on different parts of the code (I'm just
guessing here).

Johan

Re: Looking to improve performance of svn annotate

Posted by Philipp Marek <ph...@emerion.com>.
Hello Julian,
Hello Johan,

On Montag, 22. März 2010, Julian Foad wrote:
...
> My basic idea is an algorithm on the client that takes, as its input, a
> sequence of binary diffs, one for each revision in which the file
> changed, representing that incremental change.
...
> The algorithm applies one of these diffs at a time to an in-memory data
> structure that represents which revision is associated with each byte in
> the file.
...
> Then, at the end, we read in the blamed file's text (just that one
> revision of it), match up the byte ranges to the lines in the file, and
> for each line we display the highest recorded revision number that falls
> within that line.
please forgive me for intruding with barely some knowledge to share, but that 
sounds awful because of the round-trips (and latencies) involved.

I'd suggest trying to keep the (current) version of the file line-wise in 
memory (stored as line-number, revision, start byte, length, and actual 
bytes); an incoming binary delta would just change the data and revision, 
until the deltas are exhausted.

Then the blame output can be given directly, without any more round-trips.

Of course, that means splitting xdelta operations into line-wise bits; but, as 
an improvement, you could also print the source line number of a change (by 
just "counting" the lines from top to bottom when applying a delta).
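Philipp lists the fields but not a concrete layout; as a sketch (field names are mine), the per-line record he describes might look like:

```python
from dataclasses import dataclass

@dataclass
class LineInfo:
    line_no: int      # position in the current version
    revision: int     # revision that last touched the line
    start_byte: int   # offset of the line in the current full text
    length: int       # line length in bytes
    data: bytes       # the actual line bytes

# An incoming delta that rewrites bytes inside a line just updates that
# record's data and revision in place:
line = LineInfo(3, 49, 120, 18, b"int x = compute();")
line.data, line.revision = b"int x = recompute();", 50
line.length = len(line.data)
```

The awkward part, as noted below, is splitting xdelta operations that span several such records into line-wise bits before this update can happen.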

> A backward algorithm would be a little more complex ...
Please see my comment at the end of my mail.

> I am still not completely sure whether the binary diff idea can produce
> a line-wise result that looks natural and minimal, while still being
> efficient.  For example, I suspect there may be situations where the
> binary diff will show changes within two different lines where a text
> diff would have shown a change only to one line, but I am not sure.
> Even if effects like this do occur, they might only happen in cases
> where a human reader would agree that that is a good way to describe the
> change anyway.
That might not be true, in case of an ignore-whitespace-changes option or 
something like that.
 
> The important first step is to do some profiling to find out how much
> time is being spent -
>   - on the server, calculating full texts from the stored diffs?
>   - transmitting data over the network?
>   - on the client, producing diffs?
>   - on the client, producing blame info from diffs?
That's always good advice.
 
> I've just re-read your message from last year
> <http://subversion.tigris.org/ds/viewMessage.do?dsForumId=1065&dsMessageId=
> 2079688> in which you report it's mainly client-side CPU bound.  That's
>  good to know.  (That thread is linked from the issue.)  So next, you would
>  be well advised to profile the execution of the client code and see
>  whether the textual diffs or some other part of the processing are taking
>  the most time.
> 
> The clear advantage of the binary diff algorithm is that on the client
> it neither calculated diffs nor even looks at the older revisions of the
> file's text, so that might be a big win if it works at all.
> 
> As regards reversing the current algorithm, first you might want to find
> out what is the oldest revision number that shows up in the blame of
> your typical file, and compare that with all the changes in the file's
> history.  Is it only reporting the most recent 1% of the file's changes,
> in which case the reversed algorithm could achieve a 100-fold speed-up,
> or 20%, or 80%?
Well, how about using an iterative approach? Just do the algorithm you 
described for the revisions [N-100:N]; if there's any line that's still at 
N-100 fetch the blame information for [N-200:N-100], and substitute the 
'unknown' lines; repeat as necessary.

Sounds simpler than a full reverse run, and might be efficient enough - only a 
few revisions more than absolutely necessary are fetched and analyzed.
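The windowed scheme could look roughly like this in Python. blame_window is a hypothetical stand-in for whatever per-window annotation the real client would compute; by convention it returns rev == lo for lines it could not resolve inside the window:

```python
def iterative_blame(newest_rev, oldest_rev, window, blame_window):
    """blame_window(lo, hi) -> [(line_no, rev), ...]; rev == lo means
    'unresolved, at least as old as lo'."""
    hi, result, unknown = newest_rev, {}, None   # None: all lines pending
    while hi > oldest_rev:
        lo = max(oldest_rev, hi - window)
        for line_no, rev in blame_window(lo, hi):
            if unknown is None or line_no in unknown:
                result[line_no] = rev
        # lines still pinned at the window's lower edge need older data
        unknown = {ln for ln, rev in result.items()
                   if rev == lo and lo > oldest_rev}
        if not unknown:
            break                        # every line resolved; stop early
        hi = lo
    return result
```

Only the lines still pinned at a window's lower edge trigger the next, older fetch, so in the common case where most lines are recent the loop stops after a few windows.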


Regards,

Phil

Re: Looking to improve performance of svn annotate

Posted by Julian Foad <ju...@wandisco.com>.
On Sun, 2010-03-21, Johan Corveleyn wrote:
> Hi all,
> 
> I'm very much looking forward to the performance (and other)
> improvements wc-ng will bring us. But there's more to svn performance
> than wc stuff :-). In our company, "svn annotate/blame/praise" is a
> known offender (people who attended subconf 2009 may remember me as
> the guy who complained about the annotate of a 2Mb file with 6000 revs
> taking over 4 hours).
> 
> So, in the name of "make it faster, then make it faster again", and
> feeling that maybe it's time to scratch one of my own itches, I'd like
> to take a stab at this myself. I have no idea whether I'm up to this
> task (I am a total newbie at svn development), but I can try. Until
> now I have just analyzed some parts of the source code. I have set up a
> build environment on my machine (on windows), and have been able to
> run it from inside a debugger (Visual C 2008 Express Edition), setting
> breakpoints, seeing what happens, ...
> 
> I am specifically looking at the suggestion that Julian made in the
> following comment in the issue tracker
> (http://subversion.tigris.org/issues/show_bug.cgi?id=3074#desc5):
> 
> [[[
> My idea for client-side improvement:
> 
> The client, while receiving binary diffs for successive revisions, could
> calculate a byte-wise annotation of the file. (It probably would not have to
> reconstruct full texts to do that.) Then, at the end, it could split the
> byte-wise annotation at the file's line boundaries and report the highest
> revision number within each line.
> ]]]

Hi Johan.

Please accept my sincere apologies for not having replied to your
private email about this.  I remember discussing this with you but have
not thought more about it recently.

Let me recall what I can about that binary-blame idea.

My basic idea is an algorithm on the client that takes, as its input, a
sequence of binary diffs, one for each revision in which the file
changed, representing that incremental change.  A binary diff tells
which byte ranges are added, which are unchanged and which are deleted.
The idea is to use Subversion's binary delta format which is transmitted
by the server, so that there is no need to produce full-texts and then
diff them on the client.  (That format expresses adds and copies, with
deletes being represented implicitly by the lack of any add or copy for
a particular byte range, if I recall correctly.)

The algorithm applies one of these diffs at a time to an in-memory data
structure that represents which revision is associated with each byte in
the file.  The representation can be a list of byte ranges covering the
whole length of the file; it doesn't need to store the actual data.

  # byte-range       revision
  {
    0:100         => r50
    100:101       => r49
    101:105       => r50
    105:1105      => r7
    1105:2000000  => r49
  }

(Byte ranges are shown with an exclusive end point: 0:100 means byte
numbers 0 to 99 inclusive.)

To explain the algorithm a little further, let's first assume it takes
diffs starting with the oldest revision and working forward.  (A
backward algorithm will usually terminate sooner, but the forward
algorithm is simpler to explain.)

The starting point is an empty range list, as there is no file:

  { }

The first incoming change is for revision 7 in which the file was
created, let's say with 1000 bytes of new text, so the range list looks
like:

  {
    0:1000        => r7
  }

The next change is in r49, in which one byte is added to the beginning
and 3000000 bytes to the end, so we get:

  {
    0:1           => r49
    1:1001        => r7
    1001:3001001  => r49
  }

And finally r50 adds 100 bytes at the beginning, adds 4 bytes after the
one from r49, and deletes 1001105 bytes altogether in various ranges
scattered through the last big chunk, giving the final revision range
mapping shown at the start of this example:

  {
    0:100         => r50
    100:101       => r49
    101:105       => r50
    105:1105      => r7
    1105:2000000  => r49
  }

Then, at the end, we read in the blamed file's text (just that one
revision of it), match up the byte ranges to the lines in the file, and
for each line we display the highest recorded revision number that falls
within that line.

The efficiency of this idea relies on being able quickly to apply each
incoming binary diff onto some such data structure.
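The range-list idea can be sketched in a few lines of Python (the real client would be C). The op format here, ("copy", src_offset, length) and ("new", length), is a hypothetical simplification of svndiff, assuming deletions are implicit in whatever source bytes are never copied:

```python
def slice_ranges(ranges, offset, length):
    """Return the (rev, len) spans covering bytes offset..offset+length
    of the previous revision's annotation."""
    out, pos = [], 0
    for rev, rlen in ranges:
        start = max(offset, pos)
        end = min(offset + length, pos + rlen)
        if start < end:
            out.append((rev, end - start))
        pos += rlen
    return out

def apply_delta(old_ranges, ops, new_rev):
    """Annotate the next revision: copies keep their old revisions,
    new bytes get new_rev, uncopied old bytes silently drop out."""
    new_ranges = []
    for op in ops:
        if op[0] == "copy":
            _, off, ln = op
            new_ranges.extend(slice_ranges(old_ranges, off, ln))
        else:                            # ("new", length)
            new_ranges.append((new_rev, op[1]))
    return new_ranges

def blame_lines(text, ranges):
    """Split the byte-wise annotation at line boundaries and report the
    highest revision within each line (small-file demo; O(bytes))."""
    flat = [rev for rev, rlen in ranges for _ in range(rlen)]
    revs, pos = [], 0
    for line in text.splitlines(keepends=True):
        revs.append(max(flat[pos:pos + len(line)]))
        pos += len(line)
    return revs

# The r7 and r49 steps from the example above:
r7 = apply_delta([], [("new", 1000)], 7)
r49 = apply_delta(r7, [("new", 1), ("copy", 0, 1000), ("new", 3000000)], 49)
```

r49 comes out as the range list { 0:1 => r49, 1:1001 => r7, 1001:3001001 => r49 } from the example. Merging adjacent equal-revision spans, and replacing the linear list with something like an interval tree, would be the obvious next steps for efficiency.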

A backward algorithm would be a little more complex as it also would
have to maintain a mapping between the byte ranges of the "current"
revision of the text and the byte ranges - if any - of the "blamed"
revision of the text.  The backward algorithm would start with a range
list in which the whole range is marked "unknown revision" and would
successively fill in the known ranges, until no more unknown ranges are
left.  I have not thought further about exactly how to implement this
but I can't see any theoretical difficulty.

I am still not completely sure whether the binary diff idea can produce
a line-wise result that looks natural and minimal, while still being
efficient.  For example, I suspect there may be situations where the
binary diff will show changes within two different lines where a text
diff would have showed a change only to one line, but I am not sure.
Even if effects like this do occur, they might only happen in cases
where a human reader would agree that that is a good way to describe the
change anyway.

(It is not feasible to guarantee the exact same result as the current
blame algorithm, but that is OK because there are multiple ways of
representing some changes anyway.  For example, if r49 is the three
lines { 'A', 'B', 'C' } and r50 is the four lines { 'A', 'B', 'B',
'C' }, then the blame info could be shown as { 49, 49, 50, 49 } or { 49,
50, 49, 49 }.  Either way is correct.)


> I might also take a look at Bert's suggestion in
> http://subversion.tigris.org/issues/show_bug.cgi?id=3074#desc7
> (reversing the algorithm).
> 
> If anyone could help me get started with this (more specifics about
> the above idea(s), or other ideas that could give a significant boost;
> any pointers, parts of the code where I could get started, ...), that
> would be very much appreciated.

The important first step is to do some profiling to find out how much
time is being spent -

  - on the server, calculating full texts from the stored diffs?

  - transmitting data over the network?

  - on the client, producing diffs?

  - on the client, producing blame info from diffs?

I've just re-read your message from last year
<http://subversion.tigris.org/ds/viewMessage.do?dsForumId=1065&dsMessageId=2079688>
in which you report that it's mainly client-side CPU bound.  That's
good to know.  (That thread is linked from the issue.)  So next, you
would be well advised to profile the execution of the client code and
see whether the textual diffs or some other part of the processing is
taking the most time.

The clear advantage of the binary diff algorithm is that on the client
it neither calculates diffs nor even looks at the older revisions of
the file's text, so it might be a big win if it works at all.

As regards reversing the current algorithm, first you might want to
find out the oldest revision number that shows up in the blame of your
typical file, and compare that with all the changes in the file's
history.  Is it only reporting the most recent 1% of the file's changes,
in which case the reversed algorithm could achieve a 100-fold speed-up,
or 20%, or 80%?
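
One quick way to answer that, sketched here with a hypothetical blame
output (the first whitespace-separated field of each `svn blame` line
is a revision number; compare the oldest one against the total number
of changes, which you can get from `svn log -q FILE`):

```python
def oldest_blamed_rev(blame_output):
    """Return the oldest revision appearing in `svn blame` output."""
    revs = [int(line.split()[0])
            for line in blame_output.splitlines() if line.strip()]
    return min(revs)

# Hypothetical sample of `svn blame` output:
sample = ("  6000  alice  int main(void)\n"
          "  5990  bob    {\n"
          "  6000  alice  }\n")
print(oldest_blamed_rev(sample))  # -> 5990
```

If, say, the file has 6000 changes and the oldest blamed revision is
among the most recent 60 of them, a reversed algorithm could in
principle stop after processing 1% of the history.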


Anyway, I'm very glad you want to work on this.  Welcome, and do not
hesitate to ask questions here or in IRC channel #svn-dev on Freenode
where many of us hang out most of the time, and please post your
investigation results, patches in progress, or anything else to get our
attention.

- Julian