Posted to dev@subversion.apache.org by Hugh Gibson <hg...@cix.co.uk> on 2005/01/11 13:04:00 UTC

Getting raw blame information

Hi,

I'm looking at ways of speeding up the "blame" output. For example, for 
one of our files which is about 300k with around 1750 total revisions in 
the repository, it takes around 2 minutes.

Looking at the source code (blame.c) it appears that it gets all the 
revisions, collecting information about changed lines, then gets the final 
text and matches that up with all the blame chunks.

I've noticed that performance drops as later revisions are retrieved, and 
the linked list for blame must be the problem. blame_find() does a linear 
search through the chunks, similar to blame_adjust() which has to iterate 
through all the chunks to put in an offset. So there are two O(n^2) 
processes here.

A quick win would be to make the search for *last* in blame_delete_range() 
use *start* as the starting point, not db->blame.
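
To make that concrete, here is a rough Python model of what I think the 
chunk list is doing. It is my own simplification for illustration, not the 
actual blame.c structures:

    # Simplified model of the blame chunk list (not the real blame.c code).
    # Each chunk says "lines from start_line onwards were last changed in
    # rev"; the chunks form a singly linked list.

    class Chunk:
        def __init__(self, rev, start_line):
            self.rev = rev
            self.start_line = start_line
            self.next = None

    def blame_find(head, line):
        """Linear scan for the chunk containing 'line' (O(number of chunks))."""
        found, node = None, head
        while node is not None and node.start_line <= line:
            found, node = node, node.next
        return found

    def blame_adjust(start_chunk, delta):
        """Shift every chunk from 'start_chunk' onwards (another full pass)."""
        node = start_chunk
        while node is not None:
            node.start_line += delta
            node = node.next

With a find and an adjust per changed range per revision, the total work 
grows roughly as (changes per revision) x (number of chunks), which is 
where the O(n^2) behaviour comes from. The delete-range tweak above would 
at least let the end-of-range search start from the chunk already in hand 
rather than from the head of the list.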

I wondered if a tree structure would be a lot faster (though just as bad 
in the degenerate case). Was this considered at all?

I normally work in Python these days and found blame.py in tools/examples. 
It appears to use a different strategy - doing a full diff between each 
version, which will be a lot slower. But if I can cache information 
locally then maybe I could get away without having to obtain all the log 
information. Is there anything like this around?
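
For reference, the full-diff strategy I mean looks roughly like this. This 
is my own sketch using difflib and a made-up get_text(rev) helper, not the 
actual blame.py code:

    import difflib

    def naive_blame(get_text, revs):
        """Attribute each line of the newest text to the revision that added it.

        get_text(rev) -> list of lines; revs runs oldest to newest.  Diffing
        every consecutive pair of full texts is simple but slow.
        """
        blame, prev = [], []
        for rev in revs:
            curr = get_text(rev)
            new_blame = []
            matcher = difflib.SequenceMatcher(None, prev, curr)
            for op, i1, i2, j1, j2 in matcher.get_opcodes():
                if op == 'equal':
                    new_blame.extend(blame[i1:i2])       # unchanged lines keep their rev
                else:
                    new_blame.extend([rev] * (j2 - j1))  # new/replaced lines belong to rev
            blame, prev = new_blame, curr
        return list(zip(blame, prev))

Caching the fetched texts (or the resulting blame) locally is where I'd 
hope to win back the time.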

Hugh Gibson


Re: Getting raw blame information

Posted by Hugh Gibson <hg...@cix.co.uk>.
> > Hmmm. How many unit tests are there for SVN in general?
> >  
> 
> Lots.  Really.  SVN has a great unit test framework.  

That's reassuring.

> But only two tests have been written for blame, and neither of those 
> address the blame calculation directly.

OK. 

> Run "make check" if you've got a lot of time on your hands.   You can
> look in subversion/tests/clients/cmdline/blame_tests.py for what we've
> got in the way of blame testing.

Time is what I don't have a lot of.
 
> The right thing to do is probably to generalize the blame calculation
> routines and move them to libsvn_subr (or perhaps libsvn_diff), and
> add real unit tests to subversion/tests/libsvn_subr.   This would
> lower the barrier to moving the blame calculation to the server side,
> too.

If it were on the server it wouldn't be limited by the network, so the 
algorithm would matter more. However, I'm not going to be able to afford 
the time to get into it all. I'll think about the algorithm some more, but 
the learning curve for SVN development is too steep while our small team 
is working hard to get a new product out.
 
> The wildcards are expanded by the shell on unixy systems, and by magical
> linker tricks on Windows.  That is to say, none of Subversion's code
> handles command-line wildcards directly.  That means that it's always
> okay to pass wildcards to Subversion.
> 
> Is it possible that the wildcard expanded to a directory name, and that
> blame doesn't produce a nice looking message when run on a directory?

Yes, it's more than likely that this was the first directory name. I was 
hoping that the command would run recursively (like CVS annotate, which 
works on all the files in the repository). I suppose I should just do it 
file by file.

Hugh


Re: Getting raw blame information

Posted by Mark Benedetto King <mb...@lowlatency.com>.
On Thu, Jan 13, 2005 at 09:07:00AM +0000, Hugh Gibson wrote:
> > > Do you have unit tests for it? That would be important for testing 
> > > any new implementation.
> > 
> > I wish we had unit tests for it.  That would be important for testing
> > any new implementation.
> 
> Hmmm. How many unit tests are there for SVN in general?
>  

Lots.  Really.  SVN has a great unit test framework.  But only two tests
have been written for blame, and neither of those address the blame
calculation directly.

Run "make check" if you've got a lot of time on your hands.   You can
look in subversion/tests/clients/cmdline/blame_tests.py for what we've
got in the way of blame testing.

The right thing to do is probably to generalize the blame calculation
routines and move them to libsvn_subr (or perhaps libsvn_diff), and
add real unit tests to subversion/tests/libsvn_subr.   This would
lower the barrier to moving the blame calculation to the server side,
too.

[...]

> 
> Also, is it OK to pass wildcards to svn blame? I tried using * but got an 
> error message back from the server, which appeared to be complaining about 
> a folder with just a single file in it, which had no revisions (just the 
> initial checkin). I wasn't sure if that was a problem in what I was doing, 
> or with our SVN repository, or with SVN itself.

The wildcards are expanded by the shell on unixy systems, and by magical
linker tricks on Windows.  That is to say, none of Subversion's code
handles command-line wildcards directly.  That means that it's always
okay to pass wildcards to Subversion.

Is it possible that the wildcard expanded to a directory name, and that
blame doesn't produce a nice looking message when run on a directory?

--ben



Re: Getting raw blame information

Posted by kf...@collab.net.
Mark Benedetto King <mb...@lowlatency.com> writes:
> CVS gets line-oriented deltas per revision "for free" from the fact that
> it stores the changes that way internally.  This is a huge win, since
> it doesn't have to do all the diffing again.  Also, IIUC, it does the
> blame calculation server-side, which saves lots of latency and
> bandwidth.

(That's correct.)

> > One other comment: when doing blame for multiple files the output doesn't 
> > include any information about the file that was being accessed.
> 
> We need to decide whether that's a bug or an API change before we
> fix it.  My vote is for "bug".

Agreed, FWIW.


Re: Getting raw blame information

Posted by Hugh Gibson <hg...@cix.co.uk>.
> > Do you have unit tests for it? That would be important for testing 
> > any new implementation.
> 
> I wish we had unit tests for it.  That would be important for testing
> any new implementation.

Hmmm. How many unit tests are there for SVN in general?
 
> Yes.  I think it would be reasonable to simply use an array
> implementation with good slice operations, though it would be
> nice to use a representation that handled adjacent lines from the
> same revision in a space-efficient manner.  Even a naive dynarray
> might outperform the current approach.

Yes. Less work allocating blame fragments etc. We need that profile... 

> > If I can get you the raw information (with file content omitted) would 
> > that do? 
> 
> Yes, the raw information would be fine.

It will have to wait until our trial SVN server is back up again. We're 
having problems running cvs2svn on Win32. It hangs...

> > One other comment: when doing blame for multiple files the output 
> > doesn't include any information about the file that was being 
> > accessed.
> > 
> 
> We need to decide whether that's a bug or an API change before we
> fix it.  My vote is for "bug".

Sounds good :-)

Also, is it OK to pass wildcards to svn blame? I tried using * but got an 
error message back from the server, which appeared to be complaining about 
a folder with just a single file in it, which had no revisions (just the 
initial checkin). I wasn't sure if that was a problem in what I was doing, 
or with our SVN repository, or with SVN itself.

Obviously if you use wildcards then the filename information in the output 
is more important.

Hugh


Re: Getting raw blame information

Posted by Mark Benedetto King <mb...@lowlatency.com>.
On Wed, Jan 12, 2005 at 10:26:00AM +0000, Hugh Gibson wrote:
> > > I'm looking at ways of speeding up the "blame" output. For example, 
> > > for one of our files which is about 300k with around 1750 total 
> > > revisions in the repository, it takes around 2 minutes.
> > > 
> > 
> > Most people find blame to be network-bound, which is why optimizing
> > the algorithm itself hasn't really been addressed.
> 
> Fair comment: I'll compare the time required to simply get all the log 
> information. However, the progress graph shows a classic O(n^2) behaviour, 
> with the first few hundred revisions skipping through quickly, and later 
> revisions taking more and more time (a few seconds each).

I believe you, I was just prefacing my response with an excuse.

[...]

> > > Looking at the source code (blame.c) it appears that it gets all the 
> > > revisions, collecting information about changed lines, then gets the 
> > > final text and matches that up with all the blame chunks.
> > > 
> > > I've noticed that performance drops as later revisions are retrieved, 
> > > and the linked list for blame must be the problem. blame_find() does 
> > > a linear search through the chunks, similar to blame_adjust() which 
> > > has to iterate through all the chunks to put in an offset. So there 
> > > are two O(n^2) processes here.
> > > 
> > > A quick win would be to make the search for *last* in 
> > > blame_delete_range() use *start* as the starting point, not db->blame.
> > > 
> > 
> > I think you mean "first" (not start), but yes, that should improve
> > blame_delete_range().
> 
> Yes, thanks. A couple of other thoughts for quick wins on this algorithm: 
> if you are processing a block of changes for one revision and they come in 
> line order, you could check for deletions and additions that match the 
> same line numbers and avoid doing two lots of blame_adjust. That would be 
> the case when adding comments, correcting a bug on one line, etc. Also, 
> you could look ahead to where the next change is and only do blame_adjust 
> up to that point; the blame_adjust for the remaining lines would then be 
> tweaked to include the +/- for the next change.

They do come in the order of lines.  Those are good ideas, presuming
we can't come up with a datastructure that requires less fixup work
in the first place.

> > > I wondered if a tree structure would be a lot faster (though just as 
> > > bad in the degenerate case). Was this considered at all?
> > > 
> > 
> > Yes.  I figured it would be easier to get the list implementation 
> > correct in time for the 1.0 deadline. :-)
> > 
> > Even the list implementation turned out to have a bug in it!  
> 
> Do you have unit tests for it? That would be important for testing any new 
> implementation.

I wish we had unit tests for it.  That would be important for testing
any new implementation.

>  
> > Just saying "tree" doesn't convince me, though; this list has some
> > fairly special invariants.  Do you have a specific datastructure in
> > mind?  If you're going to go to the trouble of thinking of one, you
> > might try thinking about computing blame backwards in time, rather
> > than forwards, since the backwards calculation could potentially
> > terminate early (once all lines in the blamed revision are accounted
> > for).
> 
> I've been playing with a few ideas. It's a challenge, isn't it? Maybe I 
> need to talk to the "cat" or do the washing up - I usually get my best 
> inspiration then.

Yes.  I think it would be reasonable to simply use an array
implementation with good slice operations, though it would be
nice to use a representation that handled adjacent lines from the
same revision in a space-efficient manner.  Even a naive dynarray
might outperform the current approach.
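
As a back-of-the-envelope sketch of that (plain Python, invented names, 
not a proposed API):

    class SliceBlame:
        """Blame kept as a flat list: blame[i] is the rev that owns line i.

        Insertions and deletions become slice assignments, so there is no
        explicit offset fix-up pass over the chunks that follow a change.
        """
        def __init__(self):
            self.blame = []

        def insert(self, rev, start, count):
            # 'count' new lines appear at line 'start' in revision 'rev'.
            self.blame[start:start] = [rev] * count

        def delete(self, start, count):
            del self.blame[start:start + count]

        def runs(self):
            # Collapse adjacent lines from the same rev into (rev, length)
            # pairs; the space-efficient form mentioned above.
            out = []
            for rev in self.blame:
                if out and out[-1][0] == rev:
                    out[-1][1] += 1
                else:
                    out.append([rev, 1])
            return [tuple(r) for r in out]

A real version would probably want the run-length form as the working 
representation too, but even the flat list avoids walking a chunk list to 
fix up offsets.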

> 
> > I'd also be interested in profiling output from your 1750-rev file to
> > see how much time is spent in the blame_* functions themselves vs
> > the svn_diff_* functions, so that we can at least give a nod towards
> > Amdahl's Law before we run off optimizing the wrong thing.
> 
> Absolutely. It will probably turn out to be the memory allocator or string 
> concatenation which is the problem, or the server. If I can get you the 
> raw information (with file content omitted) would that do? Note that there 
> are 1750 total revisions in the repository, and only around 228 that apply 
> to this file.

Yes, the raw information would be fine.

[...]

>  
> > Another option would be for us to calculate the changed-line
> > information at commit time (or lazily), store that in the repository,
> > and make it available to clients, but that would be a Big Change.
> 
> Yes. I guess that's something like what CVS does.

CVS gets line-oriented deltas per revision "for free" from the fact that
it stores the changes that way internally.  This is a huge win, since
it doesn't have to do all the diffing again.  Also, IIUC, it does the
blame calculation server-side, which saves lots of latency and
bandwidth.

> 
> One other comment: when doing blame for multiple files the output doesn't 
> include any information about the file that was being accessed.
> 

We need to decide whether that's a bug or an API change before we
fix it.  My vote is for "bug".

--ben



Re: Getting raw blame information

Posted by Hugh Gibson <hg...@cix.co.uk>.
> > I'm looking at ways of speeding up the "blame" output. For example, 
> > for one of our files which is about 300k with around 1750 total 
> > revisions in the repository, it takes around 2 minutes.
> > 
> 
> Most people find blame to be network-bound, which is why optimizing
> the algorithm itself hasn't really been addressed.

Fair comment: I'll compare the time required to simply get all the log 
information. However, the progress graph shows a classic O(n^2) behaviour, 
with the first few hundred revisions skipping through quickly, and later 
revisions taking more and more time (a few seconds each).
 
> Just wondering: is this a file that gets automatically appended to, or
> are the deltas distributed throughout the file?

Deltas are distributed. The file should probably be split up anyway.
 
> > Looking at the source code (blame.c) it appears that it gets all the 
> > revisions, collecting information about changed lines, then gets the 
> > final text and matches that up with all the blame chunks.
> > 
> > I've noticed that performance drops as later revisions are retrieved, 
> > and the linked list for blame must be the problem. blame_find() does 
> > a linear search through the chunks, similar to blame_adjust() which 
> > has to iterate through all the chunks to put in an offset. So there 
> > are two O(n^2) processes here.
> > 
> > A quick win would be to make the search for *last* in 
> > blame_delete_range() use *start* as the starting point, not db->blame.
> > 
> 
> I think you mean "first" (not start), but yes, that should improve
> blame_delete_range().

Yes, thanks. A couple of other thoughts for quick wins on this algorithm: 
if you are processing a block of changes for one revision and they come in 
line order, you could check for deletions and additions that match the 
same line numbers and avoid doing two lots of blame_adjust. That would be 
the case when adding comments, correcting a bug on one line, etc. Also, 
you could look ahead to where the next change is and only do blame_adjust 
up to that point; the blame_adjust for the remaining lines would then be 
tweaked to include the +/- for the next change.
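
As a toy illustration of the first idea (treating a matched delete and add 
as one replacement so the downstream shift happens once), here is some 
throwaway Python. The chunk representation (dicts with "rev", "start", 
"len") is my own invention, nothing like the real blame.c types:

    def apply_change(chunks, rev, line, deleted, added):
        """Replace 'deleted' lines at 'line' with 'added' lines from 'rev'.

        chunks is a list of {"rev", "start", "len"} dicts sorted by "start".
        Doing the delete and the add together shifts the later chunks once,
        by the net delta, instead of once per operation, and not at all
        when deleted == added.
        """
        end = line + deleted
        out = []
        for c in chunks:
            c_end = c["start"] + c["len"]
            if c_end <= line or c["start"] >= end:
                out.append(dict(c))                   # untouched by the delete
                continue
            head = max(0, line - c["start"])          # lines kept before the cut
            tail = max(0, c_end - end)                # lines kept after the cut
            if head:
                out.append({"rev": c["rev"], "start": c["start"], "len": head})
            if tail:
                out.append({"rev": c["rev"], "start": end, "len": tail})
        delta = added - deleted
        if delta:
            for c in out:                             # the single fix-up pass
                if c["start"] >= end:
                    c["start"] += delta
        if added:
            out.append({"rev": rev, "start": line, "len": added})
        out.sort(key=lambda c: c["start"])
        return out

The second idea (only adjusting as far as the next change and carrying a 
running delta) would shrink that fix-up pass even further.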

> > I wondered if a tree structure would be a lot faster (though just as 
> > bad in the degenerate case). Was this considered at all?
> > 
> 
> Yes.  I figured it would be easier to get the list implementation 
> correct in time for the 1.0 deadline. :-)
> 
> Even the list implementation turned out to have a bug in it!  

Do you have unit tests for it? That would be important for testing any new 
implementation.
 
> Just saying "tree" doesn't convince me, though; this list has some
> fairly special invariants.  Do you have a specific datastructure in
> mind?  If you're going to go to the trouble of thinking of one, you
> might try thinking about computing blame backwards in time, rather
> than forwards, since the backwards calculation could potentially
> terminate early (once all lines in the blamed revision are accounted
> for).

I've been playing with a few ideas. It's a challenge, isn't it? Maybe I 
need to talk to the "cat" or do the washing up - I usually get my best 
inspiration then.

> I'd also be interested in profiling output from your 1750-rev file to
> see how much time is spent in the blame_* functions themselves vs
> the svn_diff_* functions, so that we can at least give a nod towards
> Amdahl's Law before we run off optimizing the wrong thing.

Absolutely. It will probably turn out to be the memory allocator or string 
concatenation which is the problem, or the server. If I can get you the 
raw information (with file content omitted) would that do? Note that there 
are 1750 total revisions in the repository, and only around 228 that apply 
to this file.
 
> > I normally work in Python these days and found blame.py in 
> > tools/examples. It appears to use a different strategy - doing a full 
> > diff between each version, which will be a lot slower. But if I can 
> > cache information locally then maybe I could get away without having 
> > to obtain all the log information. Is there anything like this around?
> > 
> 
> Not that I know of, but there seems to be increasing demand for it.
> I suppose it was only a matter of time (more revisions committed every
> day!).
> 
> You might be able to extend SVN::Mirror or svk or somesuch in order
> to incrementally pull the changes into a local blame cache.  I'll bet
> you the svk guys would eagerly accept your patch.

Hmmm. I'll stick with standard SVN for the moment (which we are just 
switching to from CVS). Maybe in the future...
 
> Another option would be for us to calculate the changed-line
> information at commit time (or lazily), store that in the repository,
> and make it available to clients, but that would be a Big Change.

Yes. I guess that's something like what CVS does.

One other comment: when doing blame for multiple files the output doesn't 
include any information about the file that was being accessed.

Hugh


Re: Getting raw blame information

Posted by Mark Benedetto King <mb...@lowlatency.com>.
On Tue, Jan 11, 2005 at 01:04:00PM +0000, Hugh Gibson wrote:
> Hi,
> 
> I'm looking at ways of speeding up the "blame" output. For example, for 
> one of our files which is about 300k with around 1750 total revisions in 
> the repository, it takes around 2 minutes.
> 

Most people find blame to be network-bound, which is why optimizing
the algorithm itself hasn't really been addressed.

Just wondering: is this a file that gets automatically appended to, or
are the deltas distributed throughout the file?

> Looking at the source code (blame.c) it appears that it gets all the 
> revisions, collecting information about changed lines, then gets the final 
> text and matches that up with all the blame chunks.
> 
> I've noticed that performance drops as later revisions are retrieved, and 
> the linked list for blame must be the problem. blame_find() does a linear 
> search through the chunks, similar to blame_adjust() which has to iterate 
> through all the chunks to put in an offset. So there are two O(n^2) 
> processes here.
> 
> A quick win would be to make the search for *last* in blame_delete_range() 
> use *start* as the starting point, not db->blame.
> 

I think you mean "first" (not start), but yes, that should improve
blame_delete_range().


> I wondered if a tree structure would be a lot faster (though just as bad 
> in the degenerate case). Was this considered at all?
> 

Yes.  I figured it would be easier to get the list implementation correct
in time for the 1.0 deadline. :-)

Even the list implementation turned out to have a bug in it!  

Just saying "tree" doesn't convince me, though; this list has some
fairly special invariants.  Do you have a specific datastructure in
mind?  If you're going to go to the trouble of thinking of one, you
might try thinking about computing blame backwards in time, rather
than forwards, since the backwards calculation could potentially
terminate early (once all lines in the blamed revision are accounted
for).
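
In rough Python, the backwards calculation might look something like this, 
with difflib and a made-up get_text(rev) helper standing in for svn_diff_* 
and the RA layer:

    import difflib

    def backwards_blame(get_text, revs):
        """Blame the newest revision by walking backwards through 'revs'.

        get_text(rev) -> list of lines; revs runs oldest to newest.  The
        walk stops as soon as every line of the newest text is attributed.
        """
        revs = list(revs)
        target = get_text(revs[-1])
        blame = [None] * len(target)        # blame[i]: rev that introduced target[i]
        origin = list(range(len(target)))   # origin[j]: target line that text[j] maps to
        text = target
        remaining = len(target)

        for newer, older in zip(reversed(revs), reversed(revs[:-1])):
            old_text = get_text(older)
            matcher = difflib.SequenceMatcher(None, old_text, text)
            new_origin = [None] * len(old_text)
            for op, i1, i2, j1, j2 in matcher.get_opcodes():
                if op == 'equal':
                    # These lines already existed in 'older'; keep tracing them.
                    new_origin[i1:i2] = origin[j1:j2]
                else:
                    # Lines j1..j2 first appear in 'newer': blame them on it.
                    for j in range(j1, j2):
                        if origin[j] is not None and blame[origin[j]] is None:
                            blame[origin[j]] = newer
                            remaining -= 1
            text, origin = old_text, new_origin
            if remaining == 0:
                return list(zip(blame, target))   # terminated early
        # Anything left was already present in the oldest revision.
        for j, tgt in enumerate(origin):
            if tgt is not None and blame[tgt] is None:
                blame[tgt] = revs[0]
        return list(zip(blame, target))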

I'd also be interested in profiling output from your 1750-rev file to
see how much time is spent in the blame_* functions themselves vs
the svn_diff_* functions, so that we can at least give a nod towards
Amdahl's Law before we run off optimizing the wrong thing.

> I normally work in Python these days and found blame.py in tools/examples. 
> It appears to use a different strategy - doing a full diff between each 
> version, which will be a lot slower. But if I can cache information 
> locally then maybe I could get away without having to obtain all the log 
> information. Is there anything like this around?
> 

Not that I know of, but there seems to be increasing demand for it.
I suppose it was only a matter of time (more revisions committed every
day!).

You might be able to extend SVN::Mirror or svk or somesuch in order
to incrementally pull the changes into a local blame cache.  I'll bet
you the svk guys would eagerly accept your patch.

Another option would be for us to calculate the changed-line
information at commit time (or lazily), store that in the repository,
and make it available to clients, but that would be a Big Change.
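
Just to show the shape of the server side, a hypothetical post-commit hook 
could squirrel away each revision's diff for later use by a blame-aware 
client. The cache location and file format here are invented, and the 
client half doesn't exist:

    #!/usr/bin/env python
    # Hypothetical post-commit hook: cache each revision's diff so a blame
    # client could fetch changed-line info instead of recomputing it.
    # Subversion invokes post-commit hooks as: post-commit REPOS-PATH REV

    import os
    import subprocess
    import sys

    def main():
        repos, rev = sys.argv[1], sys.argv[2]
        cache_dir = os.path.join(repos, "blame-cache")   # invented location
        if not os.path.isdir(cache_dir):
            os.makedirs(cache_dir)
        # svnlook diff prints the full diff for the given revision.
        diff = subprocess.check_output(["svnlook", "diff", "-r", rev, repos])
        with open(os.path.join(cache_dir, rev + ".diff"), "wb") as f:
            f.write(diff)

    if __name__ == "__main__":
        main()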

--ben

