You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@subversion.apache.org by Stefan Fuhrmann <st...@alice-dsl.de> on 2010/04/25 14:51:48 UTC

[PATCH] Saving a few cycles, part 2/2

Hi devs,

here comes the second set of local optimizations.
It contains a number of simple improvements over
the original code.

-- Stefan^2.

[[[
Numerous local optimizations.

* subversion/include/svn_delta.h
  (enum svn_delta_action): ensure that these definitions
  will always have these values

* subversion/libsvn_delta/compose_delta.c
  (search_offset_index): use size_t to index memory
  (copy_source_ops): dito; optimize a common case

* subversion/libsvn_delta/svndiff.c
  (decode_file_offset, decode_size): use slightly more
  efficient formulations
  (decode_instruction): directly map action codes -
  made possible by defining the enum values
 
* subversion/libsvn_subr/checksum.c
  (svn_checksum_parse_hex): eliminate CTR function calls

* subversion/libsvn_subr/stream.c
  (stream_readline): numbytes is invariant until EOF
]]]

Re: [PATCH] Saving a few cycles, part 2/2

Posted by "Hyrum K. Wright" <hy...@mail.utexas.edu>.
On Wed, May 5, 2010 at 6:28 PM, Bolstridge, Andrew <
andy.bolstridge@intergraph.com> wrote:

> > -----Original Message-----
> > From: hyrum@hyrumwright.org [mailto:hyrum@hyrumwright.org] On Behalf
> Of
> > Hyrum K. Wright
> > Sent: Wednesday, May 05, 2010 3:55 PM
> > To: Peter Samuelson
> > Cc: dev@subversion.apache.org
> > Subject: Re: [PATCH] Saving a few cycles, part 2/2
> >
> ...
> > I would agree.  I saw the commit of Stefan's patch, and independently
> > thought about using a lookup table (though I prefer Peter's
> > implementation).
> >  I don't have any hard performance numbers, but I gotta believe the
> > lookup
> > table is simpler, as well as more performant.
> >
>
> It may be simpler, but generally not as fast as arithmetic. The problem
> (with modern processors) is that the table (or the entry you want) has
> to be loaded into the CPU cache and that may require a main RAM access
> that is very slow (relatively speaking) compared to a few maths
> operations.
>
> For example, you can do many instructions per clock cycle, but fetching
> 1 byte from main RAM (at 10ns delay) will take 30 clock cycles on a 3Ghz
> chip (and that's a lot of maths instructions.... over a thousand).
>

Completely off-topic: I'm interested in how over a thousand math
instructions can be done in only 30 clock cycles.  Even on highly parallel
data (which this example is not), and using a very long pipeline (which
modern processors top out at *maybe* 25 stages), your still talking about
over 33 ops/cycle.  My microarchitecture is a bit rusty, so I'm interested
to know how this works out.

I think it takes a lot longer than that, with slower latency RAM we use
> today and 3 levels of cache. But the point is that a lookup table is no
> longer necessarily faster than calculating the data. How times have
> changed :)
>

These are valid points, but I'm curious about them.  For instance, the
current code has function calls in it.  While the overhead might be lower,
because the compiler could pass arguments in registers instead of on the
stack, the functions most likely wouldn't already be resident in the i-cache
(at least not initially), and so would need to be loaded, just the same as
the lookup table would.  The lookup table, at only 256 bytes, fits in most
d-caches, and would stay resident for the entire loop.

The lookup table method also eliminates branches (hidden by the trinary
operators), which could lead to pipeline stalls.  Since the probability of
taking the branches is uniform (or ought to be with a cryptographic hash),
the branch predictor doesn't really have much chance of doing any good.


> Anyway, it's hardly going to be a big deal to overall performance, so
> whatever is easier to maintain is best.


+1 :)

-Hyrum

RE: [PATCH] Saving a few cycles, part 2/2

Posted by "Bolstridge, Andrew" <an...@intergraph.com>.
> -----Original Message-----
> From: hyrum@hyrumwright.org [mailto:hyrum@hyrumwright.org] On Behalf
Of
> Hyrum K. Wright
> Sent: Wednesday, May 05, 2010 3:55 PM
> To: Peter Samuelson
> Cc: dev@subversion.apache.org
> Subject: Re: [PATCH] Saving a few cycles, part 2/2
> 
...
> I would agree.  I saw the commit of Stefan's patch, and independently
> thought about using a lookup table (though I prefer Peter's
> implementation).
>  I don't have any hard performance numbers, but I gotta believe the
> lookup
> table is simpler, as well as more performant.
> 

It may be simpler, but generally not as fast as arithmetic. The problem
(with modern processors) is that the table (or the entry you want) has
to be loaded into the CPU cache and that may require a main RAM access
that is very slow (relatively speaking) compared to a few maths
operations. 

For example, you can do many instructions per clock cycle, but fetching
1 byte from main RAM (at 10ns delay) will take 30 clock cycles on a 3Ghz
chip (and that's a lot of maths instructions.... over a thousand).

I think it takes a lot longer than that, with slower latency RAM we use
today and 3 levels of cache. But the point is that a lookup table is no
longer necessarily faster than calculating the data. How times have
changed :)

Anyway, it's hardly going to be a big deal to overall performance, so
whatever is easier to maintain is best.

Re: [PATCH] Saving a few cycles, part 2/2

Posted by "Hyrum K. Wright" <hy...@mail.utexas.edu>.
On Mon, Apr 26, 2010 at 6:10 PM, Peter Samuelson <pe...@p12n.org> wrote:

>
> [Philip Martin]
> > We have svn_ctype_isalpha which is a macro that does a table lookup.
> > We don't currently have svn_ctype_isxdigit but it could be added.
> >
> > > I'd rather see the whole thing become a single table lookup.  Untested
> > > patch appended.
> >
> > Is this a bottleneck?  I think using the svn_ctype.h macros should be
> > sufficient.
>
> I don't know if it's a bottleneck.  But given that what we actually
> want to know if 'what is the value of this hex digit' - using isxdigit,
> then isalpha as a conditional, then ASCII arithmetic is awfully
> roundabout compared to just a single table lookup.
>
> I mean, the whole point of those ctype functions is to avoid ASCII
> arithmetic.  So, while I don't claim my patch will be noticeably faster
> (it will be faster but probably not noticeably so), I do claim it's
> more elegant.


I would agree.  I saw the commit of Stefan's patch, and independently
thought about using a lookup table (though I prefer Peter's implementation).
 I don't have any hard performance numbers, but I gotta believe the lookup
table is simpler, as well as more performant.

-Hyrum

Re: [PATCH] Saving a few cycles, part 2/2

Posted by Peter Samuelson <pe...@p12n.org>.
[Philip Martin]
> We have svn_ctype_isalpha which is a macro that does a table lookup.
> We don't currently have svn_ctype_isxdigit but it could be added.
> 
> > I'd rather see the whole thing become a single table lookup.  Untested
> > patch appended.
> 
> Is this a bottleneck?  I think using the svn_ctype.h macros should be
> sufficient.

I don't know if it's a bottleneck.  But given that what we actually
want to know if 'what is the value of this hex digit' - using isxdigit,
then isalpha as a conditional, then ASCII arithmetic is awfully
roundabout compared to just a single table lookup.

I mean, the whole point of those ctype functions is to avoid ASCII
arithmetic.  So, while I don't claim my patch will be noticeably faster
(it will be faster but probably not noticeably so), I do claim it's
more elegant.

Peter

Re: [PATCH] Saving a few cycles, part 2/2

Posted by Philip Martin <ph...@wandisco.com>.
Peter Samuelson <pe...@p12n.org> writes:

> [Stefan Fuhrmann]
>> --- subversion/libsvn_subr/checksum.c	(revision 937673)
>> +++ subversion/libsvn_subr/checksum.c	(working copy)
>> @@ -206,12 +206,39 @@
>>  
>>    for (i = 0; i < len; i++)
>>      {
>> -      if ((! isxdigit(hex[i * 2])) || (! isxdigit(hex[i * 2 + 1])))
>> +      char c1 = hex[i * 2];
>> +      char c2 = hex[i * 2 + 1];
>
> Hmmmm.  On the one hand, your open-coding of isxdigit will be less
> efficient than the original, which uses a table lookup.  On the other
> hand, replacing isalpha is probably for the best, since isalpha is
> locale-sensitive.

We have svn_ctype_isalpha which is a macro that does a table lookup.
We don't currently have svn_ctype_isxdigit but it could be added.

> I'd rather see the whole thing become a single table lookup.  Untested
> patch appended.

Is this a bottleneck?  I think using the svn_ctype.h macros should be
sufficient.

-- 
Philip

Re: [PATCH] Saving a few cycles, part 2/2

Posted by Peter Samuelson <pe...@p12n.org>.
[Stefan Fuhrmann]
> --- subversion/libsvn_subr/checksum.c	(revision 937673)
> +++ subversion/libsvn_subr/checksum.c	(working copy)
> @@ -206,12 +206,39 @@
>  
>    for (i = 0; i < len; i++)
>      {
> -      if ((! isxdigit(hex[i * 2])) || (! isxdigit(hex[i * 2 + 1])))
> +      char c1 = hex[i * 2];
> +      char c2 = hex[i * 2 + 1];

Hmmmm.  On the one hand, your open-coding of isxdigit will be less
efficient than the original, which uses a table lookup.  On the other
hand, replacing isalpha is probably for the best, since isalpha is
locale-sensitive.

I'd rather see the whole thing become a single table lookup.  Untested
patch appended.

Peter
[[[
* subversion/libsvn_subr/checksum.c
  (svn_checksum_parse_hex): Use a single table lookup in place of
   isxdigit, isalpha, and (foo - 'a' + 10) gymnastics.  UNTESTED.
]]]
--- subversion/libsvn_subr/checksum.c
+++ subversion/libsvn_subr/checksum.c
@@ -192,6 +192,16 @@
   int len;
   int i;
   unsigned char is_zeros = '\0';
+  static const unsigned char xdigitval[256] =
+    {
+      0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
+      0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
+      0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
+      1, 2, 3, 4, 5, 6, 7, 8, 9,10, 0, 0, 0, 0, 0, 0,   /* 0-9 */
+      0,11,12,13,14,15,16, 0, 0, 0, 0, 0, 0, 0, 0, 0,   /* A-F */
+      0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
+      0,11,12,13,14,15,16,                              /* a-f */
+    };
 
   if (hex == NULL)
     {
@@ -206,13 +216,12 @@
 
   for (i = 0; i < len; i++)
     {
-      if ((! isxdigit(hex[i * 2])) || (! isxdigit(hex[i * 2 + 1])))
+      unsigned char x1 = xdigitval((unsigned char)hex[i * 2]);
+      unsigned char x2 = xdigitval((unsigned char)hex[i * 2 + 1]);
+      if (!x1-- || !x2--)
         return svn_error_create(SVN_ERR_BAD_CHECKSUM_PARSE, NULL, NULL);
 
-      ((unsigned char *)(*checksum)->digest)[i] =
-        (( isalpha(hex[i*2]) ? hex[i*2] - 'a' + 10 : hex[i*2] - '0') << 4) |
-        ( isalpha(hex[i*2+1]) ? hex[i*2+1] - 'a' + 10 : hex[i*2+1] - '0');
-      is_zeros |= (*checksum)->digest[i];
+      is_zeros |= ((unsigned char *)(*checksum)->digest)[i] = x1 << 4 | x2;
     }
 
   if (is_zeros == '\0')

Re: [PATCH] Saving a few cycles, part 2/2

Posted by "Hyrum K. Wright" <hy...@mail.utexas.edu>.
On Mon, Apr 26, 2010 at 7:11 PM, Stefan Fuhrmann <
stefanfuhrmann@alice-dsl.de> wrote:

> Julian Foad wrote:
>
>> On Sun, 2010-04-25, Stefan Fuhrmann wrote:
>>
>>
>>> here comes the second set of local optimizations.
>>> It contains a number of simple improvements over
>>> the original code.
>>>
>>>
>>
>> Hi Stefan.  Thanks for working on optimization.  Saving a few cycles can
>> be useful if it's in frequently-executed code so that the overall gain
>> is significant.  At the same time, we are trying to keep a good balance
>> against code simplicity.  Much of our code is more complex and longer
>> than it should be already, so I want to be careful about any change that
>> makes it even longer.
>>
>> Have you got (or please could you get) some measurements for these
>> optimisations that you could show us?  From just looking at the patch, I
>> find it hard to believe that some of the changes make a consistent
>> difference, at least from the perspective of thinking about different
>> compilers.  Specific comments and questions are below.
>>
>>
> All this code is over 3 weeks old now. I simply sent
> the Easter weekend optimizing SVN, APR and zlib.
> The total performance gains are tremendous.
>

Stefan, I've been following the discussion but haven't had too much to
contribute.  I know there's been some good feedback which has led to some
interesting technical discussion.  I'm personally thankful that somebody is
doing this kind of analysis and looking at these types of places that
Subversion can be improved.  Thanks for being patient with the community as
we digest your suggestions.


> Now, I'm in the tedious process of slicing the changes
> into smaller, easily digestible chunks. Once I got them
> all posted, I plan to write an extensive analysis of the
> current SVN performance including several measurements
> for different OS, FS, patch levels, repository formats etc.
>

A report I'm anxious to see!


> Bottom line: Subversion-sized projects could technically
> be checked out in less than 1 second (wire-speed permitting).
>

That would be awesome, especially since many of our users are accessing
repositories on local networks.  There are many pieces to the puzzle of
course, but we should be saturating the network, because it's generally the
scarcest resource.

-Hyrum

Re: [PATCH] Saving a few cycles, part 2/2

Posted by Julian Foad <ju...@wandisco.com>.
Stefan Fuhrmann wrote:
> Julian Foad wrote:
> > I saw in a later part of the patch that this is so you can make the
> > function 'decode_instruction()' rely on those values.  As you know,
> > adding the explicit "=0", "=1", "=2" on the enum definition does not
> > make any difference to the compiler, it just provides a hint to the
> > human developers that these values might be significant.
> >
> > We need a stronger hint than this.  Please add code comments in both the
> > enum definition and the function 'decode_instruction()', saying "The
> > enum values must match the instruction encoding values."
> >   
> Oh! I wasn't aware of that - being a C++ guy.

You may have misunderstood me.  The rules are the same in C and C++.
What I meant is that the compiler will assign the values { 0, 1, 2 }
regardless of whether they are specified explicitly.  It only makes a
difference when the next human maintainer wants to insert another value
in the list ... and then the human sees the numbers and thinks "Hmm, if
I need to insert a new value, perhaps I should make sure I don't
re-number the old values when I insert this new value."

> But as long as comments can "fix" that, so be it ... ;)
[...]


> >>   (copy_source_ops): dito; optimize a common case
> >
> > This is optimizing use of the C 'modulo' operator ('A % B') by
> > hand-coding B=1 and B=2 to use "& (B-1)" instead.  Are you sure your C
> > compiler doesn't already do this?  My intuition expects the compiler to
> > do it.
> >   
> How would the compiler know that ptn_length is often
> 1 or 2? In the general case, such conditional code would
> be a gross pessimization.

My intuition may well be wrong.  My reasoning was that testing for 1 or
2 is very quick, and general case modulo calculation is very slow.
Therefore I think adding a test to every modulo calculation cannot be a
gross pessimization, just a small overhead.  But if typical compilers
don't do it, then your manual optimization is useful.


> >> * subversion/libsvn_delta/svndiff.c
> >>   (decode_file_offset, decode_size): use slightly more
> >>   efficient formulations
> >
> > In these two functions you are expanding two lines of expressions into
> > many simpler lines, and introducing a temporary variable to eliminate
> > multiple loads and stores.
> >
> > I am suspicious of this change.  It is the compiler's job to transform
> > an expression into a longer stream of simple statements, and often the
> > compiler will do a better job than a human.  (Yes I know hand-coded
> > optimisations can sometimes be useful.)
> >   
> Again, the compiler can't do much here: unless it summons
> the daemons of global and feedback optimization (or the
> user tells him not to care), there is no way to be *sure* that
> p and val point to disjoint memory regions. Thus, it cannot
> optimize the *val = f(*val) code.
> 
> My code saves 2 out of 3 memory accesses at the expense
> of an additional write at the end.

OK, so the temporary variable is useful.  Is splitting the expression up
into many small statements also useful?


> > Also, these changes (and the 'modulo' change above) look like the sort
> > of change that could generate faster code on one compiler but slower
> > code on another.
> >   
> I'm very conscious about the impact on other compilers
> and architectures. Therefore, I restrict my extensive arsenal
> of "coding tricks" to what should be beneficial on most
> CPUs and is hard for compilers to "screw up".

Thanks.

> > If you could show the numbers, that would help me/us to understand the
> > trade-off.
> >  
> These changes were added in the later stages of my
> optimization session. At that point, the major parts had
> already gotten streamlined leaving other sections where
> runtime seeped through cracks all over the place.
> 
> Each of the changes "putties" one of the cracks I came
> across. So, individual savings are in the .5 .. 2% range
> but the total effect is somewhere between 3 and 10%
> (depending on platform, compiler, etc.).

Thanks.  My problem is I find it hard to believe that some of these
changes 

- Julian


Re: [PATCH] Saving a few cycles, part 2/2

Posted by Alexey Neyman <st...@att.net>.
On Monday 26 April 2010 05:11:25 pm Stefan Fuhrmann wrote:
> > I saw in a later part of the patch that this is so you can make the
> > function 'decode_instruction()' rely on those values.  As you know,
> > adding the explicit "=0", "=1", "=2" on the enum definition does not
> > make any difference to the compiler, it just provides a hint to the
> > human developers that these values might be significant.
> >
> > We need a stronger hint than this.  Please add code comments in both
> > the enum definition and the function 'decode_instruction()', saying
> > "The enum values must match the instruction encoding values."
>
> Oh! I wasn't aware of that - being a C++ guy.
> But as long as comments can "fix" that, so be it ... ;)

For such things, I find it useful to have some compile-time checks. Since 
the preprocessor cannot compare enumerated values or some other values 
known only at the compilation time, I use something like:

#define compile_assert(x) extern int __hack__compile_assert[(x) ? 1 : -1]

Then, you could add such checks (with comments explaining why this values 
are important) to one of the source files:

compile_assert(svn_txdelta_source == 0);
compile_assert(svn_txdelta_target == 1);
compile_assert(svn_txdelta_new == 2);

Disclaimer: I am not a Subversion developer, so I am not sure if 
Subversion community is willing to have such assertions in the code.

Regards,
Alexey.

Re: [PATCH] Saving a few cycles, part 2/2

Posted by Stefan Fuhrmann <st...@alice-dsl.de>.
Julian Foad wrote:
> On Sun, 2010-04-25, Stefan Fuhrmann wrote:
>   
>> here comes the second set of local optimizations.
>> It contains a number of simple improvements over
>> the original code.
>>     
>
> Hi Stefan.  Thanks for working on optimization.  Saving a few cycles can
> be useful if it's in frequently-executed code so that the overall gain
> is significant.  At the same time, we are trying to keep a good balance
> against code simplicity.  Much of our code is more complex and longer
> than it should be already, so I want to be careful about any change that
> makes it even longer.
>
> Have you got (or please could you get) some measurements for these
> optimisations that you could show us?  From just looking at the patch, I
> find it hard to believe that some of the changes make a consistent
> difference, at least from the perspective of thinking about different
> compilers.  Specific comments and questions are below.
>   
All this code is over 3 weeks old now. I simply sent
the Easter weekend optimizing SVN, APR and zlib.
The total performance gains are tremendous.

Now, I'm in the tedious process of slicing the changes
into smaller, easily digestible chunks. Once I got them
all posted, I plan to write an extensive analysis of the
current SVN performance including several measurements
for different OS, FS, patch levels, repository formats etc.

Bottom line: Subversion-sized projects could technically
be checked out in less than 1 second (wire-speed permitting).
> Also, in some cases you need to include a comment in the code saying
> "This is done this way because it is faster than doing it the simple
> way," otherwise the next person to edit the code will "improve" it back
> to the way it was before.
>   
Good point. Will do.
>   
>> [[[
>> Numerous local optimizations.
>>     
>
> When you write these log messages, please could you write a few words
> about each change to explain why it's an improvement.  They are all
> obvious to you, of course, but when the rest of us look at the patch,
> some of the changes are not obvious at first.  It would be good if each
> us us could read through your patch and think to ourselves, "Yes, I
> agree with Stefan's reasoning for that change."
>
>
>   
>> * subversion/include/svn_delta.h
>>   (enum svn_delta_action): ensure that these definitions
>>   will always have these values
>>     
>
> I saw in a later part of the patch that this is so you can make the
> function 'decode_instruction()' rely on those values.  As you know,
> adding the explicit "=0", "=1", "=2" on the enum definition does not
> make any difference to the compiler, it just provides a hint to the
> human developers that these values might be significant.
>
> We need a stronger hint than this.  Please add code comments in both the
> enum definition and the function 'decode_instruction()', saying "The
> enum values must match the instruction encoding values."
>   
Oh! I wasn't aware of that - being a C++ guy.
But as long as comments can "fix" that, so be it ... ;)
>   
>> * subversion/libsvn_delta/compose_delta.c
>>   (search_offset_index): use size_t to index memory
>>     
>
> It looks like that is just for consistency with the other arguments.  Or
> does it play a part in speed optimization as well?
>   
Basically, it's a consistency improvement. However,
on LP64 and LLP64 systems, it saves a conversion
from int -> ptrdiff_t.
>   
>>   (copy_source_ops): dito; optimize a common case
>>     
>
> This is optimizing use of the C 'modulo' operator ('A % B') by
> hand-coding B=1 and B=2 to use "& (B-1)" instead.  Are you sure your C
> compiler doesn't already do this?  My intuition expects the compiler to
> do it.
>   
How would the compiler know that ptn_length is often
1 or 2? In the general case, such conditional code would
be a gross pessimization.
>   
>> * subversion/libsvn_delta/svndiff.c
>>   (decode_file_offset, decode_size): use slightly more
>>   efficient formulations
>>     
>
> In these two functions you are expanding two lines of expressions into
> many simpler lines, and introducing a temporary variable to eliminate
> multiple loads and stores.
>
> I am suspicious of this change.  It is the compiler's job to transform
> an expression into a longer stream of simple statements, and often the
> compiler will do a better job than a human.  (Yes I know hand-coded
> optimisations can sometimes be useful.)
>   
Again, the compiler can't do much here: unless it summons
the daemons of global and feedback optimization (or the
user tells him not to care), there is no way to be *sure* that
p and val point to disjoint memory regions. Thus, it cannot
optimize the *val = f(*val) code.

My code saves 2 out of 3 memory accesses at the expense
of an additional write at the end.
> Also, these changes (and the 'modulo' change above) look like the sort
> of change that could generate faster code on one compiler but slower
> code on another.
>   
I'm very conscious about the impact on other compilers
and architectures. Therefore, I restrict my extensive arsenal
of "coding tricks" to what should be beneficial on most
CPUs and is hard for compilers to "screw up".
> If you could show the numbers, that would help me/us to understand the
> trade-off.
>
>   
These changes were added in the later stages of my
optimization session. At that point, the major parts had
already gotten streamlined leaving other sections where
runtime seeped through cracks all over the place.

Each of the changes "putties" one of the cracks I came
across. So, individual savings are in the .5 .. 2% range
but the total effect is somewhere between 3 and 10%
(depending on platform, compiler, etc.).
>>   (decode_instruction): directly map action codes -
>>   made possible by defining the enum values
>>  
>> * subversion/libsvn_subr/checksum.c
>>   (svn_checksum_parse_hex): eliminate CTR function calls
>>     
>
> Did you mean 'CRT' (C Run-Time library)?  If so, would
> svn_ctype_isxdigit() be useful, or does calling that have the same
> overhead as calling the CRT?  It seems wrong to write out ASCII-hex
> conversion functions in long-hand code, and is the sort of code I would
> normally replace by a function call whenever I see it.  At least make it
> a separate function or macro in the source file, with a comment
> explaining why the system 'isxdigit' or 'svn_ctype_isxdigit' is not
> being used.
>
>   
My problem was that I couldn't see what CRT function
took *that* long. Parsing a checksum should be so rare
that it wouldn't even show up in the profiler. But it did.

Elsethread, isalpha() got blamed for it. I will change my
code using the feedback I got.

-- Stefan^2.

Re: [PATCH] Saving a few cycles, part 2/2

Posted by Julian Foad <ju...@wandisco.com>.
On Sun, 2010-04-25, Stefan Fuhrmann wrote:
> here comes the second set of local optimizations.
> It contains a number of simple improvements over
> the original code.

Hi Stefan.  Thanks for working on optimization.  Saving a few cycles can
be useful if it's in frequently-executed code so that the overall gain
is significant.  At the same time, we are trying to keep a good balance
against code simplicity.  Much of our code is more complex and longer
than it should be already, so I want to be careful about any change that
makes it even longer.

Have you got (or please could you get) some measurements for these
optimisations that you could show us?  From just looking at the patch, I
find it hard to believe that some of the changes make a consistent
difference, at least from the perspective of thinking about different
compilers.  Specific comments and questions are below.

Also, in some cases you need to include a comment in the code saying
"This is done this way because it is faster than doing it the simple
way," otherwise the next person to edit the code will "improve" it back
to the way it was before.


> [[[
> Numerous local optimizations.

When you write these log messages, please could you write a few words
about each change to explain why it's an improvement.  They are all
obvious to you, of course, but when the rest of us look at the patch,
some of the changes are not obvious at first.  It would be good if each
us us could read through your patch and think to ourselves, "Yes, I
agree with Stefan's reasoning for that change."


> * subversion/include/svn_delta.h
>   (enum svn_delta_action): ensure that these definitions
>   will always have these values

I saw in a later part of the patch that this is so you can make the
function 'decode_instruction()' rely on those values.  As you know,
adding the explicit "=0", "=1", "=2" on the enum definition does not
make any difference to the compiler, it just provides a hint to the
human developers that these values might be significant.

We need a stronger hint than this.  Please add code comments in both the
enum definition and the function 'decode_instruction()', saying "The
enum values must match the instruction encoding values."


> * subversion/libsvn_delta/compose_delta.c
>   (search_offset_index): use size_t to index memory

It looks like that is just for consistency with the other arguments.  Or
does it play a part in speed optimization as well?

>   (copy_source_ops): dito; optimize a common case

This is optimizing use of the C 'modulo' operator ('A % B') by
hand-coding B=1 and B=2 to use "& (B-1)" instead.  Are you sure your C
compiler doesn't already do this?  My intuition expects the compiler to
do it.

> * subversion/libsvn_delta/svndiff.c
>   (decode_file_offset, decode_size): use slightly more
>   efficient formulations

In these two functions you are expanding two lines of expressions into
many simpler lines, and introducing a temporary variable to eliminate
multiple loads and stores.

I am suspicious of this change.  It is the compiler's job to transform
an expression into a longer stream of simple statements, and often the
compiler will do a better job than a human.  (Yes I know hand-coded
optimisations can sometimes be useful.)

Also, these changes (and the 'modulo' change above) look like the sort
of change that could generate faster code on one compiler but slower
code on another.

If you could show the numbers, that would help me/us to understand the
trade-off.


>   (decode_instruction): directly map action codes -
>   made possible by defining the enum values
>  
> * subversion/libsvn_subr/checksum.c
>   (svn_checksum_parse_hex): eliminate CTR function calls

Did you mean 'CRT' (C Run-Time library)?  If so, would
svn_ctype_isxdigit() be useful, or does calling that have the same
overhead as calling the CRT?  It seems wrong to write out ASCII-hex
conversion functions in long-hand code, and is the sort of code I would
normally replace by a function call whenever I see it.  At least make it
a separate function or macro in the source file, with a comment
explaining why the system 'isxdigit' or 'svn_ctype_isxdigit' is not
being used.


> * subversion/libsvn_subr/stream.c
>   (stream_readline): numbytes is invariant until EOF
> ]]]


> plain text document attachment (PerformanceNumerics.patch):
> Index: subversion/include/svn_delta.h
> ===================================================================
> --- subversion/include/svn_delta.h	(revision 937673)
> +++ subversion/include/svn_delta.h	(working copy)
> @@ -100,7 +100,7 @@
>       * It must be the case that 0 <= @a offset < @a offset +
>       * @a length <= size of source view.
>       */
> -    svn_txdelta_source,
> +    svn_txdelta_source = 0,
>  
>      /** Append the @a length bytes at @a offset in the target view, to the
>       * target.
> @@ -119,7 +119,7 @@
>       * useful in encoding long runs of consecutive characters, long runs
>       * of CR/LF pairs, etc.
>       */
> -    svn_txdelta_target,
> +    svn_txdelta_target = 1,
>  
>      /** Append the @a length bytes at @a offset in the window's @a new string
>       * to the target.
> @@ -129,7 +129,7 @@
>       * order with no overlap at the moment; svn_txdelta_to_svndiff()
>       * depends on this.
>       */
> -    svn_txdelta_new
> +    svn_txdelta_new = 2
>  };
>  
>  /** A single text delta instruction.  */
> Index: subversion/libsvn_delta/compose_delta.c
> ===================================================================
> --- subversion/libsvn_delta/compose_delta.c	(revision 937673)
> +++ subversion/libsvn_delta/compose_delta.c	(working copy)
> @@ -165,10 +165,10 @@
>     as hint because most lookups come as a sequence of decreasing values
>     for OFFSET and they concentrate on the lower end of the array. */
>  
> -static int
> -search_offset_index(const offset_index_t *ndx, apr_size_t offset, int hint)
> +static apr_size_t
> +search_offset_index(const offset_index_t *ndx, apr_size_t offset, apr_size_t hint)
>  {
> -  int lo, hi, op;
> +  apr_size_t lo, hi, op;
>  
>    assert(offset < ndx->offs[ndx->length]);
>  
> @@ -635,13 +635,13 @@
>  static void
>  copy_source_ops(apr_size_t offset, apr_size_t limit,  
>                  apr_size_t target_offset,
> -                int hint,
> +                apr_size_t hint,
>                  svn_txdelta__ops_baton_t *build_baton,
>                  const svn_txdelta_window_t *window,
>                  const offset_index_t *ndx,
>                  apr_pool_t *pool)
>  {
> -  int op_ndx = search_offset_index(ndx, offset, hint);
> +  apr_size_t op_ndx = search_offset_index(ndx, offset, hint);
>    for (;; ++op_ndx)
>      {
>        const svn_txdelta_op_t *const op = &window->ops[op_ndx];
> @@ -694,7 +694,10 @@
>                   The idea here is to transpose the pattern, then generate
>                   another overlapping copy. */
>                const apr_size_t ptn_length = off[0] - op->offset;
> -              const apr_size_t ptn_overlap = fix_offset % ptn_length;
> +              const apr_size_t ptn_overlap = ptn_length > 2 
> +                                           ? fix_offset % ptn_length 
> +                                           : fix_offset & (ptn_length-1);
> +
>                apr_size_t fix_off = fix_offset;
>                apr_size_t tgt_off = target_offset;
>                assert(ptn_length > ptn_overlap);
> Index: subversion/libsvn_delta/svndiff.c
> ===================================================================
> --- subversion/libsvn_delta/svndiff.c	(revision 937673)
> +++ subversion/libsvn_delta/svndiff.c	(working copy)
> @@ -361,16 +361,24 @@
>                     const unsigned char *p,
>                     const unsigned char *end)
>  {
> +  svn_filesize_t temp = 0;
>    if (p + MAX_ENCODED_INT_LEN < end)
>      end = p + MAX_ENCODED_INT_LEN;
>    /* Decode bytes until we're done.  */
> -  *val = 0;
>    while (p < end)
>      {
> -      *val = (*val << 7) | (*p & 0x7f);
> -      if (((*p++ >> 7) & 0x1) == 0)
> +      unsigned c = *p;

Here 'c' is 'unsigned' meaning 'unsigned int'.  (Please write 'unsigned
int' in full if that's intended.)  Shouldn't it be 'unsigned char', or
do you think using an int is faster?  Or, as in the next change below,
did you want it to match the type of 'temp' - in this case
svn_filesize_t?

> +      temp = (temp << 7) | (c & 0x7f);
> +      ++p;
> +
> +      if (c < 0x80)
> +      {
> +        *val = temp;
>          return p;
> +      }
>      }
> +
> +  *val = temp;
>    return NULL;
>  }
>  
> @@ -382,16 +390,24 @@
>              const unsigned char *p,
>              const unsigned char *end)
>  {
> +  apr_size_t temp = 0;

Style nit: please put a blank line after the declarations block, here
and after declaring 'c' below, and also in the function above.

>    if (p + MAX_ENCODED_INT_LEN < end)
>      end = p + MAX_ENCODED_INT_LEN;
>    /* Decode bytes until we're done.  */
> -  *val = 0;
>    while (p < end)
>      {
> -      *val = (*val << 7) | (*p & 0x7f);
> -      if (((*p++ >> 7) & 0x1) == 0)
> +      apr_size_t c = *p;

(By contrast, here 'c' is 'apr_size_t'.)

> +      temp = (temp << 7) | (c & 0x7f);
> +      ++p;
> +
> +      if (c < 0x80)
> +      {
> +        *val = temp;
>          return p;
> +      }
>      }
> +
> +  *val = temp;
>    return NULL;
>  }
>  
> @@ -458,27 +474,30 @@
>                     const unsigned char *p,
>                     const unsigned char *end)
>  {
> +  apr_size_t c;
> +  apr_size_t action;
> +
>    if (p == end)
>      return NULL;
>  
>    /* Decode the instruction selector.  */
> -  switch ((*p >> 6) & 0x3)
> -    {
> -    case 0x0: op->action_code = svn_txdelta_source; break;
> -    case 0x1: op->action_code = svn_txdelta_target; break;
> -    case 0x2: op->action_code = svn_txdelta_new; break;
> -    case 0x3: return NULL;
> -    }
> +  c = *p;
> +  p++;

Write "c = *p++".  (Try not to make the code more long-winded than
necessary.)

> +  action = (c >> 6) & 0x3;
> +  if (action >= 0x3)
> +      return NULL;
>  
> +  op->action_code = (enum svn_delta_action)(action);
> +
>    /* Decode the length and offset.  */
> -  op->length = *p++ & 0x3f;
> +  op->length = c & 0x3f;
>    if (op->length == 0)
>      {
>        p = decode_size(&op->length, p, end);
>        if (p == NULL)
>          return NULL;
>      }
> -  if (op->action_code != svn_txdelta_new)
> +  if (action != svn_txdelta_new)
>      {
>        p = decode_size(&op->offset, p, end);
>        if (p == NULL)
> Index: subversion/libsvn_subr/checksum.c
> ===================================================================
> --- subversion/libsvn_subr/checksum.c	(revision 937673)
> +++ subversion/libsvn_subr/checksum.c	(working copy)
> @@ -206,12 +206,39 @@
>  
>    for (i = 0; i < len; i++)
>      {
> -      if ((! isxdigit(hex[i * 2])) || (! isxdigit(hex[i * 2 + 1])))
> +      char c1 = hex[i * 2];
> +      char c2 = hex[i * 2 + 1];
> +
> +      svn_boolean_t bad = FALSE;
> +      unsigned char upper;
> +      unsigned char lower;
> +
> +      if (c1 > '9')
> +        {
> +          bad = (c1 > 'f') || (c1 < 'a');
> +          upper = c1 - 'a' + 10;
> +        }
> +      else
> +        {
> +          bad = c1 < '0';
> +          upper = c1 - '0';
> +        }
> +
> +      if (c2 > '9')
> +        {
> +          bad |= (c2 > 'f') || (c2 < 'a');
> +          lower = c2 - 'a' + 10;
> +        }
> +      else
> +        {
> +          bad |= c2 < '0';
> +          lower = c2 - '0';
> +        }
> +
> +      if (bad)
>          return svn_error_create(SVN_ERR_BAD_CHECKSUM_PARSE, NULL, NULL);
>  
> -      ((unsigned char *)(*checksum)->digest)[i] =
> -        (( isalpha(hex[i*2]) ? hex[i*2] - 'a' + 10 : hex[i*2] - '0') << 4) |
> -        ( isalpha(hex[i*2+1]) ? hex[i*2+1] - 'a' + 10 : hex[i*2+1] - '0');
> +      ((unsigned char *)(*checksum)->digest)[i] = (upper << 4) | lower;
>        is_zeros |= (*checksum)->digest[i];
>      }
>  
> Index: subversion/libsvn_subr/stream.c
> ===================================================================
> --- subversion/libsvn_subr/stream.c	(revision 937673)
> +++ subversion/libsvn_subr/stream.c	(working copy)
> @@ -355,9 +355,9 @@
>  
>        /* Read into STR up to and including the next EOL sequence. */
>        match = eol_str;
> +      numbytes = 1;
>        while (*match)
>          {
> -          numbytes = 1;

'numbytes' is indeed invariant until the end of the loop, but this
change is surely in the category of micro-optimisation.  Unless you show
me a significant measured gain, we should instead move the declaration
of 'numbytes' *into* the scope of this block, for code readability.

Then we should re-design the whole "read line" approach to avoid the
need for one-character-at-a-time calls to the stream's "read" function.
(Such a re-design would require API changes.)

>            SVN_ERR(svn_stream_read(stream, &c, &numbytes));
>            if (numbytes != 1)
>              {
>  

- Julian