Posted to dev@subversion.apache.org by Julian Foad <ju...@wandisco.com> on 2010/10/07 14:07:42 UTC

Re: Performance optimization - svn_stringbuf_appendbyte()

> New Revision: 997203
> 
> URL: http://svn.apache.org/viewvc?rev=997203&view=rev
> Log:
> Merge r985037, r985046, r995507 and r995603 from the performance branch.
> 
> These changes introduce the svn_stringbuf_appendbyte() function, which has
> significantly less overhead than svn_stringbuf_appendbytes(), and can be
> used in a number of places within our codebase.

Hi Stefan2.

I have been wondering about the actual benefit of such tightly
hand-optimized code.  Today I got around to giving it a quick spin.

In my first test, it made no difference whatsoever, because the
optimized code is never executed.  The recent merge r1002898 of r1001413
from the performance branch introduced a bug:

-  if (str->blocksize > old_len + 1)
+  if (str->blocksize < old_len + 1)

which totally disables the optimized code path.

Fixed in r1005437.

That solved, a loop doing a million simple appendbyte calls runs more
than twice as fast as calling appendbytes(..., 1).  That's fantastic.

But what about that hand-optimization?  I wrote a simple version of the
function, and called it 'appendbyte0':

static void
svn_stringbuf_appendbyte0(svn_stringbuf_t *str, char byte)
{
  if (str->blocksize > str->len + 1)
    {
      str->data[str->len++] = byte;
      str->data[str->len] = '\0';
    }
  else
    svn_stringbuf_appendbytes(str, &byte, 1);
}

Here are the results (see full patch attached):

Times:  appendbytes appendbyte0 appendbyte  (in ms)
run:          89          31          34
run:          88          30          35
run:          88          31          34
run:          88          30          34
run:          88          31          34
min:          88          30          34

This tells me that the hand-optimization is actually harmful and the
compiler does a 10% better job by itself.

Have I made a mistake?

What are the results for your system?

(I'm using GCC 4.4.1 on an Intel Centrino laptop CPU.)

- Julian


Re: Performance optimization - svn_stringbuf_appendbyte()

Posted by Philip Martin <ph...@wandisco.com>.
Julian Foad <ju...@wandisco.com> writes:

> This tells me that the hand-optimization is actually harmful and the
> compiler does a 10% better job by itself.
>
> Have I made a mistake?
>
> What are the results for your system?
>
> (I'm using GCC 4.4.1 on an Intel Centrino laptop CPU.)

On my system (Intel Core 2 Duo, GCC 4.3.2, 64bit), the difference is
smaller, about 5%.

-- 
Philip

Re: Performance optimization - svn_stringbuf_appendbyte()

Posted by Stefan Fuhrmann <st...@alice-dsl.de>.
On 11.10.2010 17:07, Julian Foad wrote:
> On Sun, 2010-10-10, Stefan Fuhrmann wrote:
>> On 07.10.2010 16:07, Julian Foad wrote:
>>>> New Revision: 997203
>>>>
>>>> URL: http://svn.apache.org/viewvc?rev=997203&view=rev
>>>> Log:
>>>> Merge r985037, r985046, r995507 and r995603 from the performance branch.
>>>>
>>>> These changes introduce the svn_stringbuf_appendbyte() function, which has
>>>> significantly less overhead than svn_stringbuf_appendbytes(), and can be
>>>> used in a number of places within our codebase.
>>> Here are the results (see full patch attached):
>>>
>>> Times:  appendbytes appendbyte0 appendbyte  (in ms)
>>> run:          89          31          34
>>> run:          88          30          35
>>> run:          88          31          34
>>> run:          88          30          34
>>> run:          88          31          34
>>> min:          88          30          34
>>>
>>> This tells me that the hand-optimization is actually harmful and the
>>> compiler does a 10% better job by itself.
>>>
>>> Have I made a mistake?
>> My guess is that you might have made two mistakes actually.
> Heh, OK.
>
>> First, the number of operations was relatively low - everything
>> in the low ms range could be due to secondary effects (and
>> still be repeatable).
> I can't think of any reason why.  I ran the whole benchmark lots of
> times.  Occasionally some of the times were a big chunk longer due to
> other system activity.  Normally they were stable.  I also ran it
> several times with 10 million ops instead of 1 million, and the results
> were exactly 10x longer with the same degree of variability.
>
OK. It is just that sometimes, process startup or caching
artifacts (especially running tests in a particular order)
result in reproducible effects of that magnitude. But obviously
that was not the case here.
>> The most important problem would be compiler optimizations
>> or lack thereof. Since the numbers are very large (50..100
>> ticks per byte, depending on your CPU clock), I would assume
>> that you used a normal debug build for testing. In that case,
> You're right.  I'll try again, with "--disable-debug -O2".
Gotcha! The optimizer's impact is massive on that kind
of code.

>> the actual number of C statements has a large impact on
>> the execution time. See my results below for details.
>>> What are the results for your system?
>>>
>>> (I'm using GCC 4.4.1 on an Intel Centrino laptop CPU.)
>>>
>> Test code used (10^10 calls, never re-allocates):
>>
>>           int i;
>>           unsigned char c;
>>
>>           svn_stringbuf_t* s = svn_stringbuf_create_ensure (255, pool);
> OK so you're avoiding any calls to the "re-alloc".  My tests included
> re-allocs, but were long enough strings that this wasn't much overhead.
> Nevertheless I'll avoid re-allocs so we can compare results fairly.
>
>>           for (i = 0; i < 50000000; ++i)
>>           {
>>               s->len = 0;
>>               for (c = 0; c < 200; ++c)
>>                   svn_stringbuf_appendbyte (s, c);
>>           }
>>
>>
>> XEON 55xx / Core i7, hyper-threading on, 3GHz peak
>> 64 bits Linux, GCC 4.3.3; ./configure --disable-debug
>>
>> (1)  10,7s; IPC = 2.1
>> (2)  8,11s; IPC = 1.4
>> (3)  2,64s; IPC = 2.4
>> (4)  2,43s; IPC = 2.3
Grml. After normalizing the numbers to 10^9 iterations,
I forgot to adjust the example code. Sorry!
>> (1) use appendbytes gives 9 instructions in main, 59 in appbytes
>> (2) handle count==1 directly in appendbytes; 9 inst. in main, 26 in appbytes
>> (3) appendbyte0 (same compiler output as the non-handtuned code);
>>       13 inst. in appbyte, 6 in main
>> (4) trunk@HEAD appendbyte; 11 inst. in appbyte, 6 in main
>>
>> Core2 2.4GHz, Win32, VS2010
>> (not using valgrind to count instructions here; also not testing (2))
>>
>> (1)  17,0s release, 20,2s debug
>> (3)  10,6s release, 12,2s debug
>> (4)  9,7s release, 13,0s debug
> With a test harness more similar to yours, and built with
> "--disable-debug -O2", here are my relative numbers (but only 1 million
> outer loops, compared to your 50 million):
>
> $ svn-c-test subr string 24
> Times for 1000000 x 200 bytes, in seconds
>    (1)appendbytes (3)appendbyte0 (4)trunk@HEAD (5)macro
> run:        7.03        2.06        1.37        0.53
> run:        6.62        1.76        1.55        0.53
> run:        6.53        1.67        1.44        0.53
> run:        6.54        1.60        1.44        0.53
> run:        6.52        1.84        1.37        0.53
> min:        6.52        1.60        1.37        0.53
>
> This agrees with what you found: the hand-tuned code gives a 10%
> improvement.
>
> This also shows another variant, writing the function as a macro,
> reported in the column "(5)macro".  This gives a further two-and-a-half
> times speed-up in this test.
>
The macro variant (basically inlining) would take 2.65s for 10^9
total iterations. That is more or less what I got on Core i7 for the
non-inlined variant. My theory:

* function call & return are cheap jumps
* the code is small enough for the loop stream decoder to kick in
* your platform ABI uses stack frames and probably passes parameters
   on the stack (instead of registers).

The latter is also the case for 32 bit Windows and gives similar
results (9.7s vs. your 6.8s).
> [...]
>> * your numbers seem very large in comparison
>>     ->  probably something else going on
> Can you use my test code or show me yours?  Mine has the harness in
> "string-test.c" and the code under test in libsvn_subr - see attachment
> - and the default build on my platform loads libsvn_subr as a shared
> library.  Maybe yours is completely different.
I think our test results are consistent now. For simple
tests like that, I usually misuse svn.exe or so and put
the code into main().
> Other observations:
>
> - Some callers that use very tight loops are in fact simply copying a
> string, up as far as some particular end marker.  They may benefit from
> being re-written to first determine how much they want to copy and then
> call svn_stringbuf_appendbytes().
I will browse through the code and see where it can
be reworked easily.
> - The stringbuf type initially allocates a 1-byte space, and doubles
> this every time it re-allocates.  Could we not allocate at least a
> few bytes (4? 8?) as a minimum, without actually consuming more memory
> in practice?  That would avoid two or three re-allocs in some usage
> patterns.
True. Two points from my side:

* APR pools allocate memory in 8 byte chunks.
   -> allocating just one byte does not save any memory
   over allocating 8
   -> will post a patch for that
* I added a svn_stringbuf_create_empty() function that
   creates a truly empty string (zero capacity) in r985697
   and used it in r985673. That is useful if the actual required
   capacity will be known only later - but before data is being
   added to the string. Alas, these empty strings seem to
   break some assumptions and cannot be used freely
   in all suitable places. I tried that and got errors.

-- Stefan^2.

Re: Performance optimization - svn_stringbuf_appendbyte()

Posted by Julian Foad <ju...@wandisco.com>.
On Sun, 2010-10-10, Stefan Fuhrmann wrote:
> On 07.10.2010 16:07, Julian Foad wrote:
> >> New Revision: 997203
> >>
> >> URL: http://svn.apache.org/viewvc?rev=997203&view=rev
> >> Log:
> >> Merge r985037, r985046, r995507 and r995603 from the performance branch.
> >>
> >> These changes introduce the svn_stringbuf_appendbyte() function, which has
> >> significantly less overhead than svn_stringbuf_appendbytes(), and can be
> >> used in a number of places within our codebase.
> > Hi Stefan2.
> >
> > I have been wondering about the actual benefit of such tightly
> > hand-optimized code.  Today I got around to giving it a quick spin.
> >
> > In my first test, it made no difference whatsoever, because the
> > optimized code is never executed.  The recent merge r1002898 of r1001413
> > from the performance branch introduced a bug:
> >
> > -  if (str->blocksize > old_len + 1)
> > +  if (str->blocksize < old_len + 1)
> WTF how did that happen?
> Well, that's what reviews are for ;)
> > which totally disables the optimized code path.
> >
> > Fixed in r1005437.
> Thanks.
> > That solved, a loop doing a million simple appendbyte calls runs more
> > than twice as fast as calling appendbytes(..., 1).  That's fantastic.
> >
> > But what about that hand-optimization?  I wrote a simple version of the
> > function, and called it 'appendbyte0':
> >
> > svn_stringbuf_appendbyte0(svn_stringbuf_t *str, char byte)
> > {
> >    if (str->blocksize > str->len + 1)
> >      {
> >        str->data[str->len++] = byte;
> >        str->data[str->len] = '\0';
> >      }
> >    else
> >      svn_stringbuf_appendbytes(str, &byte, 1);
> > }
> >
> > Here are the results (see full patch attached):
> >
> > Times:  appendbytes appendbyte0 appendbyte  (in ms)
> > run:          89          31          34
> > run:          88          30          35
> > run:          88          31          34
> > run:          88          30          34
> > run:          88          31          34
> > min:          88          30          34
> >
> > This tells me that the hand-optimization is actually harmful and the
> > compiler does a 10% better job by itself.
> >
> > Have I made a mistake?
> My guess is that you might have made two mistakes actually.

Heh, OK.

> First, the number of operations was relatively low - everything
> in the low ms range could be due to secondary effects (and
> still be repeatable).

I can't think of any reason why.  I ran the whole benchmark lots of
times.  Occasionally some of the times were a big chunk longer due to
other system activity.  Normally they were stable.  I also ran it
several times with 10 million ops instead of 1 million, and the results
were exactly 10x longer with the same degree of variability.

> The most important problem would be compiler optimizations
> or lack thereof. Since the numbers are very large (50..100
> ticks per byte, depending on your CPU clock), I would assume
> that you used a normal debug build for testing. In that case,

You're right.  I'll try again, with "--disable-debug -O2".

> the actual number of C statements has a large impact on
> the execution time. See my results below for details.
> > What are the results for your system?
> >
> > (I'm using GCC 4.4.1 on an Intel Centrino laptop CPU.)
> >
> Test code used (10^10 calls, never re-allocates):
> 
>          int i;
>          unsigned char c;
> 
>          svn_stringbuf_t* s = svn_stringbuf_create_ensure (255, pool);

OK so you're avoiding any calls to the "re-alloc".  My tests included
re-allocs, but were long enough strings that this wasn't much overhead.
Nevertheless I'll avoid re-allocs so we can compare results fairly.

>          for (i = 0; i < 50000000; ++i)
>          {
>              s->len = 0;
>              for (c = 0; c < 200; ++c)
>                  svn_stringbuf_appendbyte (s, c);
>          }
> 
> 
> XEON 55xx / Core i7, hyper-threading on, 3GHz peak
> 64 bits Linux, GCC 4.3.3; ./configure --disable-debug
> 
> (1)  10,7s; IPC = 2.1
> (2)  8,11s; IPC = 1.4
> (3)  2,64s; IPC = 2.4
> (4)  2,43s; IPC = 2.3

> (1) use appendbytes gives 9 instructions in main, 59 in appbytes
> (2) handle count==1 directly in appendbytes; 9 inst. in main, 26 in appbytes
> (3) appendbyte0 (same compiler output as the non-handtuned code);
>      13 inst. in appbyte, 6 in main
> (4) trunk@HEAD appendbyte; 11 inst. in appbyte, 6 in main
> 
> Core2 2.4GHz, Win32, VS2010
> (not using valgrind to count instructions here; also not testing (2))
> 
> (1)  17,0s release, 20,2s debug
> (3)  10,6s release, 12,2s debug
> (4)  9,7s release, 13,0s debug

With a test harness more similar to yours, and built with
"--disable-debug -O2", here are my relative numbers (but only 1 million
outer loops, compared to your 50 million):

$ svn-c-test subr string 24
Times for 1000000 x 200 bytes, in seconds
  (1)appendbytes (3)appendbyte0 (4)trunk@HEAD (5)macro
run:        7.03        2.06        1.37        0.53
run:        6.62        1.76        1.55        0.53
run:        6.53        1.67        1.44        0.53
run:        6.54        1.60        1.44        0.53
run:        6.52        1.84        1.37        0.53
min:        6.52        1.60        1.37        0.53

This agrees with what you found: the hand-tuned code gives a 10%
improvement.

This also shows another variant, writing the function as a macro,
reported in the column "(5)macro".  This gives a further two-and-a-half
times speed-up in this test.


[...]
> * your numbers seem very large in comparison
>    -> probably something else going on

Can you use my test code or show me yours?  Mine has the harness in
"string-test.c" and the code under test in libsvn_subr - see attachment
- and the default build on my platform loads libsvn_subr as a shared
library.  Maybe yours is completely different.


Other observations:

- Some callers that use very tight loops are in fact simply copying a
string, up as far as some particular end marker.  They may benefit from
being re-written to first determine how much they want to copy and then
call svn_stringbuf_appendbytes().

- The stringbuf type initially allocates a 1-byte space, and doubles
this every time it re-allocates.  Could we not allocate at least a
few bytes (4? 8?) as a minimum, without actually consuming more memory
in practice?  That would avoid two or three re-allocs in some usage
patterns.
 
- Julian


Re: Performance optimization - svn_stringbuf_appendbyte()

Posted by Stefan Fuhrmann <st...@alice-dsl.de>.
  On 07.10.2010 16:07, Julian Foad wrote:
>> New Revision: 997203
>>
>> URL: http://svn.apache.org/viewvc?rev=997203&view=rev
>> Log:
>> Merge r985037, r985046, r995507 and r995603 from the performance branch.
>>
>> These changes introduce the svn_stringbuf_appendbyte() function, which has
>> significantly less overhead than svn_stringbuf_appendbytes(), and can be
>> used in a number of places within our codebase.
> Hi Stefan2.
>
> I have been wondering about the actual benefit of such tightly
> hand-optimized code.  Today I got around to giving it a quick spin.
>
> In my first test, it made no difference whatsoever, because the
> optimized code is never executed.  The recent merge r1002898 of r1001413
> from the performance branch introduced a bug:
>
> -  if (str->blocksize > old_len + 1)
> +  if (str->blocksize < old_len + 1)
WTF how did that happen?
Well, that's what reviews are for ;)
> which totally disables the optimized code path.
>
> Fixed in r1005437.
Thanks.
> That solved, a loop doing a million simple appendbyte calls runs more
> than twice as fast as calling appendbytes(..., 1).  That's fantastic.
>
> But what about that hand-optimization?  I wrote a simple version of the
> function, and called it 'appendbyte0':
>
> svn_stringbuf_appendbyte0(svn_stringbuf_t *str, char byte)
> {
>    if (str->blocksize > str->len + 1)
>      {
>        str->data[str->len++] = byte;
>        str->data[str->len] = '\0';
>      }
>    else
>      svn_stringbuf_appendbytes(str, &byte, 1);
> }
>
> Here are the results (see full patch attached):
>
> Times:  appendbytes appendbyte0 appendbyte  (in ms)
> run:          89          31          34
> run:          88          30          35
> run:          88          31          34
> run:          88          30          34
> run:          88          31          34
> min:          88          30          34
>
> This tells me that the hand-optimization is actually harmful and the
> compiler does a 10% better job by itself.
>
> Have I made a mistake?
My guess is that you might have made two mistakes actually.
First, the number of operations was relatively low - everything
in the low ms range could be due to secondary effects (and
still be repeatable).

The most important problem would be compiler optimizations
or lack thereof. Since the numbers are very large (50..100
ticks per byte, depending on your CPU clock), I would assume
that you used a normal debug build for testing. In that case,
the actual number of C statements has a large impact on
the execution time. See my results below for details.
> What are the results for your system?
>
> (I'm using GCC 4.4.1 on an Intel Centrino laptop CPU.)
>
Test code used (10^10 calls, never re-allocates):

         int i;
         unsigned char c;

         svn_stringbuf_t* s = svn_stringbuf_create_ensure (255, pool);
         for (i = 0; i < 50000000; ++i)
         {
             s->len = 0;
             for (c = 0; c < 200; ++c)
                 svn_stringbuf_appendbyte (s, c);
         }


XEON 55xx / Core i7, hyper-threading on, 3GHz peak
64 bits Linux, GCC 4.3.3; ./configure --disable-debug

(1)  10,7s; IPC = 2.1
(2)  8,11s; IPC = 1.4
(3)  2,64s; IPC = 2.4
(4)  2,43s; IPC = 2.3

(1) use appendbytes gives 9 instructions in main, 59 in appbytes
(2) handle count==1 directly in appendbytes; 9 inst. in main, 26 in appbytes
(3) appendbyte0 (same compiler output as the non-handtuned code);
     13 inst. in appbyte, 6 in main
(4) trunk@HEAD appendbyte; 11 inst. in appbyte, 6 in main

Core2 2.4GHz, Win32, VS2010
(not using valgrind to count instructions here; also not testing (2))

(1)  17,0s release, 20,2s debug
(3)  10,6s release, 12,2s debug
(4)  9,7s release, 13,0s debug

That shows the following:
* for debug code, appendbyte0 is faster
   (because debug code stores all variables on stack and the
   hand-tuned code introduced some additional ones)
* the ABI has significant effect (Win32: all parameters put on stack;
   Lin64: all fit into registers)
* shorter API (function signature) speeds up the *caller* significantly
* as long as small functions don't need to r/w to stack they can
   be very fast (64 bit gives more registers, parameters passed in
   registers eliminate stack frame ops, simple functions don't need
   more than the available scratch registers)
* your numbers seem very large in comparison
   -> probably something else going on

-- Stefan^2