Posted to dev@subversion.apache.org by Stefan Fuhrmann <st...@alice-dsl.de> on 2010/05/01 16:38:14 UTC

[PATCH v2] Saving a few cycles, part 2/3

Hi devs,

I reworked the patch set according to the feedback I got
here on this list. This is what changed in the first part:

* split original part 1/2 into 1/3 (append char) and 2/3
 (apply text delta).
* move memcpy optimization to a new fast_memcpy
* add rationales to the commentary

-- Stefan^2.

[[[
svn_txdelta_apply_instructions is relatively slow for long
instruction sequences copying small pieces of data. This
seems to be particularly visible in non-packed FSFS
repositories.

This patch extracts invariants out from the 'for' loop,
optimizes overlapping copies as well as small data copy
runtime. As a result, the performance is about doubled
and the total svn export runtime gets reduced by 2 %.

* subversion/libsvn_delta/text_delta.c
  (fast_memcpy): new function
  (svn_txdelta_apply_instructions): reduce loop overhead,
  use fast_memcpy, optimize overlapped copy code

patch by stefanfuhrmann < at > alice-dsl.de
]]]

Re: [PATCH v2] Saving a few cycles, part 2/3

Posted by Greg Stein <gs...@gmail.com>.
On Thu, May 13, 2010 at 20:00, Julian Foad <ju...@wandisco.com> wrote:
>...
> The fast_memcpy() optimization raised an eyebrow: one developer (Greg
> Stein) mentioned to me that optimizing standard memcpy felt wrong in
> principle.  I don't know what was behind his feeling.  I guess one kind
> of concern that could be considered against it is that newer compilers
> and hardware in a few years could possibly negate the optimization, yet

Not even in a few years. I figured it to be "now".

But with that said, Stefan makes a very good point: the memcpy() that
compilers will use is optimized for a different workload than the one
we have here. Thus, it seems fair to craft one tuned for *this*
workload.

>...

The couple things that stood out in my quick look at the patch have
already been responded to by Julian, so I'll just wait for the thing
to be committed and do a review then. It seems that, with Julian's
input and re-roll of the patch, it is ready for commit.

Cheers,
-g

Re: [PATCH v2] Saving a few cycles, part 2/3

Posted by Greg Stein <gs...@gmail.com>.
On Tue, May 18, 2010 at 07:25, Stefan Fuhrmann
<st...@alice-dsl.de> wrote:
>...
> It takes a hundred arrows to shoot an elephant. Although some may
> miss, they all need to be fired. Currently we inspect every single
> arrow, discuss its potential penetration depth, the smoothness of the
> shaft and the likelihood of hitting another arrow instead of the elephant.
>
> Let's hope that, in the meantime, the elephant has not died of old age ;)

LOL... I think you make a fair point. I believe it is because we tend
to prefer clarity over "tricky" code, even if that costs a bit of
performance.

... and that is what you are now rectifying. :-)

(let's just hope that we can get the perf, and retain some measure of clarity)

Cheers,
-g

Re: [PATCH v2] Saving a few cycles, part 2/3

Posted by Stefan Fuhrmann <st...@alice-dsl.de>.
Julian Foad wrote:
> Greg - do you want to respond re. 'fast_memcpy()'?
>
> On Sat, 2010-05-01, Stefan Fuhrmann wrote:  
>> [[[
>> svn_txdelta_apply_instructions is relatively slow for long
>> instruction sequences copying small pieces of data. This
>> seems to be particularly visible in non-packed FSFS
>> repositories.
>>
>> This patch extracts invariants out from the 'for' loop,
>> optimizes overlapping copies as well as small data copy
>> runtime. As a result, the performance is about doubled
>> and the total svn export runtime gets reduced by 2 %.
>>
>> * subversion/libsvn_delta/text_delta.c
>>   (fast_memcpy): new function
>>   (svn_txdelta_apply_instructions): reduce loop overhead,
>>   use fast_memcpy, optimize overlapped copy code
>>
>> patch by stefanfuhrmann < at > alice-dsl.de
>> ]]]
>>     
>
> Hi Stefan.
>
> I think this patch is mostly fine.  I have a few minor comments (in
> line, below) which I want to get out to you.
>
> The fast_memcpy() optimization raised an eyebrow: one developer (Greg
> Stein) mentioned to me that optimizing standard memcpy felt wrong in
> principle.  I don't know what was behind his feeling.  I guess one kind
> of concern that could be considered against it is that newer compilers
> and hardware in a few years could possibly negate the optimization, yet
> that code is unlikely to be re-assessed in the same time frame, so it
> could end up being 'stale' code.  Is that a big concern?  No, a small
> one.  I totally understand that you get a significant improvement in
> practice, right now, at least in your tests, and perhaps in other
> environments.
>   
As I tried to explain in http://svn.haxx.se/dev/archive-2010-04/0596.shtml,
memcpy is very hard to optimize for *all* kinds of possible uses.
MS' implementation is a 600-line assembly fun read, plus some
special-case code that is not public. So much for the prospect
of handling variable-sized copies in intrinsic code.

The main use-cases for memcpy are: string or string-like operations
(-> minimize latency for copies of 10 .. 100 bytes) and copying large
blobs (-> optimize throughput in the "else" case).

Very short copies suffer from call latency. A compiler might generate
code similar to fast_memcpy, but that would hurt the latency, and to a
lesser degree the throughput, of string operations.

If future compilers should go to great lengths to solve that problem,
or shift their priorities, fast_memcpy would still be roughly
on par because
* it handles *our* critical case very well
* the special-case checks in memcpy for very short buffers would be
 100% predictable -> close to zero overhead for bypassing the
 duplicate implementation
> My thought is that if we have evidence of a benefit and no current or
> predicted harm, we should take it.  That code is at least short and
> clear.
>
>   
Like much of the svndelta code, this code is mainly used for
non-packed repositories. Packed repositories seem to have
much larger delta chunks. Profiling data I used during optimization:

a) flat (1.4 format) repository, total runtime ~4850 samples
- non-optimized: 316 samples total in function, thereof 107 in memcpy
- optimized: 245 samples total in function, thereof 64 in memcpy
b) packed 1.6 repository, total runtime ~3700 samples
- non-optimized: 100 samples total in function, thereof 95 in memcpy
- optimized: 90 samples total in function, thereof 85 in memcpy

Please note that individual sample counts will fluctuate by up to 5
percent or 20 samples (whichever is larger). So: 1.5% savings
in case a), no significant change in case b).

The thing with the whole optimization business is that SVN is already
quite good algorithmically (details to be laid out when I post the
performance analysis). However, there are still significant gains
to be made, just not with a single shot.

It takes a hundred arrows to shoot an elephant. Although some may
miss, they all need to be fired. Currently we inspect every single
arrow, discuss its potential penetration depth, the smoothness of the
shaft and the likelihood of hitting another arrow instead of the elephant.

Let's hope that, in the meantime, the elephant has not died of old age ;)

> If anyone else has thoughts on that, let's hear them.
>
> (It occurs to me now to wonder whether the 'delta' format is really
> working efficiently if it is requesting a high frequency of copies as
> small as 1 or 2 bytes.  That just seems odd.  I haven't looked into it.)
>   
I haven't created a histogram for these calls but from my work
on zlib, I would expect 3 bytes to be the lower limit.
> plain text document attachment (PerformanceApplyDeltas.patch):  
>> Index: subversion/libsvn_delta/text_delta.c
>> ===================================================================
>> --- subversion/libsvn_delta/text_delta.c    (revision 939976)
>> +++ subversion/libsvn_delta/text_delta.c    (working copy)
>> @@ -564,61 +591,109 @@
>>    return SVN_NO_ERROR;
>>  }
>>  
>> +/* memcpy() is hard to tune for a wide range of buffer lengths.
>> + * Therefore, it is often tuned for high throughput on large
>> + * buffers and relatively low latency for mid-sized buffers
>> + * (tens of bytes). However, the overhead for very small buffers
>> + * (<10 bytes) is still high. Even passing the parameters, for
>> + * instance, may take as long as copying 3 bytes.
>> + *
>> + * Because short copy sequences seem to be a common case in
>> + * non-packed FSFS repositories, we copy them directly. Larger
>> + * buffer sizes aren't hurt measurably by the extra 'if' clause.
>> + */
>>     
>
> Please document that it returns (target + len), since that is different
> from the behaviour of standard memcpy().
>
>   
Will do - in particular as it is part of the optimization.
>> +static APR_INLINE char*
>> +fast_memcpy(char* target, const char* source, apr_size_t len)
>>     
>
> Style nit: We write pointers as "char *target" rather than "char*
> target".
>   
o.k.
>  
>> +{
>> +  if (len > 7)
>> +    {
>> +      memcpy(target, source, len);
>> +      target += len;
>> +    }
>> +  else
>> +    {
>> +      /* memcpy is not exactly fast for small block sizes.
>> +         Since they are common, let's run optimized code for them. */
>> +      const char* end = source + len;
>> +      for (; source != end; source++)
>> +        *(target++) = *source;
>> +    }
>>  
>> +  return target;
>> +}
>> +
>>  void
>>  svn_txdelta_apply_instructions(svn_txdelta_window_t *window,
>>                                 const char *sbuf, char *tbuf,
>>                                 apr_size_t *tlen)
>>  {
>> -  const svn_txdelta_op_t *op;
>> -  apr_size_t i, j, tpos = 0;
>> +  /* The main loop is run quite often.
>> +   * Pre-calculate as much data as possible.
>> +   */
>> +  const svn_txdelta_op_t *op, *last_op = window->ops + window->num_ops;
>> +  apr_size_t to_fill = *tlen < window->tview_len ? *tlen : window->tview_len;
>> +  apr_size_t left = to_fill;
>> +  const char* end, *source;
>> +  char *target = tbuf;
>>  
>> -  for (op = window->ops; op < window->ops + window->num_ops; op++)
>> +  for (op = window->ops; left > 0; op++)
>>     
>
> I haven't studied this enough to see why "left > 0" is equivalent to "op
> < window->ops + window->num_ops".  Can't we use "op < last_op" to make
> it obvious?  (Generally a loop condition is more obvious when it refers
> to the loop counter.)
>
>   
I will add a longer comment for that in the code.
>> {
>> -      const apr_size_t buf_len = (op->length < *tlen - tpos
>> -                                  ? op->length : *tlen - tpos);
>> +      const apr_size_t buf_len = op->length < left ? op->length : left;
>> +      left -= buf_len;
>>  
>>        /* Check some invariants common to all instructions.  */
>> -      assert(tpos + op->length <= window->tview_len);
>> +      assert(target - tbuf + op->length <= window->tview_len);
>>  
>>        switch (op->action_code)
>>          {
>>          case svn_txdelta_source:
>>            /* Copy from source area.  */
>>            assert(op->offset + op->length <= window->sview_len);
>> -          memcpy(tbuf + tpos, sbuf + op->offset, buf_len);
>> +          target = fast_memcpy(target, sbuf + op->offset, buf_len);
>>            break;
>>  
>>          case svn_txdelta_target:
>> -          /* Copy from target area.  Don't use memcpy() since its
>> -             semantics aren't guaranteed for overlapping memory areas,
>> -             and target copies are allowed to overlap to generate
>> -             repeated data.  */
>> -          assert(op->offset < tpos);
>> -          for (i = op->offset, j = tpos; i < op->offset + buf_len; i++)
>> -            tbuf[j++] = tbuf[i];
>> +          /* Copy from target area.  We can't use memcpy() or the like
>> +             since we need a specific semantics for overlapping copies:
>> +             they must result in repeating patterns.
>> +           */
>> +          assert(op->offset < target - *tbuf);
>> +          source = tbuf + op->offset;
>> +          end = tbuf + op->offset + buf_len;
>> +
>> +          /* Copy in larger chunks if source and target don't overlap
>> +           * closer than the size of the chunks (or don't overlap at all).
>> +           * Use the natural machine word as chunk size
>> +           * (for some architectures size_t is even a bit larger).
>> +           */
>> +          if (end + sizeof(apr_size_t) <= target)
>> +            for (; source + sizeof (apr_size_t) <= end; 
>> +                   source += sizeof (apr_size_t), 
>> +                   target += sizeof (apr_size_t))
>> +              *(apr_size_t*)(target) = *(apr_size_t*)(source);
>> +
>> +          /* Copy trailing bytes */
>> +          for (; source != end; source++)
>> +            *(target++) = *source;
>>     
>
> I would prefer to see this "overlapping copy" code factored out as a
> subroutine, analogous to fast_memcpy.
>   
Can do.
>  
>> break;
>>  
>>          case svn_txdelta_new:
>>            /* Copy from window new area.  */
>>            assert(op->offset + op->length <= window->new_data->len);
>> -          memcpy(tbuf + tpos,
>> -                 window->new_data->data + op->offset,
>> -                 buf_len);
>> +          target = fast_memcpy(target, 
>> +                               window->new_data->data + op->offset,
>> +                               buf_len);
>>            break;
>>  
>>          default:
>>            assert(!"Invalid delta instruction code");
>>          }
>> -
>> -      tpos += op->length;
>> -      if (tpos >= *tlen)
>> -        return;                 /* The buffer is full. */
>>      }
>>  
>> -  /* Check that we produced the right amount of data.  */
>> -  assert(tpos == window->tview_len);
>> -  *tlen = tpos;
>> +  /* We always fill the whole target buffer unless we stumble across 
>>     
>
> That comment doesn't look right.  To say "the whole target buffer"
> implies TBUF with TLEN as passed in to this function; but we are writing
> a new value to *TLEN because the actual amount of data produced may be
> less, so that's not always the "whole" buffer.
>
> Instinctively it seems to me that the assertion (with its original
> comment) should be kept, and should be adjusted to suit the new
> variables, though I haven't totally thought it through.
>
>   
Again, needs more commentary.
>> +   * an invalid instruction (caught by debug assertions only).
>> +   */
>> +  *tlen = to_fill;
>>  }
>>     
>
>
> - Julian
>
>   
Thanks for the review.

-- Stefan^2.


Re: [PATCH v2] Saving a few cycles, part 2/3

Posted by Julian Foad <ju...@wandisco.com>.
Greg - do you want to respond re. 'fast_memcpy()'?

On Sat, 2010-05-01, Stefan Fuhrmann wrote: 
> [[[
> svn_txdelta_apply_instructions is relatively slow for long
> instruction sequences copying small pieces of data. This
> seems to be particularly visible in non-packed FSFS
> repositories.
> 
> This patch extracts invariants out from the 'for' loop,
> optimizes overlapping copies as well as small data copy
> runtime. As a result, the performance is about doubled
> and the total svn export runtime gets reduced by 2 %.
> 
> * subversion/libsvn_delta/text_delta.c
>   (fast_memcpy): new function
>   (svn_txdelta_apply_instructions): reduce loop overhead,
>   use fast_memcpy, optimize overlapped copy code
> 
> patch by stefanfuhrmann < at > alice-dsl.de
> ]]]

Hi Stefan.

I think this patch is mostly fine.  I have a few minor comments (in
line, below) which I want to get out to you.

The fast_memcpy() optimization raised an eyebrow: one developer (Greg
Stein) mentioned to me that optimizing standard memcpy felt wrong in
principle.  I don't know what was behind his feeling.  I guess one kind
of concern that could be considered against it is that newer compilers
and hardware in a few years could possibly negate the optimization, yet
that code is unlikely to be re-assessed in the same time frame, so it
could end up being 'stale' code.  Is that a big concern?  No, a small
one.  I totally understand that you get a significant improvement in
practice, right now, at least in your tests, and perhaps in other
environments.

My thought is that if we have evidence of a benefit and no current or
predicted harm, we should take it.  That code is at least short and
clear.

If anyone else has thoughts on that, let's hear them.

(It occurs to me now to wonder whether the 'delta' format is really
working efficiently if it is requesting a high frequency of copies as
small as 1 or 2 bytes.  That just seems odd.  I haven't looked into it.)


plain text document attachment (PerformanceApplyDeltas.patch): 
> Index: subversion/libsvn_delta/text_delta.c
> ===================================================================
> --- subversion/libsvn_delta/text_delta.c	(revision 939976)
> +++ subversion/libsvn_delta/text_delta.c	(working copy)
> @@ -564,61 +591,109 @@
>    return SVN_NO_ERROR;
>  }
>  
> +/* memcpy() is hard to tune for a wide range of buffer lengths.
> + * Therefore, it is often tuned for high throughput on large
> + * buffers and relatively low latency for mid-sized buffers
> + * (tens of bytes). However, the overhead for very small buffers
> + * (<10 bytes) is still high. Even passing the parameters, for
> + * instance, may take as long as copying 3 bytes.
> + *
> + * Because short copy sequences seem to be a common case in
> + * non-packed FSFS repositories, we copy them directly. Larger
> + * buffer sizes aren't hurt measurably by the extra 'if' clause.
> + */

Please document that it returns (target + len), since that is different
from the behaviour of standard memcpy().

> +static APR_INLINE char*
> +fast_memcpy(char* target, const char* source, apr_size_t len)

Style nit: We write pointers as "char *target" rather than "char*
target".

> +{
> +  if (len > 7)
> +    {
> +      memcpy(target, source, len);
> +      target += len;
> +    }
> +  else
> +    {
> +      /* memcpy is not exactly fast for small block sizes.
> +         Since they are common, let's run optimized code for them. */
> +      const char* end = source + len;
> +      for (; source != end; source++)
> +        *(target++) = *source;
> +    }
>  
> +  return target;
> +}
> +
>  void
>  svn_txdelta_apply_instructions(svn_txdelta_window_t *window,
>                                 const char *sbuf, char *tbuf,
>                                 apr_size_t *tlen)
>  {
> -  const svn_txdelta_op_t *op;
> -  apr_size_t i, j, tpos = 0;
> +  /* The main loop is run quite often.
> +   * Pre-calculate as much data as possible.
> +   */
> +  const svn_txdelta_op_t *op, *last_op = window->ops + window->num_ops;
> +  apr_size_t to_fill = *tlen < window->tview_len ? *tlen : window->tview_len;
> +  apr_size_t left = to_fill;
> +  const char* end, *source;
> +  char *target = tbuf;
>  
> -  for (op = window->ops; op < window->ops + window->num_ops; op++)
> +  for (op = window->ops; left > 0; op++)

I haven't studied this enough to see why "left > 0" is equivalent to "op
< window->ops + window->num_ops".  Can't we use "op < last_op" to make
it obvious?  (Generally a loop condition is more obvious when it refers
to the loop counter.)

> {
> -      const apr_size_t buf_len = (op->length < *tlen - tpos
> -                                  ? op->length : *tlen - tpos);
> +      const apr_size_t buf_len = op->length < left ? op->length : left;
> +      left -= buf_len;
>  
>        /* Check some invariants common to all instructions.  */
> -      assert(tpos + op->length <= window->tview_len);
> +      assert(target - tbuf + op->length <= window->tview_len);
>  
>        switch (op->action_code)
>          {
>          case svn_txdelta_source:
>            /* Copy from source area.  */
>            assert(op->offset + op->length <= window->sview_len);
> -          memcpy(tbuf + tpos, sbuf + op->offset, buf_len);
> +          target = fast_memcpy(target, sbuf + op->offset, buf_len);
>            break;
>  
>          case svn_txdelta_target:
> -          /* Copy from target area.  Don't use memcpy() since its
> -             semantics aren't guaranteed for overlapping memory areas,
> -             and target copies are allowed to overlap to generate
> -             repeated data.  */
> -          assert(op->offset < tpos);
> -          for (i = op->offset, j = tpos; i < op->offset + buf_len; i++)
> -            tbuf[j++] = tbuf[i];
> +          /* Copy from target area.  We can't use memcpy() or the like 
> +             since we need a specific semantics for overlapping copies: 
> +             they must result in repeating patterns.
> +           */
> +          assert(op->offset < target - *tbuf);
> +          source = tbuf + op->offset;
> +          end = tbuf + op->offset + buf_len;
> +
> +          /* Copy in larger chunks if source and target don't overlap
> +           * closer than the size of the chunks (or don't overlap at all).
> +           * Use the natural machine word as chunk size
> +           * (for some architectures size_t is even a bit larger).
> +           */
> +          if (end + sizeof(apr_size_t) <= target)
> +            for (; source + sizeof (apr_size_t) <= end; 
> +                   source += sizeof (apr_size_t), 
> +                   target += sizeof (apr_size_t))
> +              *(apr_size_t*)(target) = *(apr_size_t*)(source);
> +
> +          /* Copy trailing bytes */
> +          for (; source != end; source++)
> +            *(target++) = *source;

I would prefer to see this "overlapping copy" code factored out as a
subroutine, analogous to fast_memcpy.

> break;
>  
>          case svn_txdelta_new:
>            /* Copy from window new area.  */
>            assert(op->offset + op->length <= window->new_data->len);
> -          memcpy(tbuf + tpos,
> -                 window->new_data->data + op->offset,
> -                 buf_len);
> +          target = fast_memcpy(target, 
> +                               window->new_data->data + op->offset,
> +                               buf_len);
>            break;
>  
>          default:
>            assert(!"Invalid delta instruction code");
>          }
> -
> -      tpos += op->length;
> -      if (tpos >= *tlen)
> -        return;                 /* The buffer is full. */
>      }
>  
> -  /* Check that we produced the right amount of data.  */
> -  assert(tpos == window->tview_len);
> -  *tlen = tpos;
> +  /* We always fill the whole target buffer unless we stumble across 

That comment doesn't look right.  To say "the whole target buffer"
implies TBUF with TLEN as passed in to this function; but we are writing
a new value to *TLEN because the actual amount of data produced may be
less, so that's not always the "whole" buffer.

Instinctively it seems to me that the assertion (with its original
comment) should be kept, and should be adjusted to suit the new
variables, though I haven't totally thought it through.


> +   * an invalid instruction (caught by debug assertions only).
> +   */
> +  *tlen = to_fill;
>  }


- Julian