Posted to dev@subversion.apache.org by Stefan Fuhrmann <st...@alice-dsl.de> on 2010/04/25 14:51:37 UTC

[PATCH] Saving a few cycles, part 1/2

Hi devs,

further reducing my backlog of patches sitting in my
working copy, this and the next patch optimize code
locally - shaving off cycles here and there. The net
effect is somewhere between 3 and 10 percent
for repository access (ls, export, etc.).

In this patch, I eliminated calls to memcpy for small
copies as they are particularly expensive in the MS CRT.

-- Stefan^2.

[[[
Eliminate memcpy from critical paths during reading
data from the repository.

* subversion/libsvn_delta/text_delta.c
  (svn_txdelta_apply_instructions): replace memcpy
  for small amounts of data; optimize overlapping
  copies; optimize 'buffer full' detection

* subversion/libsvn_subr/svn_string.c
  (svn_stringbuf_appendbytes): replace memcpy
  with specialized code when adding single chars.
]]]
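
To illustrate the svn_string.c part of the change: a single-byte fast path
could look roughly like the sketch below. The function name, the guard and
the fallback are illustrative assumptions, not the actual patch code.

/* Sketch only -- not the actual patch. Append one byte to STR without
   going through memcpy when the buffer already has room for it. */
#include "svn_string.h"

static void
stringbuf_append_byte(svn_stringbuf_t *str, char byte)
{
  if (str->blocksize > str->len + 1)
    {
      /* Common case: enough room left, so just store the byte and
         re-terminate the string; no memcpy, no reallocation. */
      str->data[str->len] = byte;
      str->len++;
      str->data[str->len] = '\0';
    }
  else
    {
      /* Rare case: let the generic code handle resizing the buffer. */
      svn_stringbuf_appendbytes(str, &byte, 1);
    }
}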


RE: [PATCH] Saving a few cycles, part 1/2

Posted by Bert Huijben <be...@qqmail.nl>.

> -----Original Message-----
> From: Stefan Fuhrmann [mailto:stefanfuhrmann@alice-dsl.de]
> Sent: Tuesday 27 April 2010 1:10
> To: Bert Huijben; dev@subversion.apache.org
> Subject: Re: [PATCH] Saving a few cycles, part 1/2
> 
> Bert Huijben wrote:
> >
> >> -----Original Message-----
> >> From: Stefan Fuhrmann [mailto:stefanfuhrmann@alice-dsl.de]
> >> In this patch, I eliminated calls to memcpy for small copies as they
> >> are particularly expensive in the MS CRT.
> >>
> >
> > Which CRT did you use for these measurements? (2005, 2008, 2010, Debug
> > vs Release and DLL vs Static?). Which compiler version? (Standard/Express
> > or Professional+). (I assume you use the normal Subversion build using
> > .sln files and not the TortoiseSVN scripts? Did you use the shared
> > library builds or a static build)?
> >
> VSTS2008 Developer Edition. Release build (am I an Amateur?!)
> TSVN build scripts which set /Ox (global opt, intrinsics, omit frame
> pointers, ...)
> > Did you try enabling the intrinsics for this method instead of using a
> > handcoded copy?
> >
> <mode="educational prick">
> Yes, but it does not help in this case: memset will use intrinsics
> only for short (<48 bytes on x86) _fixed-size_  buffers. memcpy
> will use intrinsics for _fixed-size_ buffers only, but seemingly with
> no size limit.

But did you try a non-shared library build?

If you use the C runtime as a shared library, things like using fastcall
instead of __cdecl or full program optimization don't matter, as you don't
change msvcr90.dll (or a later version) in your build. The overhead of
calling a function in a DLL is probably bigger than the savings you are
trying to achieve by handcoding your memcpy().

In your first mail you said "In this patch, I eliminated calls to memcpy
for small copies as they are particularly expensive in the MS CRT."

Did you compare it to other toolchains?
And did you compare it to a completely static build that doesn't reference
msvcr90.dll?

Were these functions on other toolchains compiled into the binary, or also
in an external DLL?

When comparing 7-byte buffer copies, things like doing an indirect call to
a library function have a much bigger impact than shaving a few assembler
instructions off the loop itself, so maybe just passing /MT instead of /MD
to the compiler makes the same difference. (It will certainly help with
full program optimization.)
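
One rough way to check this outside of the Subversion build is a
micro-benchmark along the lines sketched below; the loop count, buffer
sizes and timing method are illustrative only, and results will differ
between /MD and /MT release builds.

/* Illustrative micro-benchmark sketch: variable-length 7-byte copies,
   once via memcpy() and once via a hand-coded byte loop. Build with
   release settings (/O2), once as /MD and once as /MT, and compare. */
#include <stdio.h>
#include <string.h>
#include <time.h>

#define ITERATIONS 50000000L

int main(void)
{
  char src[16] = "abcdefghijklmno";
  char dst[16] = { 0 };
  volatile size_t len = 7;  /* volatile: keep the size a run-time value so
                               the fixed-size memcpy intrinsic cannot kick in */
  volatile char sink = 0;   /* volatile: keep the loops from being removed */
  clock_t start;
  long i;

  start = clock();
  for (i = 0; i < ITERATIONS; i++)
    {
      size_t n = len;
      memcpy(dst, src, n);               /* library (or DLL) call */
      sink = dst[6];
    }
  printf("memcpy:    %.0f ms\n", (clock() - start) * 1000.0 / CLOCKS_PER_SEC);

  start = clock();
  for (i = 0; i < ITERATIONS; i++)
    {
      size_t n = len;                    /* read the run-time size once */
      size_t j;
      for (j = 0; j < n; j++)            /* hand-coded byte loop */
        dst[j] = src[j];
      sink = dst[6];
    }
  printf("byte loop: %.0f ms\n", (clock() - start) * 1000.0 / CLOCKS_PER_SEC);

  (void)sink;
  return 0;
}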

Looking at the TortoiseSVN build and my local set of binaries, I see that
most of its libraries use MSVCR90.DLL, and that it at least calls memcpy()
from that DLL in a few code paths.

	Bert

Re: [PATCH] Saving a few cycles, part 1/2

Posted by Stefan Fuhrmann <st...@alice-dsl.de>.
Bert Huijben wrote:
>   
>> -----Original Message-----
>> From: Stefan Fuhrmann [mailto:stefanfuhrmann@alice-dsl.de]
>> In this patch, I eliminated calls to memcpy for small copies as they are
>> particularly expensive in the MS CRT.
>>     
>
> Which CRT did you use for these measurements? (2005, 2008, 2010, Debug vs
> Release and DLL vs Static?). Which compiler version? (Standard/Express or
> Professional+). (I assume you use the normal Subversion build using .sln
> files and not the TortoiseSVN scripts? Did you use the shared library builds
> or a static build)?
>   
VSTS2008 Developer Edition. Release build (am I an Amateur?!)
TSVN build scripts which set /Ox (global opt, intrinsics, omit frame
pointers, ...)
> Did you try enabling the intrinsics for this method instead of using a
> handcoded copy?
>   
<mode="educational prick">
Yes, but it does not help in this case: memset will use intrinsics
only for short (<48 bytes on x86) _fixed-size_  buffers. memcpy
will use intrinsics for _fixed-size_ buffers only, but seemingly with
no size limit.
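
For illustration, the two shapes of call site are shown below; whether
either one is actually expanded inline depends on the compiler version and
flags, so the comments state the expectation rather than a guarantee.

#include <string.h>

void fixed_size_copy(char *dst, const char *src)
{
  /* Size known at compile time: a candidate for the memcpy intrinsic,
     i.e. the compiler may expand it inline instead of calling the CRT. */
  memcpy(dst, src, 16);
}

void variable_size_copy(char *dst, const char *src, size_t len)
{
  /* Size only known at run time: typically compiled as a real call
     into the CRT's memcpy. */
  memcpy(dst, src, len);
}
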
> I'm pretty sure that the modern variants enable inlined optimized assembly
> for memcpy in release mode (and certainly if you ask for that using the
> right #pragma), but enabling profiling on the linker or advanced debugging
> options will probably disable all that. I would be extremely surprised if
> this optimized assembler is measurably slower than the CRT on other OSs as
> copying memory is a pretty elementary operation.
> (But I'm pretty certain that the debug CRT with variable usage and memory
> usage tracking is slower than the other CRTs. But that is why we deliver
> applications in release mode)
>   
The problem is that non-trivial intrinsics are hard to implement
in a compiler; non-trivial here means dynamically optimizing
for short buffers. It's the variable length that makes it hard:

* a primitive copy loop takes 2 ticks / iteration (here: byte)
  + 1 tick setup time
* a "rep movsb" is only fast for <10 bytes but still slower
  than a primitive loop (otherwise: 50 ticks setup)
* "rep movsw/d/q" has > 10 bytes / tick but > 15 ticks
  amortized setup time.
* conditional jumps may be hard to predict and increase
  the load on the branch predictor

Without feedback optimization, a compiler cannot know
whether it should optimize for short, mid-sized or long
sequences. Generic code trying to accommodate all three
cases would be very long and would still perform poorly on
short sequences, mid-sized sequences, or both.

In our case, the system will have two typical load scenarios:

* hard-to-compress data or data with high compressibility
  -> long copy sequences -> use memcpy
* mildly compressible data -> many very short sequences
  -> manual copy

Which scenario applies to the current run is an intrinsic
property of the file being processed, i.e. the condition
that switches between them is highly predictable for most
loads.

Since the overhead of calling a __cdecl function is already
that of copying 3 values, we save considerable time by using
a manual byte-wise copy for short sequences.
> In my performance measurements the only case where I saw a memcpy turn up
> was in the apr pool functions. (This was while using a profiler that didn't
> touch the code (just timed measure points) and tracing on optimized release
> mode binaries)
>   
I use sampling profiling only (just as you did). The problem
with the MS CRT is that the individual functions are rarely
visible in the result. Instead, they get reported cumulatively
under MSVCRT.DLL. The reason behind this is that the
named function is a simple JMP instruction to the actual code.
</mode>

-- Stefan^2.

RE: [PATCH] Saving a few cycles, part 1/2

Posted by Bert Huijben <be...@vmoo.com>.

> -----Original Message-----
> From: Stefan Fuhrmann [mailto:stefanfuhrmann@alice-dsl.de]
> Sent: Sunday 25 April 2010 16:52
> To: dev@subversion.apache.org
> Subject: [PATCH] Saving a few cycles, part 1/2
> 
> Hi devs,
> 
> further reducing my backlog of patches sitting in my working copy, this
> and the next patch optimize code locally - shaving off cycles here and
> there. The net effect is somewhere between 3 and 10 percent for
> repository access (ls, export, etc.).
> 
> In this patch, I eliminated calls to memcpy for small copies as they are
> particularly expensive in the MS CRT.

Which CRT did you use for these measurements? (2005, 2008, 2010, Debug vs
Release and DLL vs Static?). Which compiler version? (Standard/Express or
Professional+). (I assume you use the normal Subversion build using .sln
files and not the TortoiseSVN scripts? Did you use the shared library builds
or a static build)?

Did you try enabling the intrinsics for this method instead of using a
handcoded copy?

I'm pretty sure that the modern variants enable inlined optimized assembly
for memcpy in release mode (and certainly if you ask for that using the
right #pragma), but enabling profiling on the linker or advanced debugging
options will probably disable all that. I would be extremely surprised if
this optimized assembler is measurably slower than the CRT on other OSs as
copying memory is a pretty elementary operation.
(But I'm pretty certain that the debug CRT with variable usage and memory
usage tracking is slower than the other CRTs. But that is why we deliver
applications in release mode)

In my performance measurements the only case where I saw a memcpy turn up
was in the apr pool functions. (This was while using a profiler that didn't
touch the code (just timed measure points) and tracing on optimized release
mode binaries)

	Bert
> 
> -- Stefan^2.
> 
> [[[
> Eliminate memcpy from critical paths during reading data from the
> repository.
> 
> * subversion/libsvn_delta/text_delta.c
>   (svn_txdelta_apply_instructions): replace memcpy
>   for small amounts of data; optimize overlapping
>   copies; optimize 'buffer full' detection
> 
> * subversion/libsvn_subr/svn_string.c
>   (svn_stringbuf_appendbytes): replace memcpy
>   with specialized code when adding single chars.
> ]]]


RE: [PATCH] Saving a few cycles, part 1/2

Posted by Bert Huijben <be...@qqmail.nl>.

> -----Original Message-----
> From: Stefan Fuhrmann [mailto:stefanfuhrmann@alice-dsl.de]
> Sent: Sunday 25 April 2010 16:52
> To: dev@subversion.apache.org
> Subject: [PATCH] Saving a few cycles, part 1/2
> 
> Hi devs,
> 
> further reducing my backlog of patches sitting in my
> working copy, this and the next patch optimize code
> locally - shaving off cycles here and there. The net
> effect is somewhere between 3 and 10 percent
> for repository access (ls, export, etc.).
> 
> In this patch, I eliminated calls to memcpy for small
> copies as they are particularly expensive in the MS CRT.

If you don't use the intrinsics (or IA-64), this is the source code from the
MS CRT in VS2010.
(See %PROGRAMFILES%\Microsoft Visual Studio 10.0\VC\crt\src\memcpy.c)

void * __cdecl memcpy (
        void * dst,
        const void * src,
        size_t count
        )
{
        void * ret = dst;

        /*
         * copy from lower addresses to higher addresses
         */
        while (count--) {
                *(char *)dst = *(char *)src;
                dst = (char *)dst + 1;
                src = (char *)src + 1;
        }

        return(ret);
}

Where would this be slower than the code you tried to replace it with?

Are you sure the overhead isn't in calling the C runtime in a separate DLL?

	Bert

Re: [PATCH] Saving a few cycles, part 1/2

Posted by Stefan Fuhrmann <st...@alice-dsl.de>.
Philip Martin wrote:
> Stefan Fuhrmann <st...@alice-dsl.de> writes:
>
>   
>> further reducing my backlog of patches sitting in my
>> working copy, this and the next patch optimize code
>> locally - shaving off cycles here and there. The net
>> effect is somewhere between 3 and 10 percent
>> for repository access (ls, export, etc.).
>>
>> In this patch, I eliminated calls to memcpy for small
>> copies as they are particularly expensive in the MS CRT.
>>     
>
> For gcc (on Linux at least) memcpy is automatically inlined for small
> copies.  Obscuring the memcpy could well result in worse code.
>   
Variable-sized memcpy is hard to inline. The MS
compiler, at any rate, won't use intrinsics here.
But a "static inline fast_memcpy" function would
probably be the best solution.
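
Such a helper might look roughly like the sketch below; the name follows
the suggestion above, while the threshold and the rest are assumptions,
not committed code (a real version would also be marked inline, e.g. via
APR_INLINE).

#include <stddef.h>
#include <string.h>

/* Sketch of a small-copy helper: keeps the call sites readable while
   avoiding a CRT call for very short, variable-length copies. */
static void
fast_memcpy(void *dst, const void *src, size_t len)
{
  if (len < 8)   /* hypothetical threshold */
    {
      /* Short run: the call overhead would dominate, so copy by hand. */
      char *d = dst;
      const char *s = src;
      while (len--)
        *d++ = *s++;
    }
  else
    {
      /* Long run: the library's bulk copy wins. */
      memcpy(dst, src, len);
    }
}
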
>   
>> @@ -594,31 +636,46 @@
>>               semantics aren't guaranteed for overlapping memory areas,
>>               and target copies are allowed to overlap to generate
>>               repeated data.  */
>> -          assert(op->offset < tpos);
>> -          for (i = op->offset, j = tpos; i < op->offset + buf_len; i++)
>> -            tbuf[j++] = tbuf[i];
>>     
>
> Why are we not using memmove?
>   
In LZW-style (de-)compression, overlapping copies
are used to generate repeating patterns. This is
exactly what memmove would prevent (I tried it
just out of curiosity, and it failed).
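
A minimal standalone demonstration of that effect is sketched below (not
the actual txdelta code; buffer and lengths are made up):

#include <stdio.h>

int main(void)
{
  char tbuf[16] = "ab";   /* target buffer already holds the 2-byte pattern */
  size_t tpos = 2;        /* next write position */
  size_t offset = 0;      /* copy source: start of the pattern */
  size_t len = 10;        /* instruction: copy 10 bytes from an overlapping range */
  size_t i;

  /* Forward byte-by-byte copy: every byte written becomes source data
     for a later byte, so the 2-byte pattern is expanded to 12 bytes. */
  for (i = 0; i < len; i++)
    tbuf[tpos + i] = tbuf[offset + i];

  tbuf[tpos + len] = '\0';
  printf("%s\n", tbuf);   /* prints "abababababab" */

  /* memmove(tbuf + tpos, tbuf + offset, len) would instead copy as if
     through a temporary buffer, so the repetition would not happen. */
  return 0;
}
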
>   
>> --- subversion/libsvn_subr/svn_string.c	(revision 937673)
>> +++ subversion/libsvn_subr/svn_string.c	(working copy)
>> @@ -391,20 +391,34 @@
>>    apr_size_t total_len;
>>    void *start_address;
>>  
>> -  total_len = str->len + count;  /* total size needed */
>> +  /* This function is frequently called by svn_stream_readline
>> +     adding one char at a time. Eliminate the 'evil' memcpy in
>> +     that case unless the buffer must be resized. */
>>  
>>     
>
> If we use it a lot then optimising for a single byte might be
> worthwhile.  Perhaps we should write svn_stringbuf_appendbyte?
>   
It is used frequently but is maybe only called from one
other function. I will write a separate function if I find
another place that adds single chars.

Re: [PATCH] Saving a few cycles, part 1/2

Posted by John Szakmeister <jo...@szakmeister.net>.
On Mon, Apr 26, 2010 at 5:23 AM, Philip Martin
<ph...@wandisco.com> wrote:
> Stefan Fuhrmann <st...@alice-dsl.de> writes:
>
>> further reducing my backlog of patches sitting in my
>> working copy, this and the next patch optimize code
>> locally - shaving off cycles here and there. The net
>> effect is somewhere between 3 and 10 percent
>> for repository access (ls, export, etc.).
>>
>> In this patch, I eliminated calls to memcpy for small
>> copies as they are particularly expensive in the MS CRT.
>
> For gcc (on Linux at least) memcpy is automatically inlined for small
> copies.  Obscuring the memcpy could well result in worse code.

And I believe we're enabling intrinsics in the Windows build... so
it's being inlined there too.  I agree, I don't like obscuring
memcpy() either.

-John

Re: [PATCH] Saving a few cycles, part 1/2

Posted by Philip Martin <ph...@wandisco.com>.
Stefan Fuhrmann <st...@alice-dsl.de> writes:

> further reducing my backlog of patches sitting in my
> working copy, this and the next patch optimize code
> locally - shaving off cycles here and there. The net
> effect is somewhere between 3 and 10 percent
> for repository access (ls, export, etc.).
>
> In this patch, I eliminated calls to memcpy for small
> copies as they are particularly expensive in the MS CRT.

For gcc (on Linux at least) memcpy is automatically inlined for small
copies.  Obscuring the memcpy could well result in worse code.

> @@ -594,31 +636,46 @@
>               semantics aren't guaranteed for overlapping memory areas,
>               and target copies are allowed to overlap to generate
>               repeated data.  */
> -          assert(op->offset < tpos);
> -          for (i = op->offset, j = tpos; i < op->offset + buf_len; i++)
> -            tbuf[j++] = tbuf[i];

Why are we not using memmove?

> --- subversion/libsvn_subr/svn_string.c	(revision 937673)
> +++ subversion/libsvn_subr/svn_string.c	(working copy)
> @@ -391,20 +391,34 @@
>    apr_size_t total_len;
>    void *start_address;
>  
> -  total_len = str->len + count;  /* total size needed */
> +  /* This function is frequently called by svn_stream_readline
> +     adding one char at a time. Eliminate the 'evil' memcpy in
> +     that case unless the buffer must be resized. */
>  

If we use it a lot then optimising for a single byte might be
worthwhile.  Perhaps we should write svn_stringbuf_appendbyte?

-- 
Philip

Re: [PATCH] Saving a few cycles, part 1/2

Posted by Julian Foad <ju...@wandisco.com>.
On Sun, 2010-04-25, Stefan Fuhrmann wrote:
> In this patch, I eliminated calls to memcpy for small
> copies as they are particularly expensive in the MS CRT.

If we do decide it is worth avoiding calls to 'memcpy' for small sizes,
please don't put the expanded code in-line at the points of use, because
that obscures the functional purpose of the code.  Instead, put the
optimised version in a suitably named local function ("fast_memcpy"?) so
that the main function remains easy to read.

- Julian


> [[[
> Eliminate memcpy from critical paths during reading
> data from the repository.
[...]