Posted to dev@apr.apache.org by Brian Pane <bp...@pacbell.net> on 2002/06/29 06:52:33 UTC

[PATCH] Re: 2.0 performance Re: Breaking something? Now is the time?

Brian Pane wrote:

> Bill Stoddard wrote:
> ...


>> The time spent in ap_brigade_puts is
>> surprising...  This particular run indicates that it took 74355
>> instructions
>> to serve a keep-alive request.
>>
>
> I've seen the brigade_puts overhead in my testing, too...and it
> is definitely surprising, since the code is relatively minimal.
> The only obvious (potential) speedup I can think of would be to
> replace the char-at-a-time loop with a memcpy (with checks to
> make sure it doesn't overflow the available size).  I'll try this
> over the weekend. 


I remembered why memcpy won't help here: we don't know the
length in advance.  But I managed to speed up apr_brigade_puts()
by about 30% in my tests by optimizing its main loop.  Does this
patch reduce the apr_brigade_puts() overhead in your test environment?
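For concreteness, the kind of main loop under discussion can be sketched
roughly like this (a simplified stand-in, not the actual apr_brigade_puts()
source; the name sketch_puts is made up):

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Simplified sketch of a char-at-a-time puts loop: copy from a
 * NUL-terminated string into a fixed buffer until either the string
 * ends or the buffer is full.  Returns the number of bytes copied.
 * Note the two checks (end-of-string and end-of-buffer) per byte. */
static size_t sketch_puts(char *buf, size_t avail, const char *str)
{
    size_t n = 0;
    while (*str != '\0' && n < avail) {
        buf[n++] = *str++;
    }
    return n;
}
```

Because the input length is unknown until the NUL is found, a plain
memcpy of the whole string is not an option here.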

--Brian


RE: [PATCH] Re: 2.0 performance Re: Breaking something? Now is the time?

Posted by Bill Stoddard <bi...@wstoddard.com>.
> >> The time spent in ap_brigade_puts is
> >> surprising...  This particular run indicates that it took 74355
> >> instructions
> >> to serve a keep-alive request.
> >>
> >
> > I've seen the brigade_puts overhead in my testing, too...and it
> > is definitely surprising, since the code is relatively minimal.
> > The only obvious (potential) speedup I can think of would be to
> > replace the char-at-a-time loop with a memcpy (with checks to
> > make sure it doesn't overflow the available size).  I'll try this
> > over the weekend.
>
>
> I remembered why memcpy won't help here: we don't know the
> length in advance.  But I managed to speed up apr_brigade_puts()
> by about 30% in my tests by optimizing its main loop.  Does this
> patch reduce the apr_brigade_puts() overhead in your test environment?
>
> --Brian

Haven't had a chance to profile it yet, but the patch seems to provide a 1
to 2% improvement in throughput serving a 500-byte file out of
mod_mem_cache.

Bill


Re: [PATCH] Re: 2.0 performance Re: Breaking something? Now is the time?

Posted by Brian Pane <bp...@pacbell.net>.
Cliff Woolley wrote:

>On Sat, 29 Jun 2002, Brian Pane wrote:
>
>>I tried this, and it didn't unroll the loop.  That's probably
>>because some of the information needed to unroll the loop effectively
>>is unknown to the compiler.
>
>Hm.  Okay, well, if we're going to do this, can we split it out into a
>separate macro (my_strncpy or something) so it's clear what's going on and
>to avoid cluttering up that function?
>

I have reservations about making it a macro, because that would
confuse debuggers and any profilers that do basic-block or line-level
profiling.

I also don't mind cluttering the function.  The value of
a low-level I/O API like apr_brigade_puts() is that it hides
ugly buffer management details from application code, so that
the application code can stay clean and simple.

--Brian



Re: [PATCH] Re: 2.0 performance Re: Breaking something? Now is the time?

Posted by Cliff Woolley <jw...@virginia.edu>.
On Sat, 29 Jun 2002, Cliff Woolley wrote:

> Also, isn't it true that your patch now causes the buffer bucket to always
> have 0-7 unused bytes at the end?

Oh duh, nevermind on this point, my fault.  I misread the loop condition.

--Cliff


Re: [PATCH] Re: 2.0 performance Re: Breaking something? Now is the time?

Posted by Brian Pane <bp...@pacbell.net>.
Cliff Woolley wrote:

>On Sat, 29 Jun 2002, Cliff Woolley wrote:
>
>>some way that would allow us to coalesce the writes.
>
>Alignment issues would kill us here, wouldn't they?  That sucks.  Grrrr.....
>

We might be able to get some additional improvements by
doing word-at-a-time operations for half of the copy operation:
  - start with the current byte-at-a-time loop
  - as soon as "buf" points to a word-aligned address,
    switch to a mode in which we grab the next sizeof(int)
    bytes from the input string, pack them into an int
    (with ifdef'ed code for big- and little-endian machines),
    and write the int to the target address.
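The steps above might look roughly like the following (a speculative
sketch, not code from the thread; word_copy is a made-up name, and the
portable memcpy-based word store stands in for the ifdef'ed
endian-specific packing the message describes):

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Word-at-a-time sketch: byte copies until the output pointer is
 * word-aligned, then one memory write per sizeof(uint32_t) input
 * bytes, falling back to byte copies when a NUL may be inside the
 * next group of 4 or fewer than 4 writable bytes remain. */
static size_t word_copy(char *buf, size_t avail, const char *str)
{
    size_t n = 0;
    /* Byte-at-a-time until buf + n is 4-byte aligned. */
    while (((uintptr_t)(buf + n) & 3) != 0 && n < avail && *str != '\0') {
        buf[n++] = *str++;
    }
    /* Word-at-a-time: one word store per 4 input bytes. */
    while (avail - n >= 4 &&
           str[0] && str[1] && str[2] && str[3]) {
        uint32_t w;
        memcpy(&w, str, 4);        /* pack 4 bytes into a word */
        memcpy(buf + n, &w, 4);    /* single aligned word store */
        n += 4;
        str += 4;
    }
    /* Tail: remaining bytes one at a time. */
    while (n < avail && *str != '\0') {
        buf[n++] = *str++;
    }
    return n;
}
```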

But that might or might not actually be faster (we'd be doing
more instructions in order to do fewer memory writes).  And
it's more complicated than the unrolled-loop code, of course.
So for now, I'll stick with the unrolled-loop implementation,
since it's showing good results in benchmark testing.

--Brian



Re: [PATCH] Re: 2.0 performance Re: Breaking something? Now is the time?

Posted by Ben Laurie <be...@algroup.co.uk>.
Cliff Woolley wrote:
> On Sat, 29 Jun 2002, Cliff Woolley wrote:
> 
> 
>>some way that would allow us to coalesce the writes.
> 
> 
> Alignment issues would kill us here, wouldn't they?  That sucks.  Grrrr.....

Depends on the CPU, but if you are feeling energetic you can also align 
the copies. The problem is detecting the EOS.
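One well-known answer to the EOS-detection problem is to test a whole
word for an embedded zero byte with bit arithmetic, so a word-at-a-time
copy only drops to byte inspection when the word might contain the
terminating NUL.  This trick is not from the thread; it is the classic
technique used in word-at-a-time strlen implementations:

```c
#include <stdint.h>
#include <string.h>

/* Returns nonzero iff the 32-bit word w contains a 0x00 byte.
 * The subtraction borrows through any zero byte, setting that
 * byte's high bit; masking with ~w filters out bytes >= 0x80. */
static int word_has_zero_byte(uint32_t w)
{
    return ((w - 0x01010101u) & ~w & 0x80808080u) != 0;
}
```

With this test, a copy loop can move four bytes per iteration and only
switch to byte-at-a-time handling for the final word of the string.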

Cheers,

Ben.

-- 
http://www.apache-ssl.org/ben.html       http://www.thebunker.net/

"There is no limit to what a man can do or how far he can go if he
doesn't mind who gets the credit." - Robert Woodruff


Re: [PATCH] Re: 2.0 performance Re: Breaking something? Now is the time?

Posted by Cliff Woolley <jw...@virginia.edu>.
On Sat, 29 Jun 2002, Cliff Woolley wrote:

> some way that would allow us to coalesce the writes.

Alignment issues would kill us here, wouldn't they?  That sucks.  Grrrr.....


Re: [PATCH] Re: 2.0 performance Re: Breaking something? Now is the time?

Posted by Cliff Woolley <jw...@virginia.edu>.
On Sat, 29 Jun 2002, Brian Pane wrote:

> I tried this, and it didn't unroll the loop.  That's probably
> because some of the information needed to unroll the loop effectively
> is unknown to the compiler.

Hm.  Okay, well, if we're going to do this, can we split it out into a
separate macro (my_strncpy or something) so it's clear what's going on and
to avoid cluttering up that function?

Also, isn't it true that your patch now causes the buffer bucket to always
have 0-7 unused bytes at the end?  I'd have to go back and look more
carefully to be sure, but that was the impression I got from first glance.

I also feel like there *has* to be some better way to check for EOS...
some way that would allow us to coalesce the writes.  But I haven't
figured out what that is yet.  I'll keep thinking about it.

--Cliff


Re: [PATCH] Re: 2.0 performance Re: Breaking something? Now is the time?

Posted by Brian Pane <bp...@pacbell.net>.
Roy T. Fielding wrote:

> A better optimization might be to reduce the number of calls to
> brigade_puts.  That's how much of 1.3 was improved.


I only know of three ways to reduce the number of apr_brigade_puts()
calls in 2.0:

  * Send fewer fields in the HTTP response header.

  * Or do more buffering prior to calling apr_brigade_puts().
    (This is what 2.0 used to do, and it was even slower, because
    it added yet another layer of memory copying before the socket
    write.)

  * Or produce a separate bucket for each field in the response
    header, and rely on writev to patch them together.
    (This won't work in 2.0; if the number of tiny buckets
    grows too large, core_output_filter() will try to consolidate
    them into a single bucket, with the associated memcpy cost.)
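For illustration, the second option (buffer first, then one write) might
look like the following sketch; stage_headers and the sample header
fields are made up for the example, and the extra memcpy into the
staging buffer is exactly the cost noted above:

```c
#include <stddef.h>
#include <string.h>

/* Accumulate several header fields (CRLF-terminated) into one local
 * buffer so the caller can issue a single brigade write, instead of
 * one apr_brigade_puts() call per field.  Returns bytes staged;
 * stops early if the buffer cannot hold the next field. */
static size_t stage_headers(char *out, size_t avail,
                            const char *const *fields, size_t nfields)
{
    size_t used = 0;
    for (size_t i = 0; i < nfields; i++) {
        size_t len = strlen(fields[i]);
        if (used + len + 2 > avail) {        /* +2 for CRLF */
            break;
        }
        memcpy(out + used, fields[i], len);  /* the extra copy */
        used += len;
        out[used++] = '\r';
        out[used++] = '\n';
    }
    return used;  /* caller now does ONE write of 'used' bytes */
}
```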

Were you thinking of a different approach from these?

--Brian




Re: [PATCH] Re: 2.0 performance Re: Breaking something? Now is the time?

Posted by "Roy T. Fielding" <fi...@apache.org>.
A better optimization might be to reduce the number of calls to
brigade_puts.  That's how much of 1.3 was improved.

....Roy


Re: [PATCH] Re: 2.0 performance Re: Breaking something? Now is the time?

Posted by Brian Pane <bp...@pacbell.net>.
Cliff Woolley wrote:

>On Fri, 28 Jun 2002, Brian Pane wrote:
>
>>I remembered why memcpy won't help here: we don't know the
>>length in advance.  But I managed to speed up apr_brigade_puts()
>>by about 30% in my tests by optimizing its main loop.  Does this
>>patch reduce the apr_brigade_puts() overhead in your test environment?
>
>Why won't the compiler unroll this loop for you?
>
>gcc -O3 -funroll-loops
>

I tried this, and it didn't unroll the loop.  That's probably
because some of the information needed to unroll the loop effectively
is unknown to the compiler.  The condition for continuing this
loop is: 1) not at the end of the input string, and 2) not at
the end of the target bucket.  We have a "lookahead" capability
on the second condition, but not on the first one.  I.e., we know
how many more bytes remain in the target bucket, and thus we can
unroll the loop into blocks of 'n' character-copy operations with
a check for 'n' available bytes of writable buffer space only
once per iteration.  (We also know that, for small values of 'n',
there are almost always more than 'n' bytes left in the bucket,
so that we can actually take advantage of this optimization in
the real world.)  In contrast, the check for end-of-string
can't be unrolled very effectively: there's no way to avoid
having to put a conditional branch in front of every "*buf++=*str++"
operation.  Thus the patch unrolls the loop in a way that reduces
the number of end-of-bucket checks, even though it's impossible to
reduce the number of end-of-string checks.
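The unrolling described above can be sketched as follows (a hypothetical
simplified stand-in, not the actual patch; sketch_puts_unrolled is a
made-up name, and 4 stands in for 'n').  Note that the end-of-buffer
check runs once per four copies, while the end-of-string check still
guards every copy:

```c
#include <stddef.h>

/* Unrolled copy: one end-of-buffer check per block of 4 byte copies;
 * the end-of-string check cannot be hoisted because we have no
 * lookahead past the NUL.  Returns the number of bytes copied. */
static size_t sketch_puts_unrolled(char *buf, size_t avail, const char *str)
{
    size_t n = 0;
    while (avail - n >= 4) {           /* space check: once per 4 copies */
        if (*str == '\0') return n;
        buf[n++] = *str++;
        if (*str == '\0') return n;
        buf[n++] = *str++;
        if (*str == '\0') return n;
        buf[n++] = *str++;
        if (*str == '\0') return n;
        buf[n++] = *str++;
    }
    /* Tail: fewer than 4 writable bytes left. */
    while (n < avail && *str != '\0') {
        buf[n++] = *str++;
    }
    return n;
}
```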

--Brian



Re: [PATCH] Re: 2.0 performance Re: Breaking something? Now is the time?

Posted by Cliff Woolley <jw...@virginia.edu>.
On Fri, 28 Jun 2002, Brian Pane wrote:

> I remembered why memcpy won't help here: we don't know the
> length in advance.  But I managed to speed up apr_brigade_puts()
> by about 30% in my tests by optimizing its main loop.  Does this
> patch reduce the apr_brigade_puts() overhead in your test environment?

Why won't the compiler unroll this loop for you?

gcc -O3 -funroll-loops

--Cliff