Posted to dev@httpd.apache.org by Brian Pane <bp...@pacbell.net> on 2002/06/07 17:19:34 UTC

Re: PHP profiling results under 2.0.37 Re: Performance of Apache 2.0 Filter

On Fri, 2002-06-07 at 01:14, Sascha Schumann wrote:
> > The function php_apache_sapi_ub_write() is inserting a flush bucket
> > after each bucket of data that it adds to the output brigade.
> 
>     The 'ub' stands for unbuffered.. you can avoid that by
>     enabling output buffering in php.ini.

IMHO, that's a design flaw.  Regardless of whether PHP is doing
buffering, it shouldn't break up blocks of static content into
small pieces--especially not as small as 400 bytes.  While it's
certainly valid for PHP to insert a flush bucket right before a
block of embedded code (in case that code takes a long time to
run), breaking static text into 400-byte chunks will usually mean
that it takes *longer* for the content to reach the client, which
probably defeats PHP's motivation for doing the nonbuffered output.
There's code downstream, in the httpd's core_output_filter and
the OS's TCP driver, that can make much better decisions about
when to buffer and when not to buffer.

--Brian
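
A minimal sketch of the cost being described, in plain C rather than the
real SAPI or APR bucket code. The names out_t, out_write, out_flush and
emit_static are hypothetical, and one "send" stands in for one trip
through the output filter chain to the network. It only counts how often
pending output is pushed downstream when static text is emitted in
400-byte chunks, with and without a flush after every chunk.

```c
#include <stddef.h>
#include <string.h>

#define CHUNK 400   /* chunk size mentioned in the thread */

/* Hypothetical output sink: buffers bytes and counts how many times
 * pending data is pushed downstream. */
typedef struct {
    char   buf[8192];
    size_t used;
    int    sends;
} out_t;

static void out_init(out_t *o) { o->used = 0; o->sends = 0; }

/* Push whatever is pending downstream (no-op when nothing is pending). */
static void out_flush(out_t *o)
{
    if (o->used > 0) {
        o->sends++;
        o->used = 0;
    }
}

/* Append data; send on its own only when the buffer fills up. */
static void out_write(out_t *o, const char *data, size_t len)
{
    while (len > 0) {
        size_t room = sizeof(o->buf) - o->used;
        size_t n = len < room ? len : room;
        memcpy(o->buf + o->used, data, n);
        o->used += n;
        data += n;
        len -= n;
        if (o->used == sizeof(o->buf))
            out_flush(o);
    }
}

/* Emit `total` bytes of static text in CHUNK-sized pieces.  When
 * flush_each is nonzero, flush after every piece, as an unbuffered
 * write would.  Returns the number of sends. */
static int emit_static(size_t total, int flush_each)
{
    char chunk[CHUNK];
    out_t o;

    memset(chunk, 'x', sizeof(chunk));
    out_init(&o);
    while (total > 0) {
        size_t n = total < CHUNK ? total : CHUNK;
        out_write(&o, chunk, n);
        if (flush_each)
            out_flush(&o);
        total -= n;
    }
    out_flush(&o);   /* end of response */
    return o.sends;
}
```

Under these assumptions, 8000 bytes of static text cost 20 sends with a
flush per chunk and a single send without.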

Re: PHP profiling results under 2.0.37 Re: Performance of Apache 2.0 Filter

Posted by Aaron Bannert <aa...@clove.org>.

On Fri, Jun 07, 2002 at 06:49:48PM +0200, Sascha Schumann wrote:
>     I doubt that core_output_filter knows the script author's
>     intentions very well.  Anyway, Aaron and Cliff posted a patch
>     which was committed by Sebastian in mid-April which
>     introduced this behaviour.
> 
>     /* Add a Flush bucket to the end of this brigade, so that
>      * the transient buckets above are more likely to make it out
>      * the end of the filter instead of having to be copied into
>      * someone's setaside. */
>     b = apr_bucket_flush_create(ba);
>     APR_BRIGADE_INSERT_TAIL(bb, b);
> 
>     The need for this should be reassessed.  Aaron/Cliff, can you
>     please have a look at this?
> 
>     Please keep in mind that script authors can always use
>     flush() to insert the flush-bucket.

I guess it really only comes down to a question of how the SAPI intended
the ub_write() function to be implemented. If it is indeed supposed to
implement an unbuffered write, then it seems to me that we must pass
a flush bucket down the output filter chain. If that is not the case,
then by all means let's get rid of that flush bucket.

-aaron

Re: [PHP-DEV] RE: PHP profiling results under 2.0.37 Re: Performance of Apache 2.0 Filter

Posted by Zeev Suraski <ze...@zend.com>.

At 07:29 PM 6/10/2002, Aaron Bannert wrote:
>On Mon, Jun 10, 2002 at 11:46:46AM +0300, Zeev Suraski wrote:
> > What we need for efficient thread-safe operation is a mechanism like the
> > Win32 heaps - mutexless heaps, that provide malloc and free services on a
> > (preferably) contiguous pre-allocated block of memory.  The question is
> > whether the APR allocators fall into that category:
> >
> > 1.  Can you make them mutexless completely?  I.e., will they never call
> > malloc()?
>
>APR's pools only use malloc() as a portable way to retrieve large
>chunks of heapspace that are never returned. I don't know of any
>other portable way to do this.

There probably isn't.  Win32 heaps take advantage of the virtual memory 
functions (they pre-allocate the max heap size, but commit as necessary), 
and there's probably no portable way of doing that.

>In any case, at some level you will always have a mutex. Either you
>are mapping new segments into the memory space of the process, or
>you are dealing with freelists in userspace.

I'm not sure whether VirtualAlloc(..., MEM_COMMIT) takes a mutex; it will 
probably just not context-switch until it's done.  But you may be right.

> > 3.  As far as I can tell, they don't use a contiguous block of memory,
> > which means more fragmentation...
>
>I'm not sure how contiguity relates to fragmentation. With a pool
>you can do mallocs all day long, slowly accumulating more 8K blocks
>(which may or may not be contiguous). At the end of the pool lifetime
>(let's say, for example, at the end of a request) then those blocks
>are placed on a freelist, and the sub-partitions within those blocks
>are simply forgotten. On the next request, the process starts over again.

The fragmentation-related advantage of using a contiguous block is that PHP 
always ends up freeing ALL of the data in the heap at the end of every 
request.  So, you get to start with the same completely-free, contiguous 
block on every request.  However, if you don't have a contiguous block, and 
you use malloc() calls to satisfy certain allocation requests, any 
persistent malloc() which occurs during the request may end up being in the 
same area as your per-request allocations.  Then, even once you free all of 
the per-request blocks, you may no longer be able to allocate large chunks 
- because some persistent malloc()'s may be stuck in the middle.  Am I 
missing something here?
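
The scenario can be modelled with a toy heap. This is a sketch only: a
100-slot array with first-fit allocation, not PHP's or APR's allocator,
and heap_t, heap_alloc and largest_after_request are hypothetical names.
It compares the largest free run left after a request when a persistent
allocation shares the per-request heap with when it lives elsewhere.

```c
#include <stddef.h>
#include <string.h>

#define HEAP_SLOTS 100

/* Toy heap: one byte per slot, nonzero means the slot is in use. */
typedef struct { unsigned char used[HEAP_SLOTS]; } heap_t;

static void heap_init(heap_t *h) { memset(h->used, 0, sizeof(h->used)); }

/* First-fit allocation of n (> 0) adjacent slots; returns the start
 * index, or -1 when no free run is long enough. */
static int heap_alloc(heap_t *h, int n)
{
    int run = 0;
    for (int i = 0; i < HEAP_SLOTS; i++) {
        run = h->used[i] ? 0 : run + 1;
        if (run == n) {
            int start = i - n + 1;
            memset(h->used + start, 1, (size_t)n);
            return start;
        }
    }
    return -1;
}

static void heap_free(heap_t *h, int start, int n)
{
    memset(h->used + start, 0, (size_t)n);
}

/* Length of the longest run of free slots. */
static int largest_free_run(const heap_t *h)
{
    int best = 0, run = 0;
    for (int i = 0; i < HEAP_SLOTS; i++) {
        run = h->used[i] ? 0 : run + 1;
        if (run > best)
            best = run;
    }
    return best;
}

/* One request: two 40-slot per-request allocations with a 10-slot
 * persistent allocation made in between.  If `shared` is nonzero the
 * persistent block comes from the per-request heap, otherwise from a
 * separate one.  Returns the largest free run in the per-request heap
 * after all per-request blocks have been freed. */
static int largest_after_request(int shared)
{
    heap_t request_heap, persistent_heap;
    heap_init(&request_heap);
    heap_init(&persistent_heap);

    int a = heap_alloc(&request_heap, 40);
    (void)heap_alloc(shared ? &request_heap : &persistent_heap, 10);
    int b = heap_alloc(&request_heap, 40);

    heap_free(&request_heap, a, 40);
    heap_free(&request_heap, b, 40);
    return largest_free_run(&request_heap);
}
```

In the shared case 90 slots are free after the request, but the largest
contiguous run is only 50 because the persistent block sits in the
middle. With a separate per-request heap the full 100 slots come back.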

>I think to properly abstract a memory allocation scheme that can be
>implemented in a way that is optimized for the particular SAPI module,
>we'll have to abstract out a few concepts. This list is not exhaustive,
>but is just a quick sketch based on my understanding of Win32 heaps
>and APR pools:
>
>    - creation (called once per server lifetime)
>    - malloc (called many times per request)
>    - free (called many times per request)
>    - end-of-request (called many times per request)

(happens once per request)

>    - destruction (called once per server lifetime)
>
>Does this cover all our bases?

There are also some persistent malloc's that happen during a request, and 
do not get freed at the end of the request.  But generally yes.

>  For example, when using pools, the
>free() call would do nothing, and the end-of-request call would simply
>call apr_pool_clear(). Note that this only applies to dynamically
>allocated memory required for the lifetime of a request. For memory
>with longer lifetimes we could make the creation and destruction
>routines more generic.

I know, but that's really not an option :)  This is how PHP/FI 2 used to 
work, and it had horrible memory performance.  We allocate and free *a lot* 
during a request.  We've worked very hard on freeing data as soon as we 
possibly can, so using the pool approach of cleaning everything at the 
end of a request will reduce our memory performance radically.  What we 
currently have is a memory allocator that is quite suitable for the job - 
it supports malloc and free and it caches small blocks for reuse.  Still, 
under Windows, moving to a heap made a big difference - it eliminated the 
mutex overhead and reduced fragmentation.  The solution under UNIX may very 
well be increasing the block cache to work with larger blocks as well, and 
a larger number of blocks of each size - but this will definitely increase 
fragmentation big time :(

Zeev
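
For readers unfamiliar with the idea, "malloc and free with a cache of
small blocks" can be sketched roughly as below. This is not the code in
zend_alloc.c; block_cache, cache_alloc, cache_free and the size classes
are hypothetical, and the caller passes the size to cache_free where a
real allocator would keep it in a block header. Freed small blocks go on
a per-size free list and are handed out again before malloc() is called.

```c
#include <stdlib.h>
#include <stddef.h>

#define CLASS_SIZE  16   /* size classes are multiples of 16 bytes */
#define NUM_CLASSES 16   /* blocks up to 256 bytes are cached */

/* A cached block stores the free-list link in its own first bytes. */
typedef struct free_block { struct free_block *next; } free_block;

typedef struct {
    free_block *cache[NUM_CLASSES];
    int hits;     /* allocations served from the cache */
    int misses;   /* allocations that went to malloc() */
} block_cache;

/* 1 means up to 16 bytes, 2 means up to 32 bytes, and so on. */
static size_t size_class(size_t size)
{
    if (size == 0)
        size = 1;
    return (size + CLASS_SIZE - 1) / CLASS_SIZE;
}

static void *cache_alloc(block_cache *c, size_t size)
{
    size_t cls = size_class(size);
    if (cls <= NUM_CLASSES && c->cache[cls - 1] != NULL) {
        free_block *b = c->cache[cls - 1];
        c->cache[cls - 1] = b->next;
        c->hits++;
        return b;
    }
    c->misses++;
    return malloc(cls * CLASS_SIZE);
}

/* Small blocks are kept for reuse; large ones go back to the system. */
static void cache_free(block_cache *c, void *ptr, size_t size)
{
    size_t cls = size_class(size);
    if (ptr == NULL)
        return;
    if (cls <= NUM_CLASSES) {
        free_block *b = ptr;
        b->next = c->cache[cls - 1];
        c->cache[cls - 1] = b;
    } else {
        free(ptr);
    }
}

/* Release everything still sitting in the cache. */
static void cache_destroy(block_cache *c)
{
    for (size_t i = 0; i < NUM_CLASSES; i++) {
        while (c->cache[i] != NULL) {
            free_block *b = c->cache[i];
            c->cache[i] = b->next;
            free(b);
        }
    }
}
```

Growing NUM_CLASSES or keeping more blocks per class is the "larger
block cache" trade-off mentioned above: fewer trips to malloc(), at the
price of more memory held in free lists.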

Re: [PHP-DEV] RE: PHP profiling results under 2.0.37 Re: Performance of Apache 2.0 Filter

Posted by Aaron Bannert <aa...@clove.org>.

On Mon, Jun 10, 2002 at 09:29:58AM -0700, Aaron Bannert wrote:
>    - creation (called once per server lifetime)
>    - malloc (called many times per request)
>    - free (called many times per request)
>    - end-of-request (called many times per request)

(Whoops, that should have been -- called once at the end of the request)

>    - destruction (called once per server lifetime)

-a

Re: [PHP-DEV] RE: PHP profiling results under 2.0.37 Re: Performance of Apache 2.0 Filter

Posted by Aaron Bannert <aa...@clove.org>.

On Mon, Jun 10, 2002 at 11:46:46AM +0300, Zeev Suraski wrote:
> What we need for efficient thread-safe operation is a mechanism like the 
> Win32 heaps - mutexless heaps, that provide malloc and free services on a 
> (preferably) contiguous pre-allocated block of memory.  The question is 
> whether the APR allocators fall into that category:
> 
> 1.  Can you make them mutexless completely?  I.e., will they never call 
> malloc()?

APR's pools only use malloc() as a portable way to retrieve large
chunks of heapspace that are never returned. I don't know of any
other portable way to do this.

In any case, at some level you will always have a mutex. Either you
are mapping new segments into the memory space of the process, or
you are dealing with freelists in userspace. APR pools take the
approach that by doing more up-front segment mapping and delayed
freeing of chunks, we avoid many of the mutexes and overhead of
freelist management. It's still got to happen somewhere though.

> 2.  They definitely do provide alloc/free services, we're ok there

Pretty much, but it's not exactly the same. I'll outline some thoughts
on a potential memory allocation abstraction that could be implemented
w/ apr_pools or Win32 heaps below...

> 3.  As far as I can tell, they don't use a contiguous block of memory, 
> which means more fragmentation...

I'm not sure how contiguity relates to fragmentation. With a pool
you can do mallocs all day long, slowly accumulating more 8K blocks
(which may or may not be contiguous). At the end of the pool lifetime
(let's say, for example, at the end of a request) then those blocks
are placed on a freelist, and the sub-partitions within those blocks
are simply forgotten. On the next request, the process starts over again.

I think to properly abstract a memory allocation scheme that can be
implemented in a way that is optimized for the particular SAPI module,
we'll have to abstract out a few concepts. This list is not exhaustive,
but is just a quick sketch based on my understanding of Win32 heaps
and APR pools:

   - creation (called once per server lifetime)
   - malloc (called many times per request)
   - free (called many times per request)
   - end-of-request (called many times per request)
   - destruction (called once per server lifetime)

Does this cover all our bases? For example, when using pools, the
free() call would do nothing, and the end-of-request call would simply
call apr_pool_clear(). Note that this only applies to dynamically
allocated memory required for the lifetime of a request. For memory
with longer lifetimes we could make the creation and destruction
routines more generic.

-aaron
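
One way the five operations listed above could be expressed is a small
table of function pointers. This is a sketch under assumptions: the
names sapi_allocator, pool_allocator and pool_allocator_create are
hypothetical, and the backend is a plain C bump arena standing in for an
APR pool, so that free does nothing and end-of-request resets the arena
(the role apr_pool_clear() would play with real pools).

```c
#include <stdlib.h>
#include <stddef.h>

/* Hypothetical allocator interface covering malloc, free,
 * end-of-request and destruction; creation is a per-backend function. */
typedef struct sapi_allocator {
    void *(*alloc_fn)(struct sapi_allocator *self, size_t size);
    void  (*free_fn)(struct sapi_allocator *self, void *ptr);
    void  (*end_of_request)(struct sapi_allocator *self);
    void  (*destroy)(struct sapi_allocator *self);
} sapi_allocator;

/* Pool-style backend: a single arena handed out front to back. */
typedef struct {
    sapi_allocator base;    /* first member, so the pointers convert */
    unsigned char *arena;
    size_t size;
    size_t used;
} pool_allocator;

static void *pool_alloc(sapi_allocator *self, size_t size)
{
    pool_allocator *p = (pool_allocator *)self;
    size_t aligned = (size + 15u) & ~(size_t)15u;   /* 16-byte steps */
    if (aligned < size || aligned > p->size - p->used)
        return NULL;                                /* arena exhausted */
    void *ptr = p->arena + p->used;
    p->used += aligned;
    return ptr;
}

/* With a pool, individual frees do nothing. */
static void pool_free(sapi_allocator *self, void *ptr)
{
    (void)self;
    (void)ptr;
}

/* Everything allocated during the request is forgotten at once. */
static void pool_end_of_request(sapi_allocator *self)
{
    ((pool_allocator *)self)->used = 0;
}

static void pool_destroy(sapi_allocator *self)
{
    pool_allocator *p = (pool_allocator *)self;
    free(p->arena);
    free(p);
}

/* Creation: called once per server (or thread) lifetime. */
static sapi_allocator *pool_allocator_create(size_t arena_size)
{
    pool_allocator *p = malloc(sizeof(*p));
    if (p == NULL)
        return NULL;
    p->arena = malloc(arena_size);
    if (p->arena == NULL) {
        free(p);
        return NULL;
    }
    p->size = arena_size;
    p->used = 0;
    p->base.alloc_fn = pool_alloc;
    p->base.free_fn = pool_free;
    p->base.end_of_request = pool_end_of_request;
    p->base.destroy = pool_destroy;
    return &p->base;
}
```

A heap-style backend would fill in the same table with a real free_fn
and could leave end_of_request as a bulk release.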

Re: [PHP-DEV] RE: PHP profiling results under 2.0.37 Re: Performance of Apache 2.0 Filter

Posted by Zeev Suraski <ze...@zend.com>.

At 09:03 AM 6/10/2002, Sander Striker wrote:
>Why is PHP even using its own memory allocation scheme?  It would be much
>easier to just use pools and point out where it doesn't work for you.

Because we don't want it to depend on any underlying services which aren't 
available in all servers.  We can say that in general, the no-free 
allocation scheme doesn't work at all with PHP, so the pool approach cannot 
be used.  Even if we could use it, though, all of the services of the 
memory allocator are still necessary at the PHP level so we can provide 
them outside the scope of Apache 2.

What we need for efficient thread-safe operation is a mechanism like the 
Win32 heaps - mutexless heaps, that provide malloc and free services on a 
(preferably) contiguous pre-allocated block of memory.  The question is 
whether the APR allocators fall into that category:

1.  Can you make them mutexless completely?  I.e., will they never call 
malloc()?
2.  They definitely do provide alloc/free services, we're ok there
3.  As far as I can tell, they don't use a contiguous block of memory, 
which means more fragmentation...

Zeev

RE: PHP profiling results under 2.0.37 Re: Performance of Apache 2.0 Filter

Posted by Sander Striker <st...@apache.org>.

> From: Andi Gutmans [mailto:andi@zend.com]
> Sent: 08 June 2002 13:59

> On Fri, 7 Jun 2002, Ryan Bloom wrote:
> 
>> On Fri, 7 Jun 2002, Brian Pane wrote:
>> 
>>> On Fri, 2002-06-07 at 01:14, Sascha Schumann wrote:
>>>>> The function php_apache_sapi_ub_write() is inserting a flush bucket
>>>>> after each bucket of data that it adds to the output brigade.
>>>> 
>>>>     The 'ub' stands for unbuffered.. you can avoid that by
>>>>     enabling output buffering in php.ini.
>>> 
>>> IMHO, that's a design flaw.  Regardless of whether PHP is doing
>>> buffering, it shouldn't break up blocks of static content into
>>> small pieces--especially not as small as 400 bytes.  While it's
>>> certainly valid for PHP to insert a flush bucket right before a
>>> block of embedded code (in case that code takes a long time to
>>> run), breaking static text into 400-byte chunks will usually mean
>>> that it takes *longer* for the content to reach the client, which
>>> probably defeats PHP's motivation for doing the nonbuffered output.
>>> There's code downstream, in the httpd's core_output_filter and
>>> the OS's TCP driver, that can make much better decisions about
>>> when to buffer and when not to buffer.
>> 
> 
> We did quite a lot of tests on this issue a couple of years ago and found
> that not scanning all of the embedded HTML in one big piece but breaking
> it down to chunks around 400 bytes yields better performance.
> 
> I can say that in general, best performance with PHP is achieved when
> using full output buffering. ASP seems to do the same and while we were at
> MS Labs doing benchmarks of PHP vs. ASP this was one of the settings we
> found made PHP compete well with ASP.
> 
> Another change we made, as I mentioned in a previous Email, was using
> non-mutexing per-thread memory pools (HeapCreate(HEAP_NO_SERIALIZE, ...)).
> To get best performance with Apache 2 we would really need such a memory
> pool.

And we already have it!  Just do:

  apr_allocator_t *allocator;
  apr_pool_t *pool;

  apr_allocator_create(&allocator);
  apr_pool_create_ex(&pool, parent_pool, abort_fn, allocator);
  apr_allocator_owner_set(allocator, pool);

Now you have a mutexless allocator associated with a pool.  All child pools
of this pool will use the same allocator and will therefore also have no
mutex.

Apache 2.0 uses this extensively in the mpms where we indeed know that only
one thread is going to be accessing a pool.

Why is PHP even using its own memory allocation scheme?  It would be much
easier to just use pools and point out where it doesn't work for you.

Sidenote: the mutex isn't even being used on each and every allocation, only
          when a full 8k block is fully consumed and a new one is needed.
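
To put a rough number on that sidenote, here is a small model in plain C.
It is not APR code: pool_model and refills_for are hypothetical names,
block header overhead is ignored, and every request is assumed to be no
larger than one block. It counts how many allocations would reach the
slow path (the only point where a lock would be taken) when small
objects are carved out of 8K blocks.

```c
#include <stddef.h>

#define BLOCK_SIZE 8192   /* the 8k block size mentioned above */

typedef struct {
    size_t remaining;   /* bytes left in the current block */
    int    refills;     /* times a new block had to be fetched */
} pool_model;

/* Model one allocation of `size` bytes (size <= BLOCK_SIZE). */
static void pool_model_alloc(pool_model *p, size_t size)
{
    if (size > p->remaining) {
        /* Slow path: a real allocator would lock and fetch a block. */
        p->refills++;
        p->remaining = BLOCK_SIZE;
    }
    /* Fast path: bump within the current block, no lock needed. */
    p->remaining -= size;
}

/* Number of slow-path hits for `count` allocations of `size` bytes. */
static int refills_for(int count, size_t size)
{
    pool_model p = {0, 0};
    for (int i = 0; i < count; i++)
        pool_model_alloc(&p, size);
    return p.refills;
}
```

In this model, 1000 allocations of 64 bytes reach the slow path 8 times,
since 128 such objects fit in each block.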

[Pasted this in from a mail in the same thread]
>>* php_request_shutdown() calls shutdown_memory_manager(), which
>>  does a large number of calls to free() per request.  If there's
>>  any way to get the PHP allocator to use an APR pool, that
>>  should help speed things up a lot.  (The mallocs and frees are
>>  going to be especially problematic within multithreaded MPMs.)

> We're already doing this for Win32. Check out 
> ZEND_DO_MALLOC/ZEND_DO_FREE/ZEND_DO_REALLOC in zend_alloc.c. Note that in 
> Win32 we only skip the free's if we're in release mode. If we're in debug 
> mode we use a per-thread pool but we do the frees because it's our memory 
> leak detector.

Just run with --enable-pool-debug=yes and enable a tool like Electric Fence
or valgrind.  No need to put that stuff in your own code.
 
> Andi

Sander

Re: PHP profiling results under 2.0.37 Re: Performance of Apache 2.0 Filter

Posted by Andi Gutmans <an...@zend.com>.

On Fri, 7 Jun 2002, Ryan Bloom wrote:

> On Fri, 7 Jun 2002, Brian Pane wrote:
> 
> > On Fri, 2002-06-07 at 01:14, Sascha Schumann wrote:
> > > > The function php_apache_sapi_ub_write() is inserting a flush bucket
> > > > after each bucket of data that it adds to the output brigade.
> > > 
> > >     The 'ub' stands for unbuffered.. you can avoid that by
> > >     enabling output buffering in php.ini.
> > 
> > IMHO, that's a design flaw.  Regardless of whether PHP is doing
> > buffering, it shouldn't break up blocks of static content into
> > small pieces--especially not as small as 400 bytes.  While it's
> > certainly valid for PHP to insert a flush bucket right before a
> > block of embedded code (in case that code takes a long time to
> > run), breaking static text into 400-byte chunks will usually mean
> > that it takes *longer* for the content to reach the client, which
> > probably defeats PHP's motivation for doing the nonbuffered output.
> > There's code downstream, in the httpd's core_output_filter and
> > the OS's TCP driver, that can make much better decisions about
> > when to buffer and when not to buffer.
> 

We did quite a lot of tests on this issue a couple of years ago and found
that not scanning all of the embedded HTML in one big piece but breaking
it down to chunks around 400 bytes yields better performance.

I can say that in general, best performance with PHP is achieved when
using full output buffering. ASP seems to do the same and while we were at
MS Labs doing benchmarks of PHP vs. ASP this was one of the settings we
found made PHP compete well with ASP.

Another change we made, as I mentioned in a previous Email, was using
non-mutexing per-thread memory pools (HeapCreate(HEAP_NO_SERIALIZE, ...)).
To get best performance with Apache 2 we would really need such a memory
pool.

Andi

Re: PHP profiling results under 2.0.37 Re: Performance of Apache 2.0 Filter

Posted by Ryan Bloom <rb...@ntrnet.net>.

On Fri, 7 Jun 2002, Brian Pane wrote:

> On Fri, 2002-06-07 at 01:14, Sascha Schumann wrote:
> > > The function php_apache_sapi_ub_write() is inserting a flush bucket
> > > after each bucket of data that it adds to the output brigade.
> > 
> >     The 'ub' stands for unbuffered.. you can avoid that by
> >     enabling output buffering in php.ini.
> 
> IMHO, that's a design flaw.  Regardless of whether PHP is doing
> buffering, it shouldn't break up blocks of static content into
> small pieces--especially not as small as 400 bytes.  While it's
> certainly valid for PHP to insert a flush bucket right before a
> block of embedded code (in case that code takes a long time to
> run), breaking static text into 400-byte chunks will usually mean
> that it takes *longer* for the content to reach the client, which
> probably defeats PHP's motivation for doing the nonbuffered output.
> There's code downstream, in the httpd's core_output_filter and
> the OS's TCP driver, that can make much better decisions about
> when to buffer and when not to buffer.

I would think that this would be a case where the php developer is
stating that the server's core_output_filter and the OS's TCP driver CAN'T
make a better decision.

If I were a php developer, I would only use the unbuffered write call if I
was about to perform an operation that was likely to take a long
time.  That way, I could at least start to get content to the user ASAP
(most likely a message explaining the delay), then I could do the
long-lasting operation without worrying about my user clicking reload.

Ryan

_______________________________________________________________________________
Ryan Bloom                        	rbb@apache.org
550 Jean St
Oakland CA 94610
-------------------------------------------------------------------------------

Re: PHP profiling results under 2.0.37 Re: Performance of Apache 2.0 Filter

Posted by Sascha Schumann <sa...@apache.org>.

> IMHO, that's a design flaw.  Regardless of whether PHP is doing
> buffering, it shouldn't break up blocks of static content into
> small pieces--especially not as small as 400 bytes.  While it's

    That sounds like the input size of the lexer which is a part
    which I'm not particularly proud of.

> certainly valid for PHP to insert a flush bucket right before a
> block of embedded code (in case that code takes a long time to
> run), breaking static text into 400-byte chunks will usually mean
> that it takes *longer* for the content to reach the client, which
> probably defeats PHP's motivation for doing the nonbuffered output.
> There's code downstream, in the httpd's core_output_filter and
> the OS's TCP driver, that can make much better decisions about
> when to buffer and when not to buffer.

    I doubt that core_output_filter knows the script author's
    intentions very well.  Anyway, Aaron and Cliff posted a patch
    which was committed by Sebastian in mid-April which
    introduced this behaviour.

    /* Add a Flush bucket to the end of this brigade, so that
     * the transient buckets above are more likely to make it out
     * the end of the filter instead of having to be copied into
     * someone's setaside. */
    b = apr_bucket_flush_create(ba);
    APR_BRIGADE_INSERT_TAIL(bb, b);

    The need for this should be reassessed.  Aaron/Cliff, can you
    please have a look at this?

    Please keep in mind that script authors can always use
    flush() to insert the flush-bucket.

    - Sascha