Posted to dev@httpd.apache.org by Brian Pane <bp...@pacbell.net> on 2001/09/05 05:00:35 UTC

remaining CPU bottlenecks in 2.0

I'm currently studying profiling data from an httpd built from
a CVS snapshot earlier today.

In general, the performance of 2.0 is starting to look good.

Here are the top 30 functions, ranked according to their CPU utilization.
(This summary is derived from a Quantify profile of a threaded-MPM httpd
handling mod_include requests on Solaris; thanks to Blaise Tarr at CNET
for the profile data.)

                         CPU time
   function             (% of total)
   --------             ------------
 1. find_start_sequence    23.9
 2. mmap                   14.1
 3. munmap                 10.2
 4. _writev                 9.4
 5. _so_accept              4.7
 6. _lwp_mutex_unlock       4.3
 7. stat                    3.2
 8. _close                  2.4
 9. _write                  2.4
10. __door_return           2.3
                                 cumulative total: 76.9%
11. __open                  2.0
12. _so_getsockname         1.6
13. find_entry              1.5
14. strlen                  1.2
15. strcasecmp              1.2
16. _so_shutdown            1.2
17. _read                   0.84
18. fstat64                 0.82
19. __lwp_sema_post         0.80
20. memset                  0.78
                                 cumulative total: 88.8%
21. apr_hash_next           0.53
22. apr_file_read           0.46
23. _lrw_unlock             0.39
24. __fork1                 0.39
25. apr_palloc              0.39
26. strncpy                 0.28
27. memchr                  0.26
28. _ltzset_u               0.24
29. overlay_extension_mappings    0.20
30. apr_hash_set            0.20
                                 cumulative total: 92.18%

Some notes on the data:

* The discussion here covers only CPU utilization.  There are other
  aspects of performance, like multiprocessor scalability, that
  are independent of this data.

* 7 of the top 10 functions are syscalls that I'd expect to see in
  the top 10 for an httpd. :-)

* find_start_sequence() is the main scanning function within
  mod_include.  There's some research in progress to try to speed
  this up significantly.

* With the exception of find_start_sequence(), all the opportunities
  for optimization are individually small.  I expect that the only
  way to get an additional 5% reduction in CPU usage, for example,
  will be to fix ten things that each account for 0.5%.

* _lwp_mutex_unlock() gets called from pthread_mutex_unlock(),
  but only from a small fraction of pthread_mutex_unlock calls.
  (Can someone familiar with Solaris threading internals explain
  this one?)

* Collectively, stat and open comprise 5% of the total CPU time.
  It would be faster to do open+fstat rather than stat+open (as
  long as the server is delivering mostly 200s rather than 304s),
  but that might be too radical a change.  Anybody have thoughts
  on this?  (A sketch of the two orderings follows these notes.)

* I think the __door_return() call is an anomaly of the profiling
  instrumentation itself.

* find_entry() is part of the apr_hash_t implementation.  Most of
  the calls to find_entry() and apr_hash_next() are coming from the
  mod_mime dir-merge function.

* strlen() is called from lots of places throughout the code, with
  the most frequent calls being from apr_pstrdup, apr_pstrcat, and
  time-formatting functions used in apr_rfc822_date.

* strcasecmp() is used heavily by the apr_table_t implementation
  (sort_overlap, apr_table_get, apr_table_setn).  Converting tables
  to hashes is a potential speedup.  However, most of the table
  API calls are on request_rec->headers_(in|out); changing those
  would impact a *lot* of code.

* fstat64() happens during fcntl() on sockets (it appears to be a
  Solaris implementation detail).

* memset() is called mostly from apr_pcalloc(), which in turn is
  used in too many places to yield any easy optimization opportunities.

* memchr() is called from apr_pstrndup(), used mostly in apr_uri_parse()
  and mod_mime's analyze_ct().  It may be possible to optimize away
  some of the calls in the latter.
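
As a rough illustration of the stat+open note above, here are the two
orderings side by side in plain POSIX calls (a sketch only; the real
httpd path goes through APR and directory_walk):

    #include <sys/stat.h>
    #include <fcntl.h>
    #include <unistd.h>

    /* stat+open: the path is resolved twice by the kernel */
    static int serve_stat_then_open(const char *path, struct stat *st)
    {
        if (stat(path, st) == -1)       /* first path lookup */
            return -1;
        return open(path, O_RDONLY);    /* second path lookup */
    }

    /* open+fstat: one path lookup; fstat works on the descriptor */
    static int serve_open_then_fstat(const char *path, struct stat *st)
    {
        int fd = open(path, O_RDONLY);  /* single path lookup */
        if (fd == -1)
            return -1;
        if (fstat(fd, st) == -1) {      /* no path resolution at all */
            close(fd);
            return -1;
        }
        return fd;
    }

The 200-vs-304 caveat is that open+fstat pays for a full open even when
a conditional GET will be answered with 304 and the file never sent.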

--Brian



Re: remaining CPU bottlenecks in 2.0

Posted by Aaron Bannert <aa...@clove.org>.
On Wed, Sep 05, 2001 at 08:40:24PM -0700, Justin Erenkrantz wrote:
> On Wed, Sep 05, 2001 at 08:34:30PM -0700, Ian Holsman wrote:
> > I got 1 more question about the solaris implementation
> > of the Threaded/Worker MPM.
> > 
> > 
> > should we be calling setconcurrency on startup?
> > I know solaris figures it out along the way, but a bit of gentle
> > prodding never hurt.
> 
> Aaron and I went back and forth on this.  He was of the mind that
> the user scheduler should figure it out after a period of time
> (and that we shouldn't muck with it).

Actually, I seem to recall suggesting that we add something like
apr_thread_setconcurrency() that attempts to do the right thing if the
platform supports it (and if it doesn't, it's probably no big deal),
and is otherwise just a noop. This may help solaris adapt quicker to a
fast ramp-up of userspace-to-LWP thread mapping, especially when there
are many compute-bound threads that are initially unwilling to yield. I
have a feeling this may be more important on SMP boxes, where threads
should be migrated to become bound threads as soon as possible.
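
To make that concrete, here's a minimal sketch of the wrapper I mean
(hypothetical: no such call exists in APR today, and the
HAVE_PTHREAD_SETCONCURRENCY guard is an assumed configure symbol):

    #include "apr.h"
    #include "apr_errno.h"
    #if APR_HAS_THREADS && defined(HAVE_PTHREAD_SETCONCURRENCY)
    #include <pthread.h>
    #endif

    /* Hint the thread library about the desired concurrency level
     * where the platform supports it; quietly do nothing elsewhere. */
    apr_status_t apr_thread_setconcurrency(int new_level)
    {
    #if APR_HAS_THREADS && defined(HAVE_PTHREAD_SETCONCURRENCY)
        int rv = pthread_setconcurrency(new_level);
        return (rv == 0) ? APR_SUCCESS : rv;
    #else
        (void)new_level;   /* noop: nothing useful to map this to */
        return APR_SUCCESS;
    #endif
    }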


> The badness happens in testthread.c - Solaris executes that code
> in serial since the user scheduler never has a chance to 
> interrupt and balance the threads.  But, in the "real" code, 
> Solaris balances it since we hit syscalls and mutexes quite often.
> 
> An equivalent to calling pthread_setconcurrency is to use
> /usr/lib/lwp/libthread.so instead of /usr/lib/libthread.so.

ISTR that this will cause every thread to become a bound thread, effectively
eliminating the [theoretical] benefits of a partial userspace thread impl.

-aaron

Re: remaining CPU bottlenecks in 2.0

Posted by Justin Erenkrantz <je...@ebuilt.com>.
On Wed, Sep 05, 2001 at 08:34:30PM -0700, Ian Holsman wrote:
> I got 1 more question about the solaris implementation
> of the Threaded/Worker MPM.
> 
> 
> should we be calling setconcurrency on startup?
> I know solaris figures it out along the way, but a bit of gentle
> prodding never hurt.

Aaron and I went back and forth on this.  He was of the mind that
the user scheduler should figure it out after a period of time
(and that we shouldn't muck with it).

The badness happens in testthread.c - Solaris executes that code
in serial since the user scheduler never has a chance to 
interrupt and balance the threads.  But, in the "real" code, 
Solaris balances it since we hit syscalls and mutexes quite often.

An equivalent to calling pthread_setconcurrency is to use
/usr/lib/lwp/libthread.so instead of /usr/lib/libthread.so.
-- justin


Re: remaining CPU bottlenecks in 2.0

Posted by Ian Holsman <ia...@cnet.com>.
I got 1 more question about the solaris implementation
of the Threaded/Worker MPM.


should we be calling setconcurrency on startup?
I know solaris figures it out along the way, but a bit of gentle
prodding never hurt.

..Ian
Brian Pane wrote:

> Justin Erenkrantz wrote:
> 
> [...]
> 
>>> * The discussion here covers only CPU utilization.  There are other
>>>  aspects of performance, like multiprocessor scalability, that
>>>  are independent of this data.
>>>
>>
>> Once we get the syscalls optimized (I'm reminded of Dean's attack
>> on our number of syscalls in 1.3 - I believe he went through syscall
>> by syscall trying to eliminate all of the unnecessary ones), I think 
>> the next performance point will be MP scalability (see below for
>> lock scalability on solaris).  But, we do need to see what we can do 
>> about optimizing the syscalls though...
>>
> The syscall profile for 2.0 looks close to optimal (as long as you
> disable .htaccess files and symlink checking).  We're doing a few
> too many gettimeofday calls per request, though; I'll investigate
> whether any of these can be optimized away.  (Fortunately, gettimeofday
> is cheap compared to time(2).)
> 
> [...]
> 
>>> * _lwp_mutex_unlock() gets called from pthread_mutex_unlock(),
>>>  but only from a small fraction of pthread_mutex_unlock calls.
>>>  (Can someone familiar with Solaris threading internals explain
>>>  this one?)
>>>
>>
>> The LWP scheduler may also call _lwp_mutex_unlock() implicitly -
>> the LWP scheduler is a user-space library so it gets thrown in
>> with our numbers, I bet.
>>
>> Here's some background on Solaris's implementation that I
>> think may provide some useful information as to how the locks will 
>> perform overall.  (If you spot any inconsistencies, it is
>> probably my fault...I'm going to try to explain this as best as
>> I can...)
>>
>> First off, Solaris has adaptive locks.  Depending if the owner of the 
>> lock is currently active, it will spin.  If the system sees
>> that the owner of the held lock is not currently active, it will
>> sleep (they call this an adaptive lock - it now enters a turnstile).
>>
> Ah, I guess that explains why only a small fraction of pthread_mutex_lock
> calls on Solaris seem to result in calls to lwp_mutex_lock: in the fast
> case where the lock is available, it just stays in user-mode code?
> 
> --Brian
> 




Re: remaining CPU bottlenecks in 2.0

Posted by Justin Erenkrantz <je...@ebuilt.com>.
On Wed, Sep 05, 2001 at 08:19:50PM -0700, Brian Pane wrote:
> Ah, I guess that explains why only a small fraction of pthread_mutex_lock
> calls on Solaris seem to result in calls to lwp_mutex_lock: in the fast
> case where the lock is available, it just stays in user-mode code?

Yes.  -- justin


Re: remaining CPU bottlenecks in 2.0

Posted by Brian Pane <bp...@pacbell.net>.
Justin Erenkrantz wrote:

[...]

>>* The discussion here covers only CPU utilization.  There are other
>>  aspects of performance, like multiprocessor scalability, that
>>  are independent of this data.
>>
>
>Once we get the syscalls optimized (I'm reminded of Dean's attack
>on our number of syscalls in 1.3 - I believe he went through syscall
>by syscall trying to eliminate all of the unnecessary ones), I think 
>the next performance point will be MP scalability (see below for
>lock scalability on solaris).  But, we do need to see what we can 
>do about optimizing the syscalls though...
>
The syscall profile for 2.0 looks close to optimal (as long as you
disable .htaccess files and symlink checking).  We're doing a few
too many gettimeofday calls per request, though; I'll investigate
whether any of these can be optimized away.  (Fortunately, gettimeofday
is cheap compared to time(2).)

[...]

>>* _lwp_mutex_unlock() gets called from pthread_mutex_unlock(),
>>  but only from a small fraction of pthread_mutex_unlock calls.
>>  (Can someone familiar with Solaris threading internals explain
>>  this one?)
>>
>
>The LWP scheduler may also call _lwp_mutex_unlock() implicitly -
>the LWP scheduler is a user-space library so it gets thrown in
>with our numbers, I bet.
>
>Here's some background on Solaris's implementation that I
>think may provide some useful information as to how the locks 
>will perform overall.  (If you spot any inconsistencies, it is
>probably my fault...I'm going to try to explain this as best as
>I can...)
>
>First off, Solaris has adaptive locks.  Depending if the owner of 
>the lock is currently active, it will spin.  If the system sees
>that the owner of the held lock is not currently active, it will
>sleep (they call this an adaptive lock - it now enters a turnstile).
>
Ah, I guess that explains why only a small fraction of pthread_mutex_lock
calls on Solaris seem to result in calls to lwp_mutex_lock: in the fast
case where the lock is available, it just stays in user-mode code?

--Brian



Re: remaining CPU bottlenecks in 2.0

Posted by Brian Pane <bp...@pacbell.net>.
Greg Stein wrote:

>On Tue, Sep 04, 2001 at 11:56:48PM -0500, William A. Rowe, Jr. wrote:
>
>>...
>>You were discussing the possibility of parsing for <!--# as a skip by 5.
>>
>>Consider jumping to a 4 byte alignment, truncating to char and skip by
>>dwords.  E.g., you only have to test for three values, not four, and you
>>can use the machine's most optimal path.  But I'd ask if strstr() isn't
>>optimized on the platform, why are we reinventing it?
>>
>
>strstr() can't find strings that span a bucket boundary.
>
In addition, the most promising search algorithms for this problem
require a lookup table that contains metadata about the search pattern.
We can save time by precomputing the lookup table for "<!--#" at server
startup (or even at compile time), which is something that strstr()
can't do.
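
For illustration, the precomputed table could be a Horspool-style skip
table built once for the fixed pattern (a sketch, not the mod_include
code under discussion):

    #include <limits.h>
    #include <stddef.h>
    #include <string.h>

    #define START_SEQ     "<!--#"
    #define START_SEQ_LEN 5

    /* For each byte value, how far the scan may safely advance. */
    static unsigned char skip[UCHAR_MAX + 1];

    static void build_skip_table(void)   /* run once at startup */
    {
        int i;
        for (i = 0; i <= UCHAR_MAX; i++)
            skip[i] = START_SEQ_LEN;     /* byte absent from pattern */
        for (i = 0; i < START_SEQ_LEN - 1; i++)
            skip[(unsigned char)START_SEQ[i]] = START_SEQ_LEN - 1 - i;
    }

    /* Horspool scan: examine the byte under the pattern's last
     * position and jump by its skip distance; most input bytes are
     * never looked at individually. */
    static const char *find_start(const char *buf, size_t len)
    {
        size_t pos = 0;
        while (pos + START_SEQ_LEN <= len) {
            if (memcmp(buf + pos, START_SEQ, START_SEQ_LEN) == 0)
                return buf + pos;
            pos += skip[(unsigned char)buf[pos + START_SEQ_LEN - 1]];
        }
        return NULL;
    }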

--Brian




Re: remaining CPU bottlenecks in 2.0

Posted by Greg Stein <gs...@lyra.org>.
On Tue, Sep 04, 2001 at 11:56:48PM -0500, William A. Rowe, Jr. wrote:
>...
> You were discussing the possibility of parsing for <!--# as a skip by 5.
> 
> Consider jumping to a 4 byte alignment, truncating to char and skip by
> dwords.  E.g., you only have to test for three values, not four, and you
> can use the machine's most optimal path.  But I'd ask if strstr() isn't
> optimized on the platform, why are we reinventing it?

strstr() can't find strings that span a bucket boundary.

Cheers,
-g

-- 
Greg Stein, http://www.lyra.org/

Re: remaining CPU bottlenecks in 2.0

Posted by "William A. Rowe, Jr." <wr...@rowe-clan.net>.
From: "Justin Erenkrantz" <je...@ebuilt.com>
Sent: Tuesday, September 04, 2001 11:46 PM


> Based on the patches you submitted (and my quasi-errant formatting
> patch), I had to read most of the code in mod_include, so I'm more 
> familiar with mod_include now.  I do think there are some obvious 
> ways to optimize find_start_sequence.  I wonder if we could apply 
> a KMP string-matching algorithm here.  I dunno.  I'll take a look 
> at it though.  Something bugs me about the restarts.  I bet that 
> we spend even more time in find_start_sequence when an HTML file 
> has lots of comments.  =-)

You were discussing the possibility of parsing for <!--# as a skip by 5.

Consider jumping to a 4 byte alignment, truncating to char and skip by
dwords.  E.g., you only have to test for three values, not four, and you
can use the machine's most optimal path.  But I'd ask if strstr() isn't
optimized on the platform, why are we reinventing it?

This is DSS for little endian (that char bugger comes from the first byte,
so skipping in 0-3 is not an issue), but for big endian architectures you
need to backspace to the dword alignment so you don't miss the tag by
skipping up to 6 (wrong by 3, and then reading the fourth byte of the dword).

That has got to be your most optimal search pattern.
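
For what it's worth, here is one endian-neutral way to code the
dword-at-a-time scan (a sketch using the well-known "zero byte in a
word" bit trick, not the skip-by-5 scheme itself):

    #include <stdint.h>
    #include <stddef.h>

    /* (w - 0x01010101) & ~w & 0x80808080 is nonzero iff some byte of
     * w is zero, so XOR the word with '<' repeated to test four
     * bytes per load. */
    static const char *find_lt(const char *p, const char *end)
    {
        const uint32_t ones  = 0x01010101u;
        const uint32_t highs = 0x80808080u;
        const uint32_t pat   = (uint32_t)'<' * ones;

        /* walk byte-wise up to a 4-byte boundary */
        while (p < end && ((uintptr_t)p & 3u)) {
            if (*p == '<') return p;
            p++;
        }
        /* dword-at-a-time scan (aliasing char* as uint32_t* here,
         * the usual dirty trick in this kind of code) */
        while (p + 4 <= end) {
            uint32_t w = *(const uint32_t *)p ^ pat;
            if ((w - ones) & ~w & highs)
                break;              /* a '<' is in these 4 bytes */
            p += 4;
        }
        /* finish byte-wise, including the word that matched */
        while (p < end) {
            if (*p == '<') return p;
            p++;
        }
        return NULL;
    }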

Bill


mod_include.c WAS: RE: remaining CPU bottlenecks in 2.0

Posted by Sander Striker <st...@apache.org>.
>> * find_start_sequence() is the main scanning function within
>>   mod_include.  There's some research in progress to try to speed
>>   this up significantly.
> 
> Based on the patches you submitted (and my quasi-errant formatting
> patch), I had to read most of the code in mod_include, so I'm more 
> familiar with mod_include now.  I do think there are some obvious 
> ways to optimize find_start_sequence.  I wonder if we could apply 
> a KMP string-matching algorithm here.  I dunno.  I'll take a look 
> at it though.  Something bugs me about the restarts.  I bet that 
> we spend even more time in find_start_sequence when an HTML file 
> has lots of comments.  =-)

I suggested to Ian yesterday night that I'd look into a Boyer-Moore
matching algorithm.  I'll work on that too.

Sander


Re: remaining CPU bottlenecks in 2.0

Posted by Justin Erenkrantz <je...@ebuilt.com>.
On Tue, Sep 04, 2001 at 08:00:35PM -0700, Brian Pane wrote:
> I'm currently studying profiling data from an httpd built from
> a CVS snapshot earlier today.
> 
> In general, the performance of 2.0 is starting to look good.

Cool.  This probably means the code is starting to look good, too.

> * The discussion here covers only CPU utilization.  There are other
>   aspects of performance, like multiprocessor scalability, that
>   are independent of this data.

Once we get the syscalls optimized (I'm reminded of Dean's attack
on our number of syscalls in 1.3 - I believe he went through syscall
by syscall trying to eliminate all of the unnecessary ones), I think 
the next performance point will be MP scalability (see below for
lock scalability on solaris).  But, we do need to see what we can 
do about optimizing the syscalls though...

> * find_start_sequence() is the main scanning function within
>   mod_include.  There's some research in progress to try to speed
>   this up significantly.

Based on the patches you submitted (and my quasi-errant formatting
patch), I had to read most of the code in mod_include, so I'm more 
familiar with mod_include now.  I do think there are some obvious 
ways to optimize find_start_sequence.  I wonder if we could apply 
a KMP string-matching algorithm here.  I dunno.  I'll take a look 
at it though.  Something bugs me about the restarts.  I bet that 
we spend even more time in find_start_sequence when an HTML file 
has lots of comments.  =-)

> * strlen() is called from lots of places throughout the code, with
>   the most frequent calls being from apr_pstrdup, apr_pstrcat, and
>   time-formatting functions used in apr_rfc822_date.

I think someone has brought up that apr_pstrdup does an extra strlen.
I'll have to review that code.
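
The redundancy, if it's there, would look like this at a call site
that already knows the length (hypothetical helper, just to show the
shape of a fix):

    #include <string.h>
    #include "apr_pools.h"

    /* When the caller has already computed the length, skip
     * apr_pstrdup's internal strlen and copy directly. */
    static char *pstrdup_known_len(apr_pool_t *p, const char *s,
                                   apr_size_t len)
    {
        char *copy = apr_palloc(p, len + 1);
        memcpy(copy, s, len);
        copy[len] = '\0';
        return copy;
    }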

> * _lwp_mutex_unlock() gets called from pthread_mutex_unlock(),
>   but only from a small fraction of pthread_mutex_unlock calls.
>   (Can someone familiar with Solaris threading internals explain
>   this one?)

The LWP scheduler may also call _lwp_mutex_unlock() implicitly -
the LWP scheduler is a user-space library so it gets thrown in
with our numbers, I bet.

Here's some background on Solaris's implementation that I
think may provide some useful information as to how the locks 
will perform overall.  (If you spot any inconsistencies, it is
probably my fault...I'm going to try to explain this as best as
I can...)

First off, Solaris has adaptive locks.  If the owner of the
lock is currently active, a waiter will spin.  If the system sees
that the owner of the held lock is not currently active, the waiter
will sleep (they call this an adaptive lock - it now enters a
turnstile).

Okay, so what happens when a mutex unlocks?  This depends on
whether it is a spin or an adaptive lock.  Spinning waiters
immediately see the freed lock, and the first one on a CPU
grabs it (excluding priority inversion here).

But, the more interesting issue is what happens when we are in
an adaptive lock.  According to Mauro, Solaris 7+ has a thundering 
herd condition for adaptive kernel locks.  It will wake up *all* 
waiters and lets them fight it out.  This was a difference from 
Solaris 2.5.1 and 2.6.  

Since you should never have a lot of threads sitting in a mutex 
(according to Sun, it is typical in practice to only have one 
kernel thread waiting), this thundering herd is okay and 
actually performs better than freeing the lock and only waking 
up one waiter.  They say it makes the code much cleaner.  I'm
betting we're overusing the mutexes which changes the equation
considerably.  =-)

Okay, that tells you how it is done in kernel-space.  For
user-space (see below how to remove user-space threading), 
it is slightly different.  Remember in Solaris, we have
a two-tier thread-model - user-space and kernel threads.

In user-space, when we call pthread_mutex_unlock, we hit 
_mutex_unlock in liblwp, which calls mutex_unlock_adaptive 
as we aren't a special case lock.

So, what does mutex_unlock_adaptive do in lwp?  In pseudocode (as
best as I can explain it), it first tries to see if there is a
waiter in the current LWP; if so, it clears the lock bit and the
other thread in the same LWP takes it.  If no thread in the same
LWP is waiting, it will then "sleep" for ~500 while-loop
iterations to let another thread in the same LWP take the lock.
(On a UP box, it'll exit here before doing the while loop, as it
knows that spinlocks are stupid.  I think you were testing on an
MP box.)  If the while loop concludes without anyone acquiring
the lock, it sends it to the kernel, as no one in this LWP cares
for that lock (and we hit the semantics described above).
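
In C-flavored pseudocode, that unlock path looks roughly like this
(a paraphrase of the description above, with every helper name
invented for illustration - this is not the Solaris source):

    typedef struct adaptive_lock adaptive_lock_t;   /* opaque */
    extern int  waiter_in_this_lwp(adaptive_lock_t *lk);
    extern void clear_lock_bit(adaptive_lock_t *lk);
    extern int  multiprocessor(void);
    extern int  lock_taken(adaptive_lock_t *lk);
    extern void lwp_mutex_unlock_syscall(adaptive_lock_t *lk);

    static void mutex_unlock_adaptive_sketch(adaptive_lock_t *lk)
    {
        if (waiter_in_this_lwp(lk)) {
            clear_lock_bit(lk);  /* a sibling thread on this LWP takes it */
            return;
        }
        clear_lock_bit(lk);
        if (multiprocessor()) {
            int i;
            for (i = 0; i < 500; i++) {  /* brief spin: a sibling may grab it */
                if (lock_taken(lk))
                    return;
            }
        }
        lwp_mutex_unlock_syscall(lk);    /* no local taker: kernel wakes waiters */
    }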

I'm wondering if this while loop iteration (essentially a spin
lock) may be the root cause of the lwp_mutex_unlock utilization 
you are seeing.

Anyway, in Solaris 9, IIRC, they have removed the userspace 
scheduler and all threads are now bound (all threads map 
directly to kernel threads).  You may achieve the same result 
in earlier versions of Solaris by linking against 
/usr/lib/lwp/libthread.so - which is also configurable by 
changing your run-time link order.  

You will definitely see different performance characteristics
on Solaris with the two libthread.so's.  I'd encourage you
to test with bound threads.  =-)

For more Solaris-specific information, I would recommend getting 
both Solaris Internals by Jim Mauro and Richard McDougall (both
are Sun Senior Engineers in the Performance Applications 
Engineering group - aka kernel freaks) and the Solaris source 
code - both are good references.  I'm looking at:

s8_foundation_src_en/osnet_volume/usr/src/lib/liblwp/common/synch.c
s8_foundation_src_en/osnet_volume/usr/src/uts/common/os/mutex.c
s8_foundation_src_en/osnet_volume/usr/src/uts/common/syscall/lwp_sobj.c

and Chapter 3 in Solaris Internals.

If anyone is interested, I can email the mutex.c to them as it is 
a very interesting read and goes into their decisions as to why 
they wake all waiters with mutex_exit.  (It's big and I'm not 
sure what the copyright restrictions are exactly...)

HTH.  -- justin


Re: remaining CPU bottlenecks in 2.0

Posted by Brian Pane <bp...@pacbell.net>.
dean gaudet wrote:

[...]

>>* memset() is called mostly from apr_pcalloc(), which in turn is
>>  used in too many places to yield any easy optimization opportunities.
>>
>
>sometimes folks are lazy and ask for zeroed memory out of habit, when they
>could easily deal with garbage at less cost.
>
Some good news here: I dug through the profile data and found that over half
of the httpd's calls to apr_pcalloc come from one function, find_entry in the
apr_hash implementation.  It uses apr_pcalloc to allocate a struct with
5 fields, and then it immediately overwrites 4 of those fields.  I'll submit
a patch to the APR dev list to fix this.
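
The shape of the fix, with illustrative field names (the real struct
lives in apr_hash.c; don't trust these names):

    #include "apr_pools.h"

    typedef struct entry {
        struct entry *next;
        unsigned int  hash;
        const void   *key;
        apr_ssize_t   klen;
        const void   *val;
    } entry;

    /* Before: apr_pcalloc zeroes all five fields, and then the
     * caller immediately overwrites four of them. */
    static entry *make_entry_pcalloc(apr_pool_t *p)
    {
        return apr_pcalloc(p, sizeof(entry));
    }

    /* After: plain apr_palloc plus explicit initialization of every
     * field, so no byte is written twice. */
    static entry *make_entry_palloc(apr_pool_t *p, unsigned int hash,
                                    const void *key, apr_ssize_t klen,
                                    const void *val)
    {
        entry *e = apr_palloc(p, sizeof(entry));
        e->next = NULL;   /* the only field pcalloc really needed to zero */
        e->hash = hash;
        e->key  = key;
        e->klen = klen;
        e->val  = val;
        return e;
    }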

--Brian



Re: remaining CPU bottlenecks in 2.0

Posted by "William A. Rowe, Jr." <wr...@rowe-clan.net>.
From: "dean gaudet" <de...@arctic.org>
Sent: Wednesday, September 05, 2001 9:17 PM


> On Tue, 4 Sep 2001, Brian Pane wrote:
> 
> > * memset() is called mostly from apr_pcalloc(), which in turn is
> >   used in too many places to yield any easy optimization opportunities.
> 
> sometimes folks are lazy and ask for zeroed memory out of habit, when they
> could easily deal with garbage at less cost.

True.  Or folks are being pedantic and condensing a dozen NULL
assignments into a single call immediately after an alloc - I'd suggest
those cases become a commented block just for clarity, and we skip
initializing long lists of NULL values.  Use pcalloc where the majority
of the allocation is zeroed, and palloc everywhere else.


Re: remaining CPU bottlenecks in 2.0

Posted by Brian Pane <bp...@pacbell.net>.
dean gaudet wrote:

>On Tue, 4 Sep 2001, Brian Pane wrote:
>
>>* Collectively, stat and open comprise 5% of the total CPU time.
>>  It would be faster to do open+fstat rather than stat+open (as
>>  long as the server is delivering mostly 200s rather than 304s),
>>  but that might be too radical a change.  Anybody have thoughts
>>  on this?
>>
>
>linux' dcache makes either method about the same cost, but on other
>kernels, and especially when NFS is involved (because of how NFS path
>resolution occurs), the open/fstat is much better.  the trick would be to
>cache open filehandles so that fstat can be used.
>
I realized that it's now possible to do open+fstat in 2.0 even without
changes to the core: just create a custom module consisting of a
map_to_storage function that bypasses directory_walk() and a handler
that does apr_file_open() and apr_file_info_get().  I tried a prototype
this morning and found that it "does the right thing": no stats, just
open followed by fstat.

This approach isn't suitable for all users, of course.  It's not
very useful for people who depend on directory_walk(). :-)  But
for sites delivering a lot of traffic from NFS filesystems, using this
technique in a custom module could substantially reduce NFS server load.
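
A rough sketch of the handler half of such a module (illustrative
only - the real thing needs content-type handling, conditional GETs,
and proper errors; the hook registration is omitted):

    #include "httpd.h"
    #include "http_protocol.h"
    #include "apr_file_io.h"

    /* Open first, then fstat the descriptor, so the filesystem path
     * is resolved exactly once. */
    static int openfirst_handler(request_rec *r)
    {
        apr_file_t *fd;
        apr_finfo_t finfo;

        if (apr_file_open(&fd, r->filename, APR_READ,
                          APR_OS_DEFAULT, r->pool) != APR_SUCCESS)
            return HTTP_NOT_FOUND;

        /* fstat on the open descriptor - no second path lookup */
        if (apr_file_info_get(&finfo, APR_FINFO_NORM, fd) != APR_SUCCESS)
            return HTTP_INTERNAL_SERVER_ERROR;

        ap_set_content_length(r, finfo.size);
        /* ...send the file (e.g. as a file bucket down the output
         * filter chain) and return OK... */
        return OK;
    }
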
--Brian



Re: remaining CPU bottlenecks in 2.0

Posted by dean gaudet <de...@arctic.org>.
On Tue, 4 Sep 2001, Brian Pane wrote:

> * Collectively, stat and open comprise 5% of the total CPU time.
>   It would be faster to do open+fstat rather than stat+open (as
>   long as the server is delivering mostly 200s rather than 304s),
>   but that might be too radical a change.  Anybody have thoughts
>   on this?

linux' dcache makes either method about the same cost, but on other
kernels, and especially when NFS is involved (because of how NFS path
resolution occurs), the open/fstat is much better.  the trick would be to
cache open filehandles so that fstat can be used.

> * memset() is called mostly from apr_pcalloc(), which in turn is
>   used in too many places to yield any easy optimization opportunities.

sometimes folks are lazy and ask for zeroed memory out of habit, when they
could easily deal with garbage at less cost.

-dean