Posted to dev@httpd.apache.org by Dirk-Willem van Gulik <di...@covalent.net> on 2001/07/24 19:43:09 UTC

OSDLab project: Accept() locking on large (6+ CPU) boxes

Folks,

Some time ago Sander Temme did some tests to certify Covalent's Apache/SSL
product as part of the SunTone program.

Part of that entails running it on anything from a tiny Sun Ultra T1 all
the way up to 8-way big iron.

We found that, for purely static content, the default locking mechanisms
were far from ideal; in fact, we had a hard time saturating machines,
from small to large, with a single configuration across the whole range
of Sun machines.

After playing with some locking strategies, we now suspect that this
might not actually be Sun-specific, but a general multiprocessor (6-way
and above) issue.

So we've proposed a small project to the Open Software Developer Lab
(www.osdlab.org) to do some careful measuring and tuning across a
_range_ of 1 CPU .. 16 CPU boxes with different locking mechanisms (on
Linux in their case). We expect to find that some mechanisms are good
for a few CPUs and some for a lot of CPUs. (And we will obviously
experiment with various types of loads, client RTTs, etc., to
understand what is happening.)

In any case, we'll report back here on the list what we find, hopefully
backed by a patch if we can make Apache do better.

If you have any suggestions or comments, send them either to this list
or to Sander <sc...@covalent.net> or me - and we'll make sure they get
summarized.

If they turn out to be large questions ... well, then we'll just need to
ask the OSDLab for more time or propose a new project.

But we wanted to keep our first project as simple as possible - and to
keep its scope small and its timeline short.

Dw.



Re: OSDLab project: Accept() locking on large (6+ CPU) boxes

Posted by Ben Hyde <bh...@pobox.com>.
Dirk-Willem van Gulik <di...@covalent.net> writes:
> So we've proposed a small project to the Open Software Developer Lab
> (www.osdlab.org) to do some careful measuring and tuning across

That's neat and good luck. - ben

-- 
http://www.cozy.org/bhyde.html

Re: OSDLab project: Accept() locking on large (6+ CPU) boxes

Posted by Aaron Bannert <aa...@ebuilt.com>.
On Tue, Jul 24, 2001 at 11:54:37AM -0700, Justin Erenkrantz wrote:
> On Tue, Jul 24, 2001 at 11:32:25AM -0700, Brian Pane wrote:
> > Are you able to observe this effect experimentally too?  E.g., if
> > you run the threaded MPM on Solaris, does it use just one LWP per
> > process?
> 
> Not with the threaded MPM, because it uses blocking system calls, which
> allow Solaris's scheduler to jump in and rebalance.  Initially, it'll be
> one LWP, but after some load hits, it should spawn multiple LWPs and
> rebalance the threads accordingly.
> 
> However, try testthread on Solaris (with an MP box) and it'll execute
> all of the threads serially rather than in parallel.  This is what drove
> me crazy last night - forcing me to comb through the manpages until I
> hit upon the pthread_setconcurrency call.
> 
> By leveraging the pthread_setconcurrency call, the threads are balanced
> across LWPs immediately, rather than waiting for each thread to hit a
> blocking system call or yield.  -- justin

That's not necessarily true. According to the man page (on Solaris),
by default the OS will only ensure that a sufficient number of threads
(aka LWPs) are active for the process to continue to make progress.
Who knows what they mean by "can continue to make progress", but to me
this means the most conservative case -- basically act like a purely
userspace implementation and just multiplex the one main thread.

Anyway, pthread_setconcurrency() is just a hint to the OS as to the level
of concurrency you wish to have; there is no guarantee that it will
actually assign that many active threads (LWPs) to your userspace threads.
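
For what it's worth, the call itself is trivial. A minimal sketch (the
level of 8 here is an arbitrary example, not a tuned value; on Solaris,
link with -lpthread):

    #include <pthread.h>
    #include <stdio.h>
    #include <string.h>

    int main(void)
    {
        /* Hint that we'd like up to 8 LWPs backing our unbound threads.
         * This is advisory only; the library may give us more or fewer. */
        int rc = pthread_setconcurrency(8);
        if (rc != 0)
            fprintf(stderr, "pthread_setconcurrency: %s\n", strerror(rc));

        printf("concurrency level is now %d\n", pthread_getconcurrency());
        return 0;
    }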

Does anyone know of a good place to read about the dirty details of the
Solaris M-to-N (aka two-level) thread implementation?

-aaron


Re: OSDLab project: Accept() locking on large (6+ CPU) boxes

Posted by Justin Erenkrantz <je...@ebuilt.com>.
On Tue, Jul 24, 2001 at 11:32:25AM -0700, Brian Pane wrote:
> Are you able to observe this effect experimentally too?  E.g., if
> you run the threaded MPM on Solaris, does it use just one LWP per
> process?

Not with the threaded MPM, because it uses blocking system calls, which
allow Solaris's scheduler to jump in and rebalance.  Initially, it'll be
one LWP, but after some load hits, it should spawn multiple LWPs and
rebalance the threads accordingly.

However, try testthread on Solaris (with an MP box) and it'll execute
all of the threads serially rather than in parallel.  This is what drove
me crazy last night - forcing me to comb through the manpages until I
hit upon the pthread_setconcurrency call.

By leveraging the pthread_setconcurrency call, the threads are balanced
across LWPs immediately, rather than waiting for each thread to hit a
blocking system call or yield.  -- justin


Re: OSDLab project: Accept() locking on large (6+ CPU) boxes

Posted by Jeff Trawick <tr...@attglobal.net>.
Aaron Bannert <aa...@ebuilt.com> writes:

> The hard part about this (at least for me) is the multitude of variables
> that go into deciding the implementation of the accept() mutex. Is there
> a way I can find out (at runtime) what implementation is being used?

Put "AcceptMutex foo" in your apache configuration file.

Either "foo" is the mechanism or Apache will fail to initialize.

(If you really do put "AcceptMutex foo" literally in your config file,
the resulting error message will display suitable replacements for
"foo" on your system :) )

-- 
Jeff Trawick | trawick@attglobal.net | PGP public key at web site:
       http://www.geocities.com/SiliconValley/Park/9289/
             Born in Roswell... married an alien...

Re: OSDLab project: Accept() locking on large (6+ CPU) boxes

Posted by Bill Stoddard <bi...@wstoddard.com>.

> The hard part about this (at least for me) is the multitude of variables
> that go into deciding the implementation of the accept() mutex. Is there
> a way I can find out (at runtime) what implementation is being used?
> 

httpd  -V
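
(The exact output depends on the build, but the line to look for is the
serialized-accept define; e.g. a Solaris build might show something like

     -D USE_SYSVSEM_SERIALIZED_ACCEPT

which is the compiled-in default. The AcceptMutex directive mentioned
elsewhere in this thread can override it at runtime.)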

Bill


Re: OSDLab project: Accept() locking on large (6+ CPU) boxes

Posted by Brian Pane <bp...@pacbell.net>.
Aaron Bannert wrote:

[...]

>The hard part about this (at least for me) is the multitude of variables
>that go into deciding the implementation of the accept() mutex. Is there
>a way I can find out (at runtime) what implementation is being used?
>
If you're willing to use external instrumentation (as opposed to having
the httpd deduce the mechanism by itself at runtime), you can figure out
the accept serialization method with truss.  (The -l arg to truss, if I
remember correctly, will show the LWP ID for each syscall, so you can
look for the last syscall that each LWP does before accept.)
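
Something along these lines, assuming you can grab the pid of a busy
httpd child:

    truss -l -p <pid> 2>&1 | egrep 'accept|fcntl|sema|lwp'

(truss writes to stderr, hence the redirect.)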

--Brian



Re: OSDLab project: Accept() locking on large (6+ CPU) boxes

Posted by Aaron Bannert <aa...@ebuilt.com>.
On Tue, Jul 24, 2001 at 11:32:25AM -0700, Brian Pane wrote:
> Aaron Bannert wrote:
> 
> >[...]
>
> Are you able to observe this effect experimentally too?  E.g., if
> you run the threaded MPM on Solaris, does it use just one LWP per
> process?

I have observed this under conditions where the scheduling was multiplexed
with a pthread_mutex (and only on a uniprocessor machine), but AFAIK
we don't always use a pthread_mutex to multiplex access to the accept()
call (i.e. we don't have to use one when there is no "thundering herd"
problem).  ISTR apache-1.3 used fcntl() as the global locking mechanism
that serialized calls around a "thundering herd" accept() call. All
of these issues, in my mind, could alter the way the OS decides to
schedule blocked LWPs that receive an event.
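
For reference, the 1.3-style pattern looks roughly like this - a
simplified sketch, not the actual httpd code (the real thing hides the
mechanism behind accept_mutex_on/off and supports several lock types):

    #include <errno.h>
    #include <fcntl.h>
    #include <unistd.h>
    #include <sys/socket.h>

    static int lock_fd;  /* fd on a lock file shared by all children */

    static void accept_mutex_on(void)
    {
        struct flock l = { 0 };
        l.l_type = F_WRLCK;
        l.l_whence = SEEK_SET;  /* l_start = l_len = 0: whole file */
        /* Block until we own the cross-process lock; retry on signals. */
        while (fcntl(lock_fd, F_SETLKW, &l) < 0 && errno == EINTR)
            continue;
    }

    static void accept_mutex_off(void)
    {
        struct flock l = { 0 };
        l.l_type = F_UNLCK;
        l.l_whence = SEEK_SET;
        fcntl(lock_fd, F_SETLKW, &l);
    }

    /* Each child: only the current lock holder sleeps in accept(), so
     * the kernel never wakes a thundering herd of waiting processes. */
    void child_main(int listen_fd)
    {
        for (;;) {
            int conn;
            accept_mutex_on();
            conn = accept(listen_fd, NULL, NULL);
            accept_mutex_off();
            if (conn >= 0) {
                /* ... handle the connection, then ... */
                close(conn);
            }
        }
    }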

I will try to do some tests later today on our 2-way Solaris machines
to see if the CPU usage seems to favor one process over another, and
then I'll observe the usage under the same load with
pthread_setconcurrency() stuck in there somewhere.

The hard part about this (at least for me) is the multitude of variables
that go into deciding the implementation of the accept() mutex. Is there
a way I can find out (at runtime) what implementation is being used?

-aaron


Re: OSDLab project: Accept() locking on large (6+ CPU) boxes

Posted by Brian Pane <bp...@pacbell.net>.
Aaron Bannert wrote:

>Hi Dirk,
>
>I realize that the OSDlab is [primarily] Linux machines, but something
>that you might want to look into on Solaris with larger-parallel
>machines is the pthread_setconcurrency() call, which AFAIK is not
>being called anywhere in httpd or apr.
>
>I quoth from the (Solaris 5.8) man page:
>
>     Unbound threads in a process may or may not be  required  to
>     be  simultaneously active. By default, the threads implemen-
>     tation ensures that  a  sufficient  number  of  threads  are
>     active  so  that  the process can continue to make progress.
>     While this conserves system resources, it  may  not  produce
>     the most effective level of concurrency.
>
>
>My interpretation of this is that by default on Solaris, pthreads remain
>a wholly userspace entity (i.e. they multiplex one LWP) unless we give a
>hint to the OS for how many LWPs we'd like to be able to assign to our
>userspace threads.
>
Are you able to observe this effect experimentally too?  E.g., if
you run the threaded MPM on Solaris, does it use just one LWP per
process?
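
(One quick way to check, assuming your ps supports the nlwp field, as
Solaris's /usr/bin/ps does:

    ps -o pid,nlwp,comm -p <pid of an httpd child>

- a count of 1 in the NLWP column would confirm the single-LWP theory.)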

--Brian



Re: OSDLab project: Accept() locking on large (6+ CPU) boxes

Posted by Aaron Bannert <aa...@ebuilt.com>.
Hi Dirk,

I realize that the OSDlab is [primarily] Linux machines, but something
that you might want to look into on Solaris with larger-parallel
machines is the pthread_setconcurrency() call, which AFAIK is not
being called anywhere in httpd or apr.

I quoth from the (Solaris 5.8) man page:

     Unbound threads in a process may or may not be  required  to
     be  simultaneously active. By default, the threads implemen-
     tation ensures that  a  sufficient  number  of  threads  are
     active  so  that  the process can continue to make progress.
     While this conserves system resources, it  may  not  produce
     the most effective level of concurrency.


My interpretation of this is that by default on Solaris, pthreads remain
a wholly userspace entity (i.e. they multiplex one LWP) unless we give a
hint to the OS for how many LWPs we'd like to be able to assign to our
userspace threads.

Ideally, the accept()s are happening randomly across processes in the
threaded MPM, but it is entirely possible that a single process is having
to take the burden of a large number of accept()s while other processes
sit idle. Given the default Solaris behavior described above (absent a
pthread_setconcurrency() hint), that would mean that a single CPU is then
burdened with the processing requirements of that one overloaded
threaded-MPM process, while other CPUs sit around and yawn.

For more discussion of this, please refer to W. Richard Stevens'
"UNIX Network Programming, Vol. 2: Interprocess Communications".

Thanks to Justin for bringing this up last night during his coding binge.

-aaron


On Tue, Jul 24, 2001 at 10:43:09AM -0700, Dirk-Willem van Gulik wrote:
> [...]


Re: OSDLab project: Accept() locking on large (6+ CPU) boxes

Posted by Dirk-Willem van Gulik <di...@covalent.net>.

On 28 Jul 2001, Ian Holsman wrote:

> If you are interested, I could see if we could lend you a 8-way box
> running Solaris 5.8, with a web-avalanche box to generate load sitting
> next to it.

That would be lovely, as the OSDL folks are rather Linux-based - and that
is an understatement...

> the only caveat would be that your guys would have to sit in our offices
> (near fisherman's wharf in SF)

No trouble; we are on Howard between 2nd and 3rd - and even then, this
sort of stuff always warrants some travel.

I'll coordinate with Sander Temme and we'll contact you off-list.

Thanks again,

Dw


Re: OSDLab project: Accept() locking on large (6+ CPU) boxes

Posted by Ian Holsman <ia...@cnet.com>.
Hi Dirk.
If you are interested, I could see if we could lend you an 8-way box
running Solaris 5.8, with a web-avalanche box to generate load sitting
next to it.

The only caveat would be that your guys would have to sit in our offices
(near Fisherman's Wharf in SF).

If you're interested, I'll see if I can get it going on our end.

..Ian

On 24 Jul 2001 16:38:46 -0700, Dirk-Willem van Gulik wrote:
> [...]
-- 
Ian Holsman
Performance Measurement & Analysis
CNET Networks    -    415 364-8608


Re: OSDLab project: Accept() locking on large (6+ CPU) boxes

Posted by Dirk-Willem van Gulik <di...@covalent.net>.

On Tue, 24 Jul 2001, Bill Stoddard wrote:

> Some folks in the WebSphere performance team did some benchmarking on
> machines from Sun all the way up to 8-ways. Victor was feeding them
> Apache builds to play with. We tried the following accept locks:
> fcntl, native Solaris, sysv, pthread.  fcntl was slowest on all
> machines with 4 CPUs or fewer.  fcntl was the fastest on the 8-way
> machine.

This matches what we found.

> sysv, pthread and native Solaris locks all appeared to actually
> degrade performance as CPUs were added and in the same way which leads
> me to believe that sysv, pthread and native Solaris locks are all the
> same implementation under the covers.

Same here; we looked under the covers with sar - and they certainly share
the same mechanism.
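
(For the curious: sar -m on Solaris reports semaphore activity as sema/s,
so something like

    sar -m 5 12

during a load run makes the SysV semaphore traffic quite visible.)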

> We did the same tests with AIX. I am not advertising here, but Apache
> on AIX using pthread mutexes kicks Solaris butt big time on the 8-way
> (and above) machines. Apache performance on AIX 8-way machines is MUCH
> better than on Solaris.

Though based on operational experience I have certainly found the same;
big fat AIX boxes do really, really, really well - and behave incredibly
nicely and robustly as you add processor after processor... I am not sure
that:

> Each OS has strengths and weaknesses, and it appears that Solaris is
> very weak in cross-process locking on n-way machines. So I would say
> that the problem is Sun-specific. Haven't done similar measurements
> with Linux.

is so Solaris-specific. I think there is also something fundamental in the
way we do things in Apache. But time in the lab will tell :-)

Dw


Re: OSDLab project: Accept() locking on large (6+ CPU) boxes

Posted by Bill Stoddard <bi...@wstoddard.com>.
Some folks in the WebSphere performance team did some benchmarking on machines from Sun
all the way up to 8-ways. Victor was feeding them Apache builds to play with. We tried the
following accept locks: fcntl, native Solaris, sysv, pthread.  fcntl was slowest on all
machines with 4 CPUs or fewer.  fcntl was the fastest on the 8-way machine.  sysv, pthread
and native Solaris locks all appeared to actually degrade performance as CPUs were added,
and in the same way, which leads me to believe that sysv, pthread and native Solaris locks
are all the same implementation under the covers.

We did the same tests with AIX. I am not advertising here, but Apache on AIX using pthread
mutexes kicks Solaris butt big time on the 8-way (and above) machines. Apache performance
on AIX 8-way machines is MUCH better than on Solaris. Each OS has strengths and weaknesses,
and it appears that Solaris is very weak in cross-process locking on n-way machines. So I
would say that the problem is Sun-specific. Haven't done similar measurements with Linux.

Bill

> [...]