Posted to dev@apr.apache.org by Aaron Bannert <aa...@clove.org> on 2001/09/15 00:44:48 UTC

[proposal] apr_thread_setconcurrency()

I'd like to propose we add a call that gives a hint to the OS as to
the level of concurrency we wish to have. This would mirror
pthread_setconcurrency(), and would be a simple call to that on
operating systems that have it available. On other platforms it
would be a simple no-op.
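
Something like this minimal sketch is what I'm picturing (the
HAVE_PTHREAD_SETCONCURRENCY feature-test macro is hypothetical --
whatever name configure ends up giving it):

    apr_status_t apr_thread_setconcurrency(apr_int32_t new_level)
    {
    #ifdef HAVE_PTHREAD_SETCONCURRENCY
        /* pthread functions hand back an errno value directly */
        int rv = pthread_setconcurrency((int)new_level);
        return (rv == 0) ? APR_SUCCESS : rv;
    #else
        return APR_SUCCESS;    /* simple no-op everywhere else */
    #endif
    }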

Give me some +1s and I'll submit a patch.

-aaron

Re: [proposal] apr_thread_setconcurrency()

Posted by Ryan Bloom <rb...@covalent.net>.
> If you create too many LWPs, you will lose a lot of optimizations
> that are present in Solaris (i.e. handover of a mutex to another
> thread in the same LWP - as discussed with bpane on dev@httpd
> recently).  If you don't create enough LWPs, you may enter a
> condition where the scheduler refuses to balance the processes
> correctly (it also acts as a ceiling).  0 lets the OS determine
> the concurrency (on Solaris).
>
> By setting a value, you are attempting to circumvent the OS
> scheduler.  If you ask it to set the concurrency on Solaris, it
> *will* create enough LWPs to equal that concurrency (as you
> create threads to be paired with LWPs).  This is not a hint, but a
> command.  (Yes, the man page for Solaris says that it is a hint,
> but it treats it as a command.)

I don't understand how you can say this.  According to the Single UNIX
Specification:

"The pthread_setconcurrency() function allows an application to inform 
the threads implementation of its desired concurrency level, new_level. 
The actual level of concurrency provided by the implementation as a 
result of this function call is unspecified." 

If Solaris is using the setconcurrency value as a command, then it is
absolutely horked.

As for whether this is a valid thing to do because it circumvents the OS,
of course it's valid.  The OS is written to be generic, because that is the
only way to write a useful OS.  The programmer who is writing an
application knows better than the OS what the thread concurrency should
be for their application.  As a general rule, generalized code performs
worse than code that is written for a specific application.

The OS has to use a slow start to find the best concurrency, because 
otherwise it will create too many LWPs.  With a web server, we know better
than the OS.

Ryan
______________________________________________________________
Ryan Bloom				rbb@apache.org
Covalent Technologies			rbb@covalent.net
--------------------------------------------------------------

Re: [proposal] apr_thread_setconcurrency()

Posted by Justin Erenkrantz <je...@ebuilt.com>.
On Sun, Sep 16, 2001 at 12:55:10AM -0700, Justin Erenkrantz wrote:
> The testlock case doesn't matter because it never hits any of the 
> Solaris-defined entry points.  This is a quirk in the OS and I see 
> no reason to work around it.  If you want to make testlock do the 
> right thing with the Solaris LWP model, use a reader/writer lock
> to synchronize the starting of the threads.  This way you guarantee 
> that all threads are started before you start execution of the 
> tight exclusive loop (which is something that testlock doesn't do 
> now).  You are assuming that the threads are created in parallel -
> nowhere is that ordering guaranteed.

I noticed that your new testlockperf.c does exactly that (testlock.c
doesn't).  Do you still see the serialization on Solaris MP with 
LWPs?  I will try running it here and see what happens.  -- justin


Re: [proposal] apr_thread_setconcurrency()

Posted by Brian Pane <bp...@pacbell.net>.
Aaron Bannert wrote:

>On Mon, Sep 17, 2001 at 10:17:16AM -0700, Brian Pane wrote:
>
>>>So that's 25 ThreadsPerChild + 3 builtin threads (door server, door
>>>client, reaper) = 28, so yeah, it stabilized to the number of simultaneous
>>>requests I expect to handle (aka the number of worker threads).
>>>
>>How were you handling 25 simultaneous requests with just
>>10 concurrent connections in ab?
>>
>
>Bad wording on my part... It stabilized at the number of worker threads
>being used in the system.
>

The quick re-use of the workers in an ab test probably explains why
Solaris ended up creating one LWP per worker thread in this test.

But I wouldn't extend that observation to say that it's a good idea in
general to set the concurrency hint to the number of worker threads.
In the real world, each thread tends to spend a lot more time waiting
for I/O than it does during a stress test.  If you're running a server
with 500 worker threads, you probably don't want 500 LWPs.

>Since the worker queue is FIFO, all the worker threads are used fairly
>soon after they enter the queue. I'll be changing this to LIFO in the
>near future (per Dean's suggestion) for possible cache hits, etc...
>

My hypothesis is that the number of LWPs will drop to ~13 when you
do this: 10 for the concurrent connections, plus the 3 built-in ones.

--Brian



Re: [proposal] apr_thread_setconcurrency()

Posted by Aaron Bannert <aa...@clove.org>.
On Mon, Sep 17, 2001 at 10:17:16AM -0700, Brian Pane wrote:
> >So that's 25 ThreadsPerChild + 3 builtin threads (door server, door
> >client, reaper) = 28, so yeah, it stabilized to the number of simultaneous
> >requests I expect to handle (aka the number of worker threads).
> >
> How were you handling 25 simultaneous requests with just
> 10 concurrent connections in ab?

Bad wording on my part... It stabilized at the number of worker threads
being used in the system.

Since the worker queue is FIFO, all the worker threads are used fairly
soon after they enter the queue. I'll be changing this to LIFO in the
near future (per Dean's suggestion) for possible cache hits, etc...

(My poor linux box doesn't push 25 simultaneous requests very well :)

-aaron


Re: [proposal] apr_thread_setconcurrency()

Posted by Brian Pane <bp...@pacbell.net>.
Aaron Bannert wrote:

>On Sun, Sep 16, 2001 at 07:59:19PM -0700, Aaron Bannert wrote:
>
>>On Sun, Sep 16, 2001 at 12:55:10AM -0700, Justin Erenkrantz wrote:
>>
>>>You also haven't mentioned how many LWPs it stabilized at after
>>>10 seconds?  Did Solaris choose to add an LWP for each user thread?  
>>>I have a feeling it wouldn't, but I may be wrong.  -- justin
>>>
>>I'll follow up this reply with some real numbers.
>>
>
>Uniprocessor Solaris 8 (7/01) i86pc (Athlon)
>worker MPM
>ApacheBench with 10 concurrent requests to a very large shtml page.
>I'm getting around 150r/s average.
>
><IfModule worker.c>
>StartServers         1
>MaxClients           1
>MinSpareThreads      5
>MaxSpareThreads     75
>ThreadsPerChild     25
>MaxRequestsPerChild  0
></IfModule>
>
>The worker MPM has these userspace threads:
>
>main (signal handler) thread
>thread_starter
>1x listener_thread
>ThreadsPerChild number of worker_threads
>
>
>'top' is reporting 28 LWPs after hitting around 60,000 requests as fast as AB
>can go (I hit it for at least a few minutes).
>
>So that's 25 ThreadsPerChild + 3 builtin threads (door server, door
>client, reaper) = 28, so yeah, it stabilized to the number of simultaneous
>requests I expect to handle (aka the number of worker threads).
>
How were you handling 25 simultaneous requests with just
10 concurrent connections in ab?

--Brian



Re: [proposal] apr_thread_setconcurrency()

Posted by Aaron Bannert <aa...@clove.org>.
On Sun, Sep 16, 2001 at 07:59:19PM -0700, Aaron Bannert wrote:
> On Sun, Sep 16, 2001 at 12:55:10AM -0700, Justin Erenkrantz wrote:
> > You also haven't mentioned how many LWPs it stabilized at after
> > 10 seconds?  Did Solaris choose to add an LWP for each user thread?  
> > I have a feeling it wouldn't, but I may be wrong.  -- justin
> 
> I'll follow up this reply with some real numbers.

Uniprocessor Solaris 8 (7/01) i86pc (Athlon)
worker MPM
ApacheBench with 10 concurrent requests to a very large shtml page.
I'm getting around 150r/s average.

<IfModule worker.c>
StartServers         1
MaxClients           1
MinSpareThreads      5
MaxSpareThreads     75
ThreadsPerChild     25
MaxRequestsPerChild  0
</IfModule>

The worker MPM has these userspace threads:

main (signal handler) thread
thread_starter
1x listener_thread
ThreadsPerChild number of worker_threads


'top' is reporting 28 LWPs after hitting around 60,000 requests as fast as AB
can go (I hit it for at least a few minutes).

So that's 25 ThreadsPerChild + 3 builtin threads (door server, door
client, reaper) = 28, so yeah, it stabilized to the number of simultaneous
requests I expect to handle (aka the number of worker threads).

-aaron


Re: [proposal] apr_thread_setconcurrency()

Posted by Ian Holsman <ia...@cnet.com>.
On Sun, 2001-09-16 at 20:13, Justin Erenkrantz wrote:
> On Sun, Sep 16, 2001 at 07:59:19PM -0700, Aaron Bannert wrote:
> > I don't think it's a quirk of the thread library, I think it's
> > fully expected. For the sake of others, here's an excerpt from the
> > Solaris 8 pthread_setconcurrency(3THR) man page:
> 
> In testlockperf, you are assuming that all of the threads have 
> started and will compete for the locks.  In an M:N implementation, 
> this assumption is false.  You end up executing in serial rather
> than in parallel.  This only occurs because you never hit a
> user-scheduler entry point in testlockperf.  In the case of a MPM,
> you will be hitting them left and right.  =-)
> 
> Therefore, you need to devise a strategy within testlockperf to 
> ensure that all of the threads are ready to compete before 
> continuing the test.  The suggested sleep is one way - condition
> variables *may* be possible, but it isn't completely obvious to
> me how that would work.  -- justin
> 
> P.S. If you are running a site where you get 50,000 hits a minute,
> you shouldn't have MRPC at 10,000.  I'd be curious to see what
> cnet runs with.

on our heaviest day (the bombing) we were getting ~7,500 HTML pages 
a minute. assuming ~6 images per page we got ~50,000 hits a minute.
(on a single machine)
this wasn't a normal day, we don't normally do THAT much traffic.
we currently have MaxRequestsPerChild set at '512' on our 1.3 
servers, mainly due to memory leaks.

..ian


-- 
Ian Holsman          IanH@cnet.com
Performance Measurement & Analysis
CNET Networks   -   (415) 364-8608


Re: [proposal] apr_thread_setconcurrency()

Posted by Justin Erenkrantz <je...@ebuilt.com>.
On Sun, Sep 16, 2001 at 08:30:15PM -0700, Aaron Bannert wrote:
> Agreed, but instead of adding sleep we should:
> a) call pthread_setconcurrency()
> b) devise a more life-like test
> c) not do anything cause it's working fine
> 
> testlockperf is really just trying to gauge the overhead from the
> mutex routines, and I think it does a very good job of that. The secondary
> purpose of testlockperf is to compare the old locking API to the new
> one.

Unless you force the lock routines to run in parallel, you aren't
testing the expected common case - therefore, it isn't a good test.
Yes, you could call pthread_setconcurrency(), but I think you are 
going to misjudge the appropriate number to pass to it (as I think 
there is no number that makes sense for all cases).  If you really 
want pthread_setconcurrency to equal the number of threads, you want 
to enforce a bound thread implementation (which is different than 
creating a thread as bound with a multiplexed thread implementation).

At this point, we should both shut up and get some numbers.  -- justin


Re: [proposal] apr_thread_setconcurrency()

Posted by Aaron Bannert <aa...@clove.org>.
On Sun, Sep 16, 2001 at 08:13:25PM -0700, Justin Erenkrantz wrote:
> On Sun, Sep 16, 2001 at 07:59:19PM -0700, Aaron Bannert wrote:
> > I don't think it's a quirk of the thread library, I think it's
> > fully expected. For the sake of others, here's an excerpt from the
> > Solaris 8 pthread_setconcurrency(3THR) man page:
> 
> In testlockperf, you are assuming that all of the threads have 
> started and will compete for the locks.  In an M:N implementation, 
> this assumption is false.  You end up executing in serial rather
> than in parallel.  This only occurs because you never hit a
> user-scheduler entry point in testlockperf.  In the case of a MPM,
> you will be hitting them left and right.  =-)
> 
> Therefore, you need to devise a strategy within testlockperf to 
> ensure that all of the threads are ready to compete before 
> continuing the test.  The suggested sleep is one way - condition
> variables *may* be possible, but it isn't completely obvious to
> me how that would work.  -- justin

Agreed, but instead of adding sleep we should:
a) call pthread_setconcurrency()
b) devise a more life-like test
c) not do anything cause it's working fine

testlockperf is really just trying to gauge the overhead from the
mutex routines, and I think it does a very good job of that. The secondary
purpose of testlockperf is to compare the old locking API to the new
one.

> P.S. If you are running a site where you get 50,000 hits a minute,
> you shouldn't have MRPC at 10,000.  I'd be curious to see what
> cnet runs with.

You're not going to get 50,000 hits a minute on any box that only has
~32,000 ephemeral ports and the Maximum Segment Lifetime set to anything
normal (like 2 minutes). My default Sol8 install can only take down 32k
(non-keepalive) hits in 4 minutes before all the sockets are sitting in
TIME_WAIT.
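
(Rough arithmetic, assuming ~32k usable ports and the usual TIME_WAIT of
2 x MSL = 4 minutes: 32768 ports / 240 seconds comes to ~136 fresh
connections per second, or roughly 8,200 a minute from a single address
pair.)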

-aaron

Re: [proposal] apr_thread_setconcurrency()

Posted by Justin Erenkrantz <je...@ebuilt.com>.
On Sun, Sep 16, 2001 at 07:59:19PM -0700, Aaron Bannert wrote:
> I don't think it's a quirk of the thread library, I think it's
> fully expected. For the sake of others, here's an excerpt from the
> Solaris 8 pthread_setconcurrency(3THR) man page:

In testlockperf, you are assuming that all of the threads have 
started and will compete for the locks.  In an M:N implementation, 
this assumption is false.  You end up executing in serial rather
than in parallel.  This only occurs because you never hit a
user-scheduler entry point in testlockperf.  In the case of a MPM,
you will be hitting them left and right.  =-)

Therefore, you need to devise a strategy within testlockperf to 
ensure that all of the threads are ready to compete before 
continuing the test.  The suggested sleep is one way - condition
variables *may* be possible, but it isn't completely obvious to
me how that would work.  -- justin

P.S. If you are running a site where you get 50,000 hits a minute,
you shouldn't have MRPC at 10,000.  I'd be curious to see what
cnet runs with.


Re: [proposal] apr_thread_setconcurrency()

Posted by Aaron Bannert <aa...@clove.org>.
On Sun, Sep 16, 2001 at 12:55:10AM -0700, Justin Erenkrantz wrote:
> I'm saying that it should never be used.  Simple.  You can't use
> that call properly in any real-world case - just like I don't think 
> you should call sched_yield ever.  You are attempting to solve a 
> problem that is best solved somewhere else - the base operating 
> system.

I aim to prove that there are cases where it is useful. I do not
think that sched_yield should be used, but that's a whole different
story (but I do think we should have a thread_yield for the sake
of netware and other totally userspace thread implementations
-- not to stir up the fire any more ;)

> The testlock case doesn't matter because it never hits any of the 
> Solaris-defined entry points.  This is a quirk in the OS and I see 
> no reason to work around it.  If you want to make testlock do the 
> right thing with the Solaris LWP model, use a reader/writer lock
> to synchronize the starting of the threads.  This way you guarantee 
> that all threads are started before you start execution of the 
> tight exclusive loop (which is something that testlock doesn't do 
> now).  You are assuming that the threads are created in parallel -
> nowhere is that ordering guaranteed.

I don't think it's a quirk of the thread library, I think it's
fully expected. For the sake of others, here's an excerpt from the
Solaris 8 pthread_setconcurrency(3THR) man page:

DESCRIPTION
     Unbound threads in a process may or may not be  required  to
     be  simultaneously active. By default, the threads implemen-
     tation ensures that  a  sufficient  number  of  threads  are
     active  so  that  the process can continue to make progress.
     While this conserves system resources, it  may  not  produce
     the most effective level of concurrency.

     The  pthread_setconcurrency() function allows an application
     to  inform  the  threads  implementation of its desired con-
     currency level, new_level. The actual level  of  concurrency
     provided  by the implementation as a result of this function
     call is unspecified.

...

Although that is a very vague description of the mechanics of this
call, it does make it clear that the initial settings may not
be desired in all cases.

> > In consideration of your statement here I spent some time reading
> > the Solaris 8 libpthread source. On that platform your statement
> > here is false. Calling pthread_setconcurrency (or thr_setconcurrency
> > for that matter) can only change the number of multiplexed LWPs in
> > two ways: either not at all, or by increasing the number. I see
> > no way that it acts as a ceiling.
> 
> Yes, you are correct and I was wrong - I reread the Solaris Internals 
> book on my flight back to LAX today.  It isn't a ceiling.  However, 
> the case of creating too many LWPs is completely valid and is brought 
> up many times in their discussion of LWPs versus a bound thread model.
> Kernel threads are very expensive in Solaris and part of the reason 
> that it handles threads well is because it multiplexes the kernel 
> threads efficiently.  No other OS I have seen handles threads as
> gracefully as Solaris.

Creating too many LWPs may be a problem, and is something I intend
to look into. I do however feel this is something the application
writer is going to have to deal with case-by-case.

In my experimentation with setconcurrency I have arrived at some
conclusions (*on Solaris 8):

- setconcurrency(0) has no effect on the number of LWPs.
- setconcurrency(n) will create new LWPs if n > current_num_lwps,
                     else it will have no effect on the number of LWPs.
- if you set it too high, performance will suffer
- if you set it too low, you will either not take advantage of other CPUs,
  or you will not see it migrate the load to other CPUs until the "LWP
  creation agent" decides it's time to do so.
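
A trivial probe makes this visible (compile with -lpthread; note that
pthread_getconcurrency() only echoes back the last hint you set, so
watch the LWP count in top to see the real effect):

    #include <pthread.h>
    #include <stdio.h>

    int main(void)
    {
        /* 0 means "let the implementation decide"; on Solaris 8 it
           leaves the current number of LWPs untouched */
        pthread_setconcurrency(0);
        printf("hint: %d\n", pthread_getconcurrency());

        /* raising the hint above the current LWP count can create
           LWPs; a lower value is simply ignored -- not a ceiling */
        pthread_setconcurrency(16);
        printf("hint: %d\n", pthread_getconcurrency());
        return 0;
    }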


> I believe SUSv2 called it a "hint" for the general case.  However, in 
> this specific implementation (multiplexed kernel threads), it is not 
> a hint.  It is a request to have that many LWPs.  If you disagree
> with that statement, please look at the code again.

I was very clear in my previous message, and I have restated it in the
above statement. I was refuting the comment you made saying it was a
"command" and not a "hint". It is indeed a hint, only that in the
case where you ask for more LWPs than are currently allocated, it
will *attempt* to create more. In _all other cases_ it will simply
ignore the number you give it. It is not a ceiling.

> I pointed out that number (simultaneous requests) is a completely 
> bogus number to use when dealing with multiplexed kernel threads.
> This poor choice is why I don't think this call belongs in APR at all.  
> If you would care to claim that the number of simultaneous requests is
> the correct number in the context of a multiplexed thread model for
> worker, I would be delighted to hear why - you haven't offered any 
> proof as to its validity.  I indicated why I thought that number was
> wrong.  I'll repeat it again with a bit more of a technical 
> explanation.

As I said at the beginning of this thread, I'd like to use this
call in more places than the worker MPM. I am not sure if this
will provide a benefit to the worker MPM, but if it does then
that is a good starting place.

> Creating all user threads as bound (what you are suggesting for 
> worker by calling pthread_setconcurrency with that value) in a 
> multiplexed thread model works against the thread model rather than 
> with it - this indicates a clash in design.  You want a bound thread 
> library, but refuse to use a bound thread library.  

It's actually worse than creating them as bound. In most cases a bound
thread has an early exit point to the system call in the userspace
implementation. Having a pool of LWPs available to a group of userspace
threads means that they have to be assigned. Bound means you get one
LWP forever.

> Ideally, most of worker MPM's time will be spent dealing with I/O, so
> there is no need to have spurious kernel threads when in such a usage
> pattern.  Solaris has a number of safeguards that will ensure that any
> runnable thread (kernel or user) will run as quickly as it can and it 
> will only create as many kernel threads as are actually dictated by
> the load (if there are really 8 threads ready to run, 8 execution
> contexts will be available).
> 
> With "scheduler activations" (Solaris 2.6+), when a user thread is 
> about to block and other user threads are waiting to execute, the
> running LWP will pass that unbound (but now blocked) thread off to 
> an idle LWP (via doors).  If no free LWPs are available (all LWPs 
> are blocked or executing), a new LWP is spawned (via SIGWAITING) 
> and the now-blocked unbound user thread is transferred.
> 
> This blocked user thread will resume via what Solaris calls "user 
> thread activation" - shared memory and a door call which indicates to 
> the kernel thread when a user thread is ready for execution (i.e. 
> needs the LWP active now because whatever blocked it has now been
> unblocked).  So as soon as the message is sent, the kernel will 
> reschedule the appropriate LWP.
> 
> Okay, back to the original LWP that the user thread was on - it has 
> time left on its original quantum because its user thread was about 
> to end prematurely, it then searches for a waiting unbound thread to
> execute in the remainder of its time.
> 
> In the common case of a user thread blocking with a free LWP already 
> created, you have saved a kernel context switch (the running LWP 
> sticks the user thread in an idle LWP by itself) - this is why this 
> M*N implementation can be faster than bound threads.  The context
> switch is free and the responsiveness is thus higher.  This also 
> causes it to create kernel threads as needed.  
> 
> The entire idea of a multiplexed kernel thread model (such as 
> Solaris) is to minimize the number of actual kernel threads and 
> increase responsiveness.  You would be circumventing that 
> decision by creating bound kernel threads that may not be 
> actually required due to the actual execution pattern of the code.  
> You will also decrease responsiveness because switching between 
> threads now becomes a kernel issue rather than a cheap user-space 
> issue (which is what Solaris wants to do by default).  However,
> you do this in a library that was optimized for multiple 
> user-space threads not bound threads.
> 
> I believe if you really want a bound thread implementation, you should
> tell the OS you want it - not muck around with an indeterminate API to 
> do so that directly circumvents the scheduling/balancing process.

I don't want a bound thread impl, or I would have done that with the
thread attribute at creation time. I want the threads to ramp up fast
and I want them to migrate to other CPUs quickly.
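
(For reference, the creation-time route would have been the standard
POSIX scope attribute -- worker_fn and arg stand in for whatever the
MPM actually passes:

    pthread_t tid;
    pthread_attr_t attr;

    pthread_attr_init(&attr);
    /* bound: permanently pair this thread with its own LWP */
    pthread_attr_setscope(&attr, PTHREAD_SCOPE_SYSTEM);
    pthread_create(&tid, &attr, worker_fn, arg);

That pins each thread to an LWP forever, which is exactly what I said
above I don't want.)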

> > There you go again with this "OS scheduler" thing that I've never heard
> > of. 10 seconds to stabilize is rather long when you consider I have
> > already served O(5000) requests.
> 
> You are really attempting to make this a personal argument here by
> attacking me.  I think this is completely uncalled for and 
> inappropriate.

I apologize for the more snide comments made in my previous message.
They were perhaps inappropriate in this forum. I do however expect
this discussion to narrow in on the facts and come to a rational
conclusion instead of lingering on vague undefined concepts.

> 10 seconds isn't a long time for a server that will be up for months 
> or years.  And, as you said, you pulled that number (10 seconds) out 
> of thin air.  If you can substantiate it with real results, please
> provide them.  I don't consider a case of a 10 second delay for the 
> OS to properly balance itself with a particular thread model an issue.
> And, what is the impact of not having enough LWPs initially?  Were
> you testing on a SMP or UP box?  What was the type of CPU load that
> was being performed before it was balanced (usr, sys, or iowait)?

Unfortunately it may not be true that the server will be up for months or
years. In the best of cases we can hope for MaxRequestsPerChild to be
infinite, but the reality is that 3rd party modules (and even httpd)
may leak memory. IIRC, the default MaxRequestsPerChild is 10000.
If it is taking me 5000 requests to reach a steady state, we are spending
half our time trying to ramp up before having to start all over again.

> You also haven't mentioned how many LWPs it stabilized at after
> 10 seconds?  Did Solaris choose to add an LWP for each user thread?  
> I have a feeling it wouldn't, but I may be wrong.  -- justin

I'll follow up this reply with some real numbers.

-aaron

Re: [proposal] apr_thread_setconcurrency()

Posted by Justin Erenkrantz <je...@ebuilt.com>.
On Sat, Sep 15, 2001 at 04:43:39PM -0700, Aaron Bannert wrote:

> > If you create too many LWPs, you will lose a lot of optimizations 
> > that are present in Solaris (i.e. handover of a mutex to another 
> > thread in the same LWP - as discussed with bpane on dev@httpd 
> > recently).
> 
> Of course, and that is something the caller needs to take into consideration.
> I'm not forcing you to use it, I just think it needs to be available.

I'm saying that it should never be used.  Simple.  You can't use
that call properly in any real-world case - just like I don't think 
you should call sched_yield ever.  You are attempting to solve a 
problem that is best solved somewhere else - the base operating 
system.

The testlock case doesn't matter because it never hits any of the 
Solaris-defined entry points.  This is a quirk in the OS and I see 
no reason to work around it.  If you want to make testlock do the 
right thing with the Solaris LWP model, use a reader/writer lock
to synchronize the starting of the threads.  This way you guarantee 
that all threads are started before you start execution of the 
tight exclusive loop (which is something that testlock doesn't do 
now).  You are assuming that the threads are created in parallel -
nowhere is that ordering guaranteed.
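
Concretely, something like this (raw pthreads; NTHREADS, tids, i, and
run_timed_mutex_loop stand in for whatever testlock already has):

    #include <pthread.h>

    pthread_rwlock_t gate;    /* write-held while the threads spawn */

    void *worker(void *arg)
    {
        pthread_rwlock_rdlock(&gate);   /* park until the gate opens */
        pthread_rwlock_unlock(&gate);
        run_timed_mutex_loop();         /* the tight exclusive loop */
        return NULL;
    }

    /* ... in the spawning thread ... */
    pthread_rwlock_init(&gate, NULL);
    pthread_rwlock_wrlock(&gate);
    for (i = 0; i < NTHREADS; i++)
        pthread_create(&tids[i], NULL, worker, NULL);
    pthread_rwlock_unlock(&gate);       /* everyone competes at once */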

> In consideration of your statement here I spent some time reading
> the Solaris 8 libpthread source. On that platform your statement
> here is false. Calling pthread_setconcurrency (or thr_setconcurrency
> for that matter) can only change the number of multiplexed LWPs in
> two ways: either not at all, or by increasing the number. I see
> no way that it acts as a ceiling.

Yes, you are correct and I was wrong - I reread the Solaris Internals 
book on my flight back to LAX today.  It isn't a ceiling.  However, 
the case of creating too many LWPs is completely valid and is brought 
up many times in their discussion of LWPs versus a bound thread model.
Kernel threads are very expensive in Solaris and part of the reason 
that it handles threads well is because it multiplexes the kernel 
threads efficiently.  No other OS I have seen handles threads as
gracefully as Solaris.

My guess is that in Solaris 9 they reworked the kernel thread API to 
be much faster than before so that it achieves similar 
creation/switching/destruction times to the user-space LWP threads.  
If they did that, I believe that it then makes sense to switch to 
bound threads by default.  (I do need to double check that they have 
switched to a bound threads by default in Solaris 9.)

> >                                          This is not a hint, but a 
> > command.  (Yes, the man page for Solaris says that it is a hint, 
> > but it treats it as a command.)
> 
> Sorry, but that's just BS, and I don't know where you get off making such
> bold unfounded statements. Please just go read the source, they match
> the man pages.

I believe SUSv2 called it a "hint" for the general case.  However, in 
this specific implementation (multiplexed kernel threads), it is not 
a hint.  It is a request to have that many LWPs.  If you disagree
with that statement, please look at the code again.

> > - Let the programmer decide.  Awfully bad choice.  Who knows
> > how the system is setup?  What are you optimizing for?
> 
> This is the only choice I proposed, I don't know what the heck you are
> arguing about in these other things. Of course let the programmer decide,
> that's why it's an API!
> 
> I just gave you an example where I would use it: in the worker MPM.
> In that case it would be the number of simultaneous requests I expect
> to serve.

I pointed out that number (simultaneous requests) is a completely 
bogus number to use when dealing with multiplexed kernel threads.
This poor choice is why I don't think this call belongs in APR at all.  
If you would care to claim that the number of simultaneous requests is
the correct number in the context of a multiplexed thread model for
worker, I would be delighted to hear why - you haven't offered any 
proof as to its validity.  I indicated why I thought that number was
wrong.  I'll repeat it again with a bit more of a technical 
explanation.

Creating all user threads as bound (what you are suggesting for 
worker by calling pthread_setconcurrency with that value) in a 
multiplexed thread model works against the thread model rather than 
with it - this indicates a clash in design.  You want a bound thread 
library, but refuse to use a bound thread library.  

Ideally, most of worker MPM's time will be spent dealing with I/O, so
there is no need to have spurious kernel threads when in such a usage
pattern.  Solaris has a number of safeguards that will ensure that any
runnable thread (kernel or user) will run as quickly as it can and it 
will only create as many kernel threads as are actually dictated by
the load (if there are really 8 threads ready to run, 8 execution
contexts will be available).

With "scheduler activations" (Solaris 2.6+), when a user thread is 
about to block and other user threads are waiting to execute, the
running LWP will pass that unbound (but now blocked) thread off to 
an idle LWP (via doors).  If no free LWPs are available (all LWPs 
are blocked or executing), a new LWP is spawned (via SIGWAITING) 
and the now-blocked unbound user thread is transferred.

This blocked user thread will resume via what Solaris calls "user 
thread activation" - shared memory and a door call which indicates to 
the kernel thread when a user thread is ready for execution (i.e. 
needs the LWP active now because whatever blocked it has now been
unblocked).  So as soon as the message is sent, the kernel will 
reschedule the appropriate LWP.

Okay, back to the original LWP that the user thread was on - it has 
time left on its original quantum because its user thread was about 
to end prematurely, it then searches for a waiting unbound thread to
execute in the remainder of its time.

In the common case of a user thread blocking with a free LWP already 
created, you have saved a kernel context switch (the running LWP 
sticks the user thread in an idle LWP by itself) - this is why this 
M*N implementation can be faster than bound threads.  The context
switch is free and the responsiveness is thus higher.  This also 
causes it to create kernel threads as needed.  

The entire idea of a multiplexed kernel thread model (such as 
Solaris) is to minimize the number of actual kernel threads and 
increase responsiveness.  You would be circumventing that 
decision by creating bound kernel threads that may not be 
actually required due to the actual execution pattern of the code.  
You will also decrease responsiveness because switching between 
threads now becomes a kernel issue rather than a cheap user-space 
issue (which is what Solaris wants to do by default).  However,
you do this in a library that was optimized for multiple 
user-space threads not bound threads.

I believe if you really want a bound thread implementation, you should
tell the OS you want it - not muck around with an indeterminate API to 
do so that directly circumvents the scheduling/balancing process.

> There you go again with this "OS scheduler" thing that I've never heard
> of. 10 seconds to stabilize is rather long when you consider I have
> already served O(5000) requests.

You are really attempting to make this a personal argument here by
attacking me.  I think this is completely uncalled for and 
inappropriate.

10 seconds isn't a long time for a server that will be up for months 
or years.  And, as you said, you pulled that number (10 seconds) out 
of thin air.  If you can substantiate it with real results, please
provide them.  I don't consider a case of a 10 second delay for the 
OS to properly balance itself with a particular thread model an issue.
And, what is the impact of not having enough LWPs initially?  Were
you testing on a SMP or UP box?  What was the type of CPU load that
was being performed before it was balanced (usr, sys, or iowait)?

You also haven't mentioned how many LWPs it stabilized at after
10 seconds?  Did Solaris choose to add an LWP for each user thread?  
I have a feeling it wouldn't, but I may be wrong.  -- justin


Re: [proposal] apr_thread_setconcurrency()

Posted by Aaron Bannert <aa...@clove.org>.
On Fri, Sep 14, 2001 at 06:33:47PM -0700, Justin Erenkrantz wrote:
> On Fri, Sep 14, 2001 at 04:21:51PM -0700, Aaron Bannert wrote:
> > Why would this circumvent the OS scheduler at all? In all cases it
> > is a *hint*. Please be more precise.
> > 
> > I think I showed you an example awhile ago where compute-bound threads
> > behave drastically different depending on the operating system. In
> the case of Solaris, a computationally intensive thread that makes no
> > system calls* will not automatically yield a mutex when entering/exiting
> > a critical section, unless pthread_setconcurrency() is called.
> 
> That statement isn't necessarily correct.  What actually happens
> is that the user scheduler in Solaris never gets executed because
> none of the entry points as defined by the OS (i.e. system calls)
> get executed to trigger the user scheduler's activity during the
> compute-bound function call.  It isn't that it doesn't yield the
> mutex - it is that there is no other thread to yield to as the 
> scheduler on Solaris gives the thread a chance to run *before*
> launching the next thread.  This is a conscious decision on Sun's 
> part when designing their scheduler for Solaris (up to but not 
> including 9).

This may be a more accurate description, but I think we're talking
about the same thing here.

> If you create too many LWPs, you will lose a lot of optimizations 
> that are present in Solaris (i.e. handover of a mutex to another 
> thread in the same LWP - as discussed with bpane on dev@httpd 
> recently).

Of course, and that is something the caller needs to take into consideration.
I'm not forcing you to use it, I just think it needs to be available.

>             If you don't create enough LWPs, you may enter a 
> condition where the scheduler refuses to balance the processes 
> correctly (it also acts as a ceiling).

In consideration of your statement here I spent some time reading
the Solaris 8 libpthread source. On that platform your statement
here is false. Calling pthread_setconcurrency (or thr_setconcurrency
for that matter) can only change the number of multiplexed LWPs in
two ways: either not at all, or by increasing the number. I see
no way that it acts as a ceiling.

For the curious, the code in question begins at:
/usr/src/lib/libthread/common/thread.c:332
(pthread_* is built on top of thr_*)

>                                         0 lets the OS determine
> the concurrency (on Solaris).

Again, on Solaris 8, calling pthread_setconcurrency(0) has absolutely
no effect on the number of LWPs (which, I might add, is what it states
on the man page).

> By setting a value, you are attempting to circumvent the OS 
> scheduler.  If you ask it to set the concurrency on Solaris, it 
> *will* create enough LWPs to equal that concurrency (as you
> create threads to be paired with LWPs).

"I do not think that means what you think it means."

I still don't know what you mean by "circumvent the OS scheduler",
but the second statement is correct, and my point is that is exactly
what I want it to do.

>                                          This is not a hint, but a 
> command.  (Yes, the man page for Solaris says that it is a hint, 
> but it treats it as a command.)

Sorry, but that's just BS, and I don't know where you get off making such
bold unfounded statements. Please just go read the source, they match
the man pages.

> Talking about other OSes besides Solaris is moot because they don't 
> implement an M*N scheduling strategy.  With a bound thread 
> implementation, pthread_setconcurrency is a no-op (what else can
> it do?).  It can only be effective in the case of an LWP-like 
> (multiplexing a kernel thread) scheduling strategy.

Great. This is what I said at the start of this thread. So do you have
a good reason to keep it out of APR or not?

> Furthermore, I think that any values that you may pass into
> pthread_setconcurrency are inherently wrong.  What values will
> you use to set this?  The number of threads?  The number of CPUs?
> Let the programmer decide?  Let the user decide?  IMHO, all of these 
> are bad choices:

[self-fulfilling answers omitted]

> - Let the programmer decide.  Awfully bad choice.  Who knows
> how the system is setup?  What are you optimizing for?

This is the only choice I proposed, I don't know what the heck you are
arguing about in these other things. Of course let the programmer decide,
that's why it's an API!

I just gave you an example where I would use it: in the worker MPM.
In that case it would be the number of simultaneous requests I expect
to serve.

> So, what do I think the correct solution is?  Let the OS decide
> (exactly what it does now).  The OS has access to much better 
> information to make these decisions (i.e. load averages, I/O wait, 
> other processes, num CPUs, etc.).  The goal of the OS is to balance 
> competing processes.  Circumventing the OS scheduler by forcing it 
> to create too many or too few LWPs is the wrong thing.

Oh, just stop that. You can't keep saying "circumventing the OS scheduler"
when that doesn't mean anything! You surely don't mean "somehow getting
around the process scheduler", so just quit it! We're not hacking the
kernel here, we're using fully published POSIX APIs!

> The case of a compute-bound thread merely falls into a specific 
> trap on a specific OS with a specific thread model.  This case
> is typically evident in benchmarks not the real-world.  Most
> applications will enter a system call at *some* point.
> 
> > In a practical sense, when I was playing with the worker MPM I noticed
> > that under high load (maxing out the CPU) it took on the order of 10
> seconds** for the number of LWPs to stabilize.
> 
> I'll live with that - this is due to inherent OS scheduler 
> characteristics.  After 10 seconds, the system stabilizes - the 
> OS has performed its job.  Is there any evidence that this value
> that it stabilized at is incorrect?  What formula would you have
> used to set that number?  Any "hint" that we may give it may end 
> up back-firing rather than helping.

There you go again with this "OS scheduler" thing that I've never heard
of. 10 seconds to stabilize is rather long when you consider I have
already served O(5000) requests.

> In fact, the best solution may be to provide a configure-time 
> option to help the user select the "right" thread model on
> Solaris (i.e. /usr/lib/libthread.so or /usr/lib/lwp/libthread.so).
> You can recommend using the "alternative" thread model for
> certain types of compute-bound applications.  (However, be careful
> on Solaris 9 as they are reversed.)  -- justin

The configure-time option you're talking about is LDFLAGS. You can
also do it at runtime with LD_PRELOAD.
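
For example, on Solaris 8 a stock binary can be pointed at the
alternate (bound) thread library without rebuilding:

    LD_LIBRARY_PATH=/usr/lib/lwp ./httpd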

-aaron

Re: [proposal] apr_thread_setconcurrency()

Posted by Justin Erenkrantz <je...@ebuilt.com>.
On Fri, Sep 14, 2001 at 04:21:51PM -0700, Aaron Bannert wrote:
> Why would this circumvent the OS scheduler at all? In all cases it
> is a *hint*. Please be more precise.
> 
> I think I showed you an example awhile ago where compute-bound threads
> behave drastically different depending on the operating system. In
> the case of Solaris, a computationally intensive thread that makes no
> system calls* will not automatically yield a mutex when entering/exiting
> a critical section, unless pthread_setconcurrency() is called.

That statement isn't necessarily correct.  What actually happens
is that the user scheduler in Solaris never gets executed because
none of the entry points as defined by the OS (i.e. system calls)
get executed to trigger the user scheduler's activity during the
compute-bound function call.  It isn't that it doesn't yield the
mutex - it is that there is no other thread to yield to as the 
scheduler on Solaris gives the thread a chance to run *before*
launching the next thread.  This is a conscious decision on Sun's 
part when designing their scheduler for Solaris (up to but not 
including 9).

If you create too many LWPs, you will lose a lot of optimizations 
that are present in Solaris (i.e. handover of a mutex to another 
thread in the same LWP - as discussed with bpane on dev@httpd 
recently).  If you don't create enough LWPs, you may enter a 
condition where the scheduler refuses to balance the processes 
correctly (it also acts as a ceiling).  0 lets the OS determine
the concurrency (on Solaris).

By setting a value, you are attempting to circumvent the OS 
scheduler.  If you ask it to set the concurrency on Solaris, it 
*will* create enough LWPs to equal that concurrency (as you
create threads to be paired with LWPs).  This is not a hint, but a 
command.  (Yes, the man page for Solaris says that it is a hint, 
but it treats it as a command.)

Talking about other OSes besides Solaris is moot because they don't 
implement an M*N scheduling strategy.  With a bound thread 
implementation, pthread_setconcurrency is a no-op (what else can
it do?).  It can only be effective in the case of an LWP-like 
(multiplexing a kernel thread) scheduling strategy.

Furthermore, I think that any values that you may pass into
pthread_setconcurrency are inherently wrong.  What values will
you use to set this?  The number of threads?  The number of CPUs?
Let the programmer decide?  Let the user decide?  IMHO, all of these 
are bad choices:

- Use number of threads.  When concerning ourselves with the
Solaris M*N scheduler, this is horrific because we have now
lost the optimizations and may have created too many LWPs.  When you
use a bound thread library on Solaris, the overhead of the (now)
useless optimizations don't occur.  So, if you want to use the number 
of threads on Solaris, use the bound thread library instead of the 
LWP thread library.  This obviates the need for pthread_setconcurrency,
since by definition all threads are kernel threads.

- Use number of CPUs.  How would you get this number?  Also, it
is a bit of a red herring: it is not a good number because
your application may be sharing resources with other processes.
If you are primarily I/O-bound, you have just created too many
LWPs and have to incur their overhead because most of the time
the threads are going to be idle waiting for IO.

- Let the programmer decide.  Awfully bad choice.  Who knows
how the system is setup?  What are you optimizing for?

- Let the user decide via a configuration option (like a MPM
directive).  I don't think that we can expect the user to 
fully understand the meaning of this value.  More often than not,
they may set it to either of the wrong values described above.

So, what do I think the correct solution is?  Let the OS decide
(exactly what it does now).  The OS has access to much better 
information to make these decisions (i.e. load averages, I/O wait, 
other processes, num CPUs, etc.).  The goal of the OS is to balance 
competing processes.  Circumventing the OS scheduler by forcing it 
to create too many or too few LWPs is the wrong thing.

The case of a compute-bound thread merely falls into a specific 
trap on a specific OS with a specific thread model.  This case
is typically evident in benchmarks not the real-world.  Most
applications will enter a system call at *some* point.

> In a practical sense, when I was playing with the worker MPM I noticed
> that under high load (maxing out the CPU) it took on the order of 10
> seconds** for the number of LWPs to stabilize.

I'll live with that - this is due to inherent OS scheduler 
characteristics.  After 10 seconds, the system stabilizes - the 
OS has performed its job.  Is there any evidence that this value
that it stabilized at is incorrect?  What formula would you have
used to set that number?  Any "hint" that we may give it may end 
up back-firing rather than helping.

In fact, the best solution may be to provide a configure-time 
option to help the user select the "right" thread model on
Solaris (i.e. /usr/lib/libthread.so or /usr/lib/lwp/libthread.so).
You can recommend using the "alternative" thread model for
certain types of compute-bound applications.  (However, be careful
on Solaris 9 as they are reversed.)  -- justin


Re: [proposal] apr_thread_setconcurrency()

Posted by Aaron Bannert <aa...@clove.org>.
On Fri, Sep 14, 2001 at 03:49:59PM -0700, Justin Erenkrantz wrote:
> On Fri, Sep 14, 2001 at 03:44:48PM -0700, Aaron Bannert wrote:
> > I'd like to propose we add a call that gives a hint to the OS as to
> > the level of concurrency we wish to have. This would mirror
> > pthread_setconcurrency(), and would be a simple call to that on
> > operating systems that have it available. On other platforms it
> > would be a simple no-op.
> 
> The problem with this is that we are going to circumvent the OS
> scheduler which I think is a bad idea - unless we can show where the 
> OS falls down on the job (except in the pedantic case of testthread).
> -- justin

Why would this circumvent the OS scheduler at all? In all cases it
is a *hint*. Please be more precise.

I think I showed you an example awhile ago where compute-bound threads
behave drastically different depending on the operating system. In
the case of Solaris, a computationally intensive thread that makes no
system calls* will not automatically yield a mutex when entering/exiting
a critical section, unless pthread_setconcurrency() is called.
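
The degenerate case is tiny (a fragment -- with the default Solaris 8
M:N library the work ends up running in serial instead of in parallel
unless pthread_setconcurrency(N) is called first):

    #include <pthread.h>

    static pthread_mutex_t m = PTHREAD_MUTEX_INITIALIZER;
    static volatile long counter = 0;

    void *spin(void *arg)
    {
        int i;
        for (i = 0; i < 1000000; i++) {
            pthread_mutex_lock(&m);
            counter++;                /* pure computation, no syscalls */
            pthread_mutex_unlock(&m);
        }
        return NULL;
    }

    /* spawn N of these and compare wall-clock time with and without
       a pthread_setconcurrency(N) call beforehand */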

In a practical sense, when I was playing with the worker MPM I noticed
that under high load (maxing out the CPU) it took on the order of 10
seconds** for the number of LWPs to stabilize.

*Really, these are calls that check the userspace run-queue.

**A number I pulled out of my arse...it took a while at least. Once it
stabilized, the system could handle the load with a few extra cycles to spare.
I'm not sure if it's significant for the worker MPM, but I can show cases
where it is significant.

-aaron

Re: [proposal] apr_thread_setconcurrency()

Posted by Justin Erenkrantz <je...@ebuilt.com>.
On Fri, Sep 14, 2001 at 03:44:48PM -0700, Aaron Bannert wrote:
> I'd like to propose we add a call that gives a hint to the OS as to
> the level of concurrency we wish to have. This would mirror
> pthread_setconcurrency(), and would be a simple call to that on
> operating systems that have it available. On other platforms it
> would be a simple no-op.

The problem with this is that we are going to circumvent the OS
scheduler which I think is a bad idea - unless we can show where the 
OS falls down on the job (except in the pedantic case of testthread).
-- justin


Re: [proposal] apr_thread_setconcurrency()

Posted by Ryan Bloom <rb...@covalent.net>.
On Friday 14 September 2001 03:44 pm, Aaron Bannert wrote:

+1

Ryan

> I'd like to propose we add a call that gives a hint to the OS as to
> the level of concurrency we wish to have. This would mirror
> pthread_setconcurrency(), and would be a simple call to that on
> operating systems that have it available. On other platforms it
> would be a simple no-op.
>
> Give me some +1s and I'll submit a patch.
>
> -aaron

-- 

______________________________________________________________
Ryan Bloom				rbb@apache.org
Covalent Technologies			rbb@covalent.net
--------------------------------------------------------------

Re: Solaris 8 and 9 thread libraries was Re: [proposal] apr_thread_setconcurrency()

Posted by Brian Pane <bp...@pacbell.net>.
Justin Erenkrantz wrote:

>On Sun, Sep 16, 2001 at 04:12:58PM -0700, Justin Erenkrantz wrote:
>
>>Yup.   More precisely /usr/lib/lwp/libthread.so is the "alternate"
>>version and /usr/lib/libthread.so is the "default" version.  They
>>are binary compatible (as far as we care) - therefore the
>>LD_LIBRARY_PATH trick works.  With Solaris 8 (first one to have
>>this alternate version), the default is to use LWPs and the 
>>alternate is bound threads.  AFAIK, Solaris 9 switches them.
>>
>
>% uname -srvm
>SunOS 5.9 Beta sun4u
>% ls -l /usr/lib/lwp/libthread.so.1 /usr/lib/libthread.so.1
>-rwxr-xr-x   1 root     bin       129168 Jun 20 10:40 /usr/lib/libthread.so.1
>lrwxrwxrwx   1 root     root          17 Aug 29 16:41 /usr/lib/lwp/libthread.so.1 -> ../libthread.so.1
>
>% uname -srvm
>SunOS 5.8 Generic_108529-09 i86pc
>jerenkrantz@boris% ls -l /usr/lib/lwp/libthread.so.1 /usr/lib/libthread.so.1
>-rwxr-xr-x   1 root     bin       170724 Jan 24  2001 /usr/lib/libthread.so.1
>-rwxr-xr-x   1 root     bin       108620 Feb 22  2001 /usr/lib/lwp/libthread.so.1
>
>As you can see, the alternate threading library on Solaris 9 just points
>to /usr/lib/libthread.so.1.  Based on what I can see, LWPs are still 
>present, but the performance characteristics and functions executed 
>*look* like a bound thread implementation.  I don't have access to
>the source code, so I have no clue what is going on.
>
That sounds like what I'd expect, based on earlier descriptions of how
Solaris 9 would have a single-layer thread model by default: presumably
they kept the LWP architecture (which seems rather an integral part of
the kernel's process management) and modified the thread and pthread libs
to make user-layer threads automatically bound to LWPs.

--Brian



Solaris 8 and 9 thread libraries was Re: [proposal] apr_thread_setconcurrency()

Posted by Justin Erenkrantz <je...@ebuilt.com>.
On Sun, Sep 16, 2001 at 04:12:58PM -0700, Justin Erenkrantz wrote:
> Yup.   More precisely /usr/lib/lwp/libthread.so is the "alternate"
> version and /usr/lib/libthread.so is the "default" version.  They
> are binary compatible (as far as we care) - therefore the
> LD_LIBRARY_PATH trick works.  With Solaris 8 (first one to have
> this alternate version), the default is to use LWPs and the 
> alternate is bound threads.  AFAIK, Solaris 9 switches them.

% uname -srvm
SunOS 5.9 Beta sun4u
% ls -l /usr/lib/lwp/libthread.so.1 /usr/lib/libthread.so.1
-rwxr-xr-x   1 root     bin       129168 Jun 20 10:40 /usr/lib/libthread.so.1
lrwxrwxrwx   1 root     root          17 Aug 29 16:41 /usr/lib/lwp/libthread.so.1 -> ../libthread.so.1

% uname -srvm
SunOS 5.8 Generic_108529-09 i86pc
jerenkrantz@boris% ls -l /usr/lib/lwp/libthread.so.1 /usr/lib/libthread.so.1
-rwxr-xr-x   1 root     bin       170724 Jan 24  2001 /usr/lib/libthread.so.1
-rwxr-xr-x   1 root     bin       108620 Feb 22  2001 /usr/lib/lwp/libthread.so.1

As you can see, the alternate threading library on Solaris 9 just points
to /usr/lib/libthread.so.1.  Based on what I can see, LWPs are still 
present, but the performance characteristics and functions executed 
*look* like a bound thread implementation.  I don't have access 
to the source code, so I have no clue what is going on.  -- justin


Re: AIX M:N threads? was Re: [proposal] apr_thread_setconcurrency()

Posted by "Victor J. Orlikowski" <vj...@dulug.duke.edu>.
On Sunday, 16 Sep 2001, at 20:05:00,
Aaron Bannert wrote:
> I wouldn't be surprised if it did the same thing as Solaris on
> testlockperf. It remains to be seen, however, how the libs perform
> on Solaris/AIX under real-world scenarios (worker MPM) w/ and w/o
> setconcurrency, and on Uni and Multiprocessor machines.

Easy enough to test (w/out playing too much in the code) on AIX.
Check the AIXTHREAD_MNRATIO environment variable.
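
For example (if memory serves, 8:1 is the default ratio on 4.3.1+):

    AIXTHREAD_MNRATIO=1:1 ./testlockperf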

Victor
-- 
Victor J. Orlikowski   | The Wall is Down, But the Threat Remains!
==================================================================
orlikowski@apache.org  | vjo@dulug.duke.edu | vjo@us.ibm.com

Re: AIX M:N threads? was Re: [proposal] apr_thread_setconcurrency()

Posted by Aaron Bannert <aa...@clove.org>.
On Sun, Sep 16, 2001 at 07:44:58PM -0700, Justin Erenkrantz wrote:
> On Sun, Sep 16, 2001 at 07:11:41PM -0700, Aaron Bannert wrote:
> > The only platforms that I know about that have a two-level thread model
> > are AIX and Solaris. The single-level thread libs ignore setconcurrency
> > because every thread is what Solaris calls a "bound thread", or a kernel
> > scheduled entity (it gets its own process slot). The only exceptions
> > to this rule are fully userspace thread libs, where setconcurrency is
> > inherently maximized at 1.
> 
> Oh, crap, you're right.  AIX has M:N threads by default in 4.3.1+.
> (Isn't it funny that IBM is adopting something that Sun is ditching?)
> 
> Okay, so, what does setconcurrency do on AIX?  How does testlockperf
> work on MP AIX boxes?  I bet it'd do the same bad things as Solaris
> does.  But, I know nothing about AIX.  -- justin

I wouldn't be surprised if it did the same thing as Solaris on
testlockperf. It remains to be seen, however, how the libs perform
on Solaris/AIX under real-world scenarios (worker MPM) w/ and w/o
setconcurrency, and on Uni and Multiprocessor machines.

Whew that's a lot of variables...

Someone want to send me an AIX SMP box? ;)

-aaron

AIX M:N threads? was Re: [proposal] apr_thread_setconcurrency()

Posted by Justin Erenkrantz <je...@ebuilt.com>.
On Sun, Sep 16, 2001 at 07:11:41PM -0700, Aaron Bannert wrote:
> The only platforms that I know about that have a two-level thread model
> are AIX and Solaris. The single-level thread libs ignore setconcurrency
> because every thread is what Solaris calls a "bound thread", or a kernel
> scheduled entity (it gets its own process slot). The only exceptions
> to this rule are fully userspace thread libs, where setconcurrency is
> inherently maximized at 1.

Oh, crap, you're right.  AIX has M:N threads by default in 4.3.1+.
(Isn't it funny that IBM is adopting something that Sun is ditching?)

Okay, so, what does setconcurrency do on AIX?  How does testlockperf
work on MP AIX boxes?  I bet it'd do the same bad things as Solaris
does.  But, I know nothing about AIX.  -- justin


Re: [proposal] apr_thread_setconcurrency()

Posted by Aaron Bannert <aa...@clove.org>.
On Sun, Sep 16, 2001 at 04:12:58PM -0700, Justin Erenkrantz wrote:
> > But of course that case is not terribly relevant for something like
> > httpd-2.0 on a big SMP box, where the optimal case (of which there are
> > many dimensions) cannot be known to the underlying thread/LWP creation
> > agent. That is the key issue at hand here. We, as _users_ of this API
> > would like to maximize each of {requests/second, time/request, number of
> > simultaneous connections} where the LWP creation agent is just trying to
> > get the work done with the least amount of context switching. The dials it
> > has to play with are numerous, and so it must perform a delicate linear
> > programming task in an attempt to meet the same goals as the application
> > programmer. I don't claim that setconcurrency is the way to reduce the
> > number of variables in this equation, but I do suggest we may want to
> > take this into consideration when trying to make our threaded algorithms
> > work the way we expect them to.
> 
> I just don't think it is going to get us what you want.  I think
> the net result with setconcurrency on Solaris with LWPs is to 
> circumvent its balancing algorithms so that it creates too many 
> LWPs.  I think this is the wrong way to attack this problem and 
> goes against the design of their thread library.  On all other 
> platforms (and with bound thread impl on Solaris), setconcurrency 
> is an ignored hint.  -- justin

The only platforms that I know about that have a two-level thread model
are AIX and Solaris. The single-level thread libs ignore setconcurrency
because every thread is what solaris calls a "bound thread", or a kernel
scheduled entity (it gets its own process slot). The only exceptions
to this rule are fully userspace thread libs, where setconcurrency is
inherently maximized at 1.
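
To make this concrete, the whole wrapper would be roughly the
following sketch (HAVE_PTHREAD_SETCONCURRENCY is a stand-in for
whatever the configure check ends up being called, and the error
mapping is hand-waved):

#include <pthread.h>  /* where available */

apr_status_t apr_thread_setconcurrency(int new_level)
{
#ifdef HAVE_PTHREAD_SETCONCURRENCY
    /* pthread_setconcurrency() returns 0 on success or an error number */
    int rv = pthread_setconcurrency(new_level);
    return (rv == 0) ? APR_SUCCESS : rv;
#else
    /* bound-thread (1:1) and pure-userspace libs: nothing to hint */
    return APR_SUCCESS;
#endif
}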

-aaron

Re: [proposal] apr_thread_setconcurrency()

Posted by Justin Erenkrantz <je...@ebuilt.com>.
On Sun, Sep 16, 2001 at 03:29:37PM -0700, Aaron Bannert wrote:
> (just a question, when you say "bound thread impl" you mean
> /usr/lib/lwp/libpthread, right? and the "LWP version" is the default
> /usr/lib/libpthread. Let me know if I have this backward. Maybe
> "LWP version" isn't the best name for the two libs, since they both
> do LWPs.)

Yup.   More precisely /usr/lib/lwp/libthread.so is the "alternate"
version and /usr/lib/libthread.so is the "default" version.  They
are binary compatible (as far as we care) - therefore the
LD_LIBRARY_PATH trick works.  With Solaris 8 (first one to have
this alternate version), the default is to use LWPs and the 
alternate is bound threads.  AFAIK, Solaris 9 switches them.
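A quick way to double-check which library a binary actually picked
up is something like "ldd ./testlockperf | grep thread".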

> I don't really see this as a scheduling quirk on Solaris. I think we
> would all agree that the same "work" is performed in each of the 4 tests,
> with or without sleeps, and with or without setconcurrency.  It is also
> obvious that, since the same "work" is performed in each case, the
> winner of this made-up performance test is the one that does
> the work the fastest, which happens to be the one that creates no new
> LWPs and therefore minimizes the number of kernel context switches.

What happens is that when the sleeps are not present, three of the
threads do not have a chance to execute the test function (they are
stuck in the libc thread_start function).  The LWP is monopolized by 
the one thread that raced to the beginning and started working.  The 
sleep allows all of the threads to hit the lock acquire within
testlockperf (remember that the code that spawns the threads holds the
lock, so the threads can't acquire it - they go to sleep).  Once the 
spawning thread releases the lock, all of the threads wake up
and compete.  Watch it with gdb and pstack on an MP box.  =-)
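
Condensed, the pattern under discussion looks like this (inter_lock
and time_start are testlockperf's; the loop bound and thread names
are abbreviations):

    apr_lock_acquire(inter_lock);       /* spawner holds the lock   */
    for (i = 0; i < MAX_THREADS; i++)
        apr_thread_create(&t[i], NULL, thread_func, NULL, pool);
    apr_sleep(10000);                   /* let every worker reach
                                           its blocked acquire      */
    time_start = apr_time_now();
    apr_lock_release(inter_lock);       /* now they all compete     */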

> But of course that case is not terribly relevant for something like
> httpd-2.0 on a big SMP box, where the optimal case (of which there are
> many dimensions) cannot be known to the underlying thread/LWP creation
> agent. That is the key issue at hand here. We, as _users_ of this API
> would like to optimize each of {requests/second, time/request, number of
> simultaneous connections} where the LWP creation agent is just trying to
> get the work done with the least amount of context switching. The dials it
> has to play with are numerous, and so it must perform a delicate linear
> programming task in an attempt to meet the same goals as the application
> programmer. I don't claim that setconcurrency is the way to reduce the
> number of variables in this equation, but I do suggest we may want to
> take this into consideration when trying to make our threaded algorithms
> work the way we expect them to.

I just don't think it is going to get us what you want.  I think
the net result with setconcurrency on Solaris with LWPs is to 
circumvent its balancing algorithms so that it creates too many 
LWPs.  I think this is the wrong way to attack this problem and 
goes against the design of their thread library.  On all other 
platforms (and with bound thread impl on Solaris), setconcurrency 
is an ignored hint.  -- justin


Re: [proposal] apr_thread_setconcurrency()

Posted by Aaron Bannert <aa...@clove.org>.
On Sun, Sep 16, 2001 at 02:02:39PM -0700, Justin Erenkrantz wrote:
 
> This is on an 8-way box, right?  Those numbers look about right
> for the bound thread implementation.  However, the LWP version
> still looks like it isn't doing the right thing.

(just a question, when you say "bound thread impl" you mean
/usr/lib/lwp/libpthread, right? and the "LWP version" is the default
/usr/lib/libpthread. Let me know if I have this backward. Maybe
"LWP version" isn't the best name for the two libs, since they both
do LWPs.)

> I think you probably need this patch for the LWP version. 
> I'm not sure whether setconcurrency will produce the same 
> effect - it might.
> 
> I'm not going to commit this because I know it's the wrong thing 
> (should be condition vars I think), but it solves the scheduling 
> quirk on Solaris for now.  -- justin

I don't really see this as a scheduling quirk on Solaris. I think we
would all agree that the same "work" is performed in each of the 4 tests,
with or without sleeps, and with or without setconcurrency.  It is also
obvious that, since the same "work" is performed in each case, the
winner of this made-up performance test is the one that does
the work the fastest, which happens to be the one that creates no new
LWPs and therefore minimizes the number of kernel context switches.

But of course that case is not terribly relevant for something like
httpd-2.0 on a big SMP box, where the optimal case (of which there are
many dimensions) cannot be known to the underlying thread/LWP creation
agent. That is the key issue at hand here. We, as _users_ of this API
would like to optimize each of {requests/second, time/request, number of
simultaneous connections} where the LWP creation agent is just trying to
get the work done with the least amount of context switching. The dials it
has to play with are numerous, and so it must perform a delicate linear
programming task in an attempt to meet the same goals as the application
programmer. I don't claim that setconcurrency is the way to reduce the
number of variables in this equation, but I do suggest we may want to
take this into consideration when trying to make our threaded algorithms
work the way we expect them to.

-aaron

Re: [proposal] apr_thread_setconcurrency()

Posted by Justin Erenkrantz <je...@ebuilt.com>.
On Sun, Sep 16, 2001 at 01:53:42PM -0700, Ian Holsman wrote:
> $ ./testlockperf
> APR Lock Performance Test
> ==============
> 
> apr_lock(INTRAPROCESS, MUTEX) Lock Tests
>     Initializing the apr_lock_t                             OK
>     Starting all the threads                                OK
> microseconds: 9634489 usec
> apr_thread_mutex_t Tests
>     Initializing the apr_thread_mutex_t                     OK
>     Starting all the threads                                OK
> microseconds: 7333845 usec
> apr_lock(INTRAPROCESS, READWRITE) Lock Tests
>     Initializing the apr_lock_t                             OK
> microseconds: 11365100 usec
> apr_thread_mutex_t Tests
>     Initializing the apr_thread_mutex_t                     OK
> microseconds: 8443761 usec
> 
> $ export LD_LIBRARY_PATH=/usr/lib/lwp/
> 
> $ ./testlockperf
> APR Lock Performance Test
> ==============
> 
> apr_lock(INTRAPROCESS, MUTEX) Lock Tests
>     Initializing the apr_lock_t                             OK
>     Starting all the threads                                OK
> microseconds: 25322674 usec
> apr_thread_mutex_t Tests
>     Initializing the apr_thread_mutex_t                     OK
>     Starting all the threads                                OK
> microseconds: 23590762 usec
> apr_lock(INTRAPROCESS, READWRITE) Lock Tests
>     Initializing the apr_lock_t                             OK
> microseconds: 23106303 usec
> apr_thread_mutex_t Tests
>     Initializing the apr_thread_mutex_t                     OK
> microseconds: 19515490 usec

This is on an 8-way box, right?  Those numbers look about right
for the bound thread implementation.  However, the LWP version
still looks like it isn't doing the right thing.

I think you probably need this patch for the LWP version. 
I'm not sure whether setconcurrency will produce the same 
effect - it might.

I'm not going to commit this because I know it's the wrong thing 
(should be condition vars I think), but it solves the scheduling 
quirk on Solaris for now.  -- justin

Index: testlockperf.c
===================================================================
RCS file: /home/cvs/apr/test/testlockperf.c,v
retrieving revision 1.4
diff -u -r1.4 testlockperf.c
--- testlockperf.c	2001/09/16 17:15:39	1.4
+++ testlockperf.c	2001/09/16 20:47:37
@@ -176,6 +176,7 @@
     }
     printf("OK\n");
 
+    apr_sleep(10000);
     time_start = apr_time_now();
     apr_lock_release(inter_lock);
 
@@ -226,6 +227,7 @@
     }
     printf("OK\n");
 
+    apr_sleep(10000);
     time_start = apr_time_now();
     apr_thread_mutex_unlock(thread_lock);
 
@@ -275,6 +277,7 @@
         return s1;
     }
 
+    apr_sleep(10000);
     time_start = apr_time_now();
     apr_lock_release(inter_rwlock);
 
@@ -323,6 +326,7 @@
         return s1;
     }
 
+    apr_sleep(10000);
     time_start = apr_time_now();
     apr_thread_rwlock_unlock(thread_rwlock);
 
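
For the record, the condition-variable version I have in mind is a
start gate: two condition variables sharing one mutex, so the timer
only starts once every worker has checked in.  All of these names
(start_mutex, ready_cond, go_cond, ready_count, go, MAX_THREADS) are
illustrative, and the code is untested:

static apr_thread_mutex_t *start_mutex;
static apr_thread_cond_t *ready_cond;   /* workers -> spawner */
static apr_thread_cond_t *go_cond;      /* spawner -> workers */
static int ready_count = 0;
static int go = 0;

/* at the top of each worker function: */
apr_thread_mutex_lock(start_mutex);
ready_count++;
apr_thread_cond_signal(ready_cond);     /* tell the spawner we're here */
while (!go)
    apr_thread_cond_wait(go_cond, start_mutex);
apr_thread_mutex_unlock(start_mutex);

/* in the spawning thread, after all the apr_thread_create() calls: */
apr_thread_mutex_lock(start_mutex);
while (ready_count < MAX_THREADS)
    apr_thread_cond_wait(ready_cond, start_mutex);
go = 1;
time_start = apr_time_now();            /* everyone is ready - go */
apr_thread_cond_broadcast(go_cond);
apr_thread_mutex_unlock(start_mutex);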


Re: [proposal] apr_thread_setconcurrency()

Posted by Ian Holsman <Ia...@cnet.com>.
On Sun, 2001-09-16 at 11:21, Justin Erenkrantz wrote:
> On Fri, Sep 14, 2001 at 08:16:03PM -0700, Ian Holsman wrote:
> > > Oh...
> > > I ran the the testlockperf code on the 8-way box, with
> > > the pthread_setconcurrency calls commented out, and with
> > > the concurrency calls put in (setting them to 8).
> > > results are as follows
> 
> Could you rerun this test with LD_LIBRARY_PATH set like:
> 
> LD_LIBRARY_PATH=/usr/lib/lwp ./testlockperf
> 
> What do you see?  My results and comments:
> 
> http://www.apache.org/~jerenkrantz/testlockperf.html
> 
> Please read it if you have a chance.  -- justin
(I commented out the setconcurrency calls)

$ ./testlockperf
APR Lock Performance Test
==============

apr_lock(INTRAPROCESS, MUTEX) Lock Tests
    Initializing the apr_lock_t                             OK
    Starting all the threads                                OK
microseconds: 9634489 usec
apr_thread_mutex_t Tests
    Initializing the apr_thread_mutex_t                     OK
    Starting all the threads                                OK
microseconds: 7333845 usec
apr_lock(INTRAPROCESS, READWRITE) Lock Tests
    Initializing the apr_lock_t                             OK
microseconds: 11365100 usec
apr_thread_mutex_t Tests
    Initializing the apr_thread_mutex_t                     OK
microseconds: 8443761 usec

$ export LD_LIBRARY_PATH=/usr/lib/lwp/

$ ./testlockperf
APR Lock Performance Test
==============

apr_lock(INTRAPROCESS, MUTEX) Lock Tests
    Initializing the apr_lock_t                             OK
    Starting all the threads                                OK
microseconds: 25322674 usec
apr_thread_mutex_t Tests
    Initializing the apr_thread_mutex_t                     OK
    Starting all the threads                                OK
microseconds: 23590762 usec
apr_lock(INTRAPROCESS, READWRITE) Lock Tests
    Initializing the apr_lock_t                             OK
microseconds: 23106303 usec
apr_thread_mutex_t Tests
    Initializing the apr_thread_mutex_t                     OK
microseconds: 19515490 usec


-- 
Ian Holsman
Performance Measurement & Analysis
CNET Networks    -    415 364-8608

Re: [proposal] apr_thread_setconcurrency()

Posted by Justin Erenkrantz <je...@ebuilt.com>.
On Fri, Sep 14, 2001 at 08:16:03PM -0700, Ian Holsman wrote:
> > Oh...
> > I ran the testlockperf code on the 8-way box, with
> > the pthread_setconcurrency calls commented out, and with
> > the concurrency calls put in (setting them to 8).
> > results are as follows

Could you rerun this test with LD_LIBRARY_PATH set like:

LD_LIBRARY_PATH=/usr/lib/lwp ./testlockperf

What do you see?  My results and comments:

http://www.apache.org/~jerenkrantz/testlockperf.html

Please read it if you have a chance.  -- justin


Re: [proposal] apr_thread_setconcurrency()

Posted by Aaron Bannert <aa...@clove.org>.
On Fri, Sep 14, 2001 at 08:16:03PM -0700, Ian Holsman wrote:
> > +1 IF the number you set it to is a hint, and solaris can change the
> > concurrency afterwards according to the load on the system/internal
> > guidelines.

This is how it appears to work according to the source. I'll try fooling
with worker and make sure it works the way I'm expecting.

> > I ran the testlockperf code on the 8-way box, with
> > the pthread_setconcurrency calls commented out, and with
> > the concurrency calls put in (setting them to 8).
> > results are as follows
> > 
> > (without setconcurrency)
> > APR Lock Performance Test
> > ==============
> > 
> > apr_lock(INTRAPROCESS, MUTEX) Lock Tests
> > microseconds: 9373710 usec
> > apr_thread_mutex_t Tests
> > microseconds: 7304314 usec
> > apr_lock(INTRAPROCESS, READWRITE) Lock Tests
> > microseconds: 11247506 usec
> > apr_thread_mutex_t Tests
> > microseconds: 8148914 usec
> > 
> > (with pthread_setconcurrency(8) where you put the comments)
> > APR Lock Performance Test
> > ==============
> > 
> > apr_lock(INTRAPROCESS, MUTEX) Lock Tests
> > microseconds: 20054346 usec
> > apr_thread_mutex_t Tests
> > microseconds: 16979410 usec
> > apr_lock(INTRAPROCESS, READWRITE) Lock Tests
> microseconds: 247538114 usec
> apr_thread_mutex_t Tests
> microseconds: 250328270 usec

This is a perfect example of what happens when you *don't* set
the concurrency level (on solaris). What's really happening here
is that the threads are not being interleaved; instead each just
runs its entire 1-million-iteration loop, mutex locking and unlocking
included, without any concurrency. Try this patch and you'll
see what I mean:

[This patch is for illustrative purposes only, not for CVS]

Index: testlockperf.c
===================================================================
RCS file: /home/cvspublic/apr/test/testlockperf.c,v
retrieving revision 1.2
diff -u -r1.2 testlockperf.c
--- testlockperf.c	2001/09/15 05:23:55	1.2
+++ testlockperf.c	2001/09/15 23:55:28
@@ -103,11 +103,13 @@
 {
     int i;
 
+    printf("thread %p started\n", thd);
     for (i = 0; i < MAX_COUNTER; i++) {
         apr_lock_acquire(inter_lock);
         mutex_counter++;
         apr_lock_release(inter_lock);
     }
+    printf("thread %p done\n", thd);
     return NULL;
 }
 
@@ -115,11 +117,13 @@
 {
     int i;
 
+    printf("thread %p started\n", thd);
     for (i = 0; i < MAX_COUNTER; i++) {
         apr_thread_mutex_lock(thread_lock);
         mutex_counter++;
         apr_thread_mutex_unlock(thread_lock);
     }
+    printf("thread %p done\n", thd);
     return NULL;
 }
 
@@ -127,11 +131,13 @@
 {
     int i;
 
+    printf("thread %p started\n", thd);
     for (i = 0; i < MAX_COUNTER; i++) {
         apr_lock_acquire_rw(inter_rwlock, APR_WRITER);
         mutex_counter++;
         apr_lock_release(inter_rwlock);
     }
+    printf("thread %p done\n", thd);
     return NULL;
 }
 
@@ -139,11 +145,13 @@
 {
     int i;
 
+    printf("thread %p started\n", thd);
     for (i = 0; i < MAX_COUNTER; i++) {
         apr_thread_rwlock_wrlock(thread_rwlock);
         mutex_counter++;
         apr_thread_rwlock_unlock(thread_rwlock);
     }
+    printf("thread %p done\n", thd);
     return NULL;
 }
 

-aaron

Re: [proposal] apr_thread_setconcurrency()

Posted by Ian Holsman <ia...@cnet.com>.
Ian Holsman wrote:

> Aaron Bannert wrote:
> 
>> I'd like to propose we add a call that gives a hint to the OS as to
>> the level of concurrency we wish to have. This would mirror
>> pthread_setconcurrency(), and would be a simple call to that on
>> operating systems that have it available. On other platforms it
>> would be simple noop.
>>
>> Give me some +1s and I'll submit a patch.
>>
>> -aaron
>>
> 
> +1 IF the number you set it to is a hint, and solaris can change the
> concurrency afterwards according to the load on the system/internal
> guidelines.
> 
> Oh...
> I ran the testlockperf code on the 8-way box, with
> the pthread_setconcurrency calls commented out, and with
> the concurrency calls put in (setting them to 8).
> results are as follows
> 
> (without setconcurrency)
> APR Lock Performance Test
> ==============
> 
> apr_lock(INTRAPROCESS, MUTEX) Lock Tests
>     Initializing the apr_lock_t                             OK
>     Starting all the threads                                OK
> microseconds: 9373710 usec
> apr_thread_mutex_t Tests
>     Initializing the apr_thread_mutex_t                     OK
>     Starting all the threads                                OK
> microseconds: 7304314 usec
> apr_lock(INTRAPROCESS, READWRITE) Lock Tests
>     Initializing the apr_lock_t                             OK
> microseconds: 11247506 usec
> apr_thread_mutex_t Tests
>     Initializing the apr_thread_mutex_t                     OK
> 
> microseconds: 8148914 usec
> 
> (with pthread_setconcurrency(8) where you put the comments)
> APR Lock Performance Test
> ==============
> 
> apr_lock(INTRAPROCESS, MUTEX) Lock Tests
>     Initializing the apr_lock_t                             OK
>     Starting all the threads                                OK
> microseconds: 20054346 usec
> apr_thread_mutex_t Tests
>     Initializing the apr_thread_mutex_t                     OK
>     Starting all the threads                                OK
> microseconds: 16979410 usec
> apr_lock(INTRAPROCESS, READWRITE) Lock Tests
>     Initializing the apr_lock_t                             OK
> 

microseconds: 247538114 usec
apr_thread_mutex_t Tests
     Initializing the apr_thread_mutex_t                     OK
microseconds: 250328270 usec

(I didn't wait long enough)


> --It just sits at this point....
> (CVS code is a couple of days old if that makes a difference)




Re: [proposal] apr_thread_setconcurrency()

Posted by Ian Holsman <ia...@cnet.com>.
Aaron Bannert wrote:

> I'd like to propose we add a call that gives a hint to the OS as to
> the level of concurrency we wish to have. This would mirror
> pthread_setconcurrency(), and would be a simple call to that on
> operating systems that have it available. On other platforms it
> would be simple noop.
> 
> Give me some +1s and I'll submit a patch.
> 
> -aaron
> 

+1 IF the number you set it to is a hint, and solaris can change the
concurrency afterwards according to the load on the system/internal
guidelines.

Oh...
I ran the testlockperf code on the 8-way box, with
the pthread_setconcurrency calls commented out, and with
the concurrency calls put in (setting them to 8).
results are as follows

(without setconcurrency)
APR Lock Performance Test
==============

apr_lock(INTRAPROCESS, MUTEX) Lock Tests
     Initializing the apr_lock_t                             OK
     Starting all the threads                                OK
microseconds: 9373710 usec
apr_thread_mutex_t Tests
     Initializing the apr_thread_mutex_t                     OK
     Starting all the threads                                OK
microseconds: 7304314 usec
apr_lock(INTRAPROCESS, READWRITE) Lock Tests
     Initializing the apr_lock_t                             OK
microseconds: 11247506 usec
apr_thread_mutex_t Tests
     Initializing the apr_thread_mutex_t                     OK

microseconds: 8148914 usec

(with pthread_setconcurrency(8) where you put the comments)
APR Lock Performance Test
==============

apr_lock(INTRAPROCESS, MUTEX) Lock Tests
     Initializing the apr_lock_t                             OK
     Starting all the threads                                OK
microseconds: 20054346 usec
apr_thread_mutex_t Tests
     Initializing the apr_thread_mutex_t                     OK
     Starting all the threads                                OK
microseconds: 16979410 usec
apr_lock(INTRAPROCESS, READWRITE) Lock Tests
     Initializing the apr_lock_t                             OK

--It just sits at this point....
(CVS code is a couple of days old if that makes a difference)