Posted to dev@apr.apache.org by Aaron Bannert <aa...@clove.org> on 2001/09/15 00:44:48 UTC
[proposal] apr_thread_setconcurrency()
I'd like to propose we add a call that gives a hint to the OS as to
the level of concurrency we wish to have. This would mirror
pthread_setconcurrency(), and would be a simple call to that on
operating systems that have it available. On other platforms it
would be a simple no-op.
Give me some +1s and I'll submit a patch.
-aaron
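A minimal sketch of what such a wrapper might look like. The function name, the return convention, and the SKETCH_NO_SETCONCURRENCY macro are all hypothetical and stand in for whatever APR's build would actually detect; only pthread_setconcurrency() itself is a real call.

```c
#define _XOPEN_SOURCE 700   /* asks the headers to declare pthread_setconcurrency() */
#include <assert.h>
#include <errno.h>
#include <pthread.h>
#include <unistd.h>

/* Hypothetical wrapper: passes the hint through where the platform has
 * pthread_setconcurrency(), and is a no-op elsewhere.
 * Returns 0 on success or an errno-style code. */
int sketch_thread_setconcurrency(int new_level)
{
    if (new_level < 0)
        return EINVAL;                      /* negative levels are never valid */
#if defined(_POSIX_THREADS) && !defined(SKETCH_NO_SETCONCURRENCY)
    /* The implementation is free to ignore the hint entirely. */
    return pthread_setconcurrency(new_level);
#else
    (void)new_level;                        /* no-op on platforms without the call */
    return 0;
#endif
}
```

A caller would pass 0 to leave the decision to the threads implementation, or a positive level as a hint. What the platform does with a positive level is unspecified, so the return value should be checked rather than assumed.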
Re: [proposal] apr_thread_setconcurrency()
Posted by Ryan Bloom <rb...@covalent.net>.
> If you create too many LWPs, you will lose a lot of optimizations
> that are present in Solaris (i.e. handover of a mutex to another
> thread in the same LWP - as discussed with bpane on dev@httpd
> recently). If you don't create enough LWPs, you may enter a
> condition where the scheduler refuses to balance the processes
> correctly (it also acts as a ceiling). 0 lets the OS determine
> the concurrency (on Solaris).
>
> By setting a value, you are attempting to circumvent the OS
> scheduler. If you ask it to set the concurrency on Solaris, it
> *will* create enough LWPs to equal that concurrency (as you
> create threads to be paired with LWPs). This is not a hint, but a
> command. (Yes, the man page for Solaris says that it is a hint,
> but it treats it as a command.)
I don't understand how you can say this. According to the Single UNIX Specification:
"The pthread_setconcurrency() function allows an application to inform
the threads implementation of its desired concurrency level, new_level.
The actual level of concurrency provided by the implementation as a
result of this function call is unspecified."
If Solaris is using the setconcurrency value as a command, then it is
absolutely horked.
As for whether this is a valid thing to do because it circumvents the OS,
of course it's valid. The OS is written to be generic, because that is the
only way to write a useful OS. The programmer who is writing an
application knows better than the OS what the thread concurrency should
be for their application. As a general rule, generalized code performs
worse than code that is written for a specific application.
The OS has to use a slow start to find the best concurrency, because
otherwise it will create too many LWPs. With a web server, we know better
than the OS.
Ryan
______________________________________________________________
Ryan Bloom rbb@apache.org
Covalent Technologies rbb@covalent.net
--------------------------------------------------------------
Re: [proposal] apr_thread_setconcurrency()
Posted by Justin Erenkrantz <je...@ebuilt.com>.
On Sun, Sep 16, 2001 at 12:55:10AM -0700, Justin Erenkrantz wrote:
> The testlock case doesn't matter because it never hits any of the
> Solaris-defined entry points. This is a quirk in the OS and I see
> no reason to work around it. If you want to make testlock do the
> right thing with the Solaris LWP model, use a reader/writer lock
> to synchronize the starting of the threads. This way you guarantee
> that all threads are started before you start execution of the
> tight exclusive loop (which is something that testlock doesn't do
> now). You are assuming that the threads are created in parallel -
> nowhere is that ordering guaranteed.
I noticed that your new testlockperf.c does exactly that (testlock.c
doesn't). Do you still see the serialization on Solaris MP with
LWPs? I will try running it here and see what happens. -- justin
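The reader/writer-lock start gate described above could look roughly like this. It is a sketch only: run_gated, contender and the other names are made up here, and this is not the code in testlock.c or testlockperf.c. The creating thread holds the write lock while it spawns the workers, so no worker can enter the contended loop until every worker has been created.

```c
#define _XOPEN_SOURCE 700
#include <assert.h>
#include <pthread.h>

#define MAX_THREADS 32

static pthread_rwlock_t start_gate = PTHREAD_RWLOCK_INITIALIZER;
static pthread_mutex_t counter_lock = PTHREAD_MUTEX_INITIALIZER;
static long counter;        /* shared value protected by counter_lock */
static int iterations;      /* loop count, set before any thread is created */

static void *contender(void *arg)
{
    int i;
    (void)arg;
    /* Blocks here until the creator drops the write lock. */
    pthread_rwlock_rdlock(&start_gate);
    pthread_rwlock_unlock(&start_gate);
    /* The tight exclusive loop that the test wants to measure. */
    for (i = 0; i < iterations; i++) {
        pthread_mutex_lock(&counter_lock);
        counter++;
        pthread_mutex_unlock(&counter_lock);
    }
    return NULL;
}

/* Creates nthreads contenders behind the gate, releases them together,
 * and returns the final counter value. */
long run_gated(int nthreads, int iters)
{
    pthread_t tid[MAX_THREADS];
    int i;

    assert(nthreads > 0 && nthreads <= MAX_THREADS);
    counter = 0;
    iterations = iters;

    pthread_rwlock_wrlock(&start_gate);          /* close the gate */
    for (i = 0; i < nthreads; i++) {
        int rc = pthread_create(&tid[i], NULL, contender, NULL);
        assert(rc == 0);
    }
    pthread_rwlock_unlock(&start_gate);          /* open it for everyone */

    for (i = 0; i < nthreads; i++)
        pthread_join(tid[i], NULL);
    return counter;
}
```

The gate only guarantees that all threads exist before the loop starts. Whether they then run in parallel is still up to the threads implementation.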
Re: [proposal] apr_thread_setconcurrency()
Posted by Brian Pane <bp...@pacbell.net>.
Aaron Bannert wrote:
>On Mon, Sep 17, 2001 at 10:17:16AM -0700, Brian Pane wrote:
>
>>>So that's 25 ThreadsPerChild + 3 builtin threads (door server, door
>>>client, reaper) = 28, so yeah, it stabilized to the number of simultaneous
>>>requests I expect to handle (aka the number of worker threads).
>>>
>>How were you handling 25 simultaneous requests with just
>>10 concurrent connections in ab?
>>
>
>Bad wording on my part... It stabilized at the number of worker threads
>being used in the system.
>
The quick re-use of the workers in an ab test probably explains why
Solaris ended up creating one LWP per worker thread in this test.
But I wouldn't extend that observation to say that it's a good idea in
general to set the concurrency hint to the number of worker threads.
In the real world, each thread tends to spend a lot more time waiting
for I/O than it does during a stress test. If you're running a server
with 500 worker threads, you probably don't want 500 LWPs.
>Since the worker queue is FIFO, all the worker threads are used fairly
>soon after they enter the queue. I'll be changing this to LIFO in the
>near future (per Dean's suggestion) for possible cache hits, etc...
>
My hypothesis is that the number of LWPs will drop to ~13 when you
do this: 10 for the concurrent connections, plus the 3 built-in ones.
--Brian
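The FIFO versus LIFO effect can be illustrated with a toy model. This is not the worker MPM's queue code; it assumes, purely for illustration, that requests arrive in rounds of a fixed size and that every worker in a round finishes before the next round starts. The function name and parameters are hypothetical.

```c
#include <assert.h>
#include <string.h>

#define MAX_WORKERS 64

/* Counts how many distinct workers ever handle a request when `concurrent`
 * requests arrive per round. lifo != 0 picks the most recently idled worker;
 * lifo == 0 picks the worker that has been idle longest. */
int distinct_workers_used(int workers, int concurrent, int rounds, int lifo)
{
    int idle[MAX_WORKERS];   /* idle[0] is the longest-idle worker */
    int busy[MAX_WORKERS];
    int used[MAX_WORKERS];
    int n_idle, i, r, count = 0;

    assert(workers > 0 && workers <= MAX_WORKERS);
    assert(concurrent > 0 && concurrent <= workers);

    memset(used, 0, sizeof(used));
    for (i = 0; i < workers; i++)
        idle[i] = i;
    n_idle = workers;

    for (r = 0; r < rounds; r++) {
        /* Hand one request to each of `concurrent` idle workers. */
        for (i = 0; i < concurrent; i++) {
            int w;
            if (lifo) {
                w = idle[--n_idle];               /* take from the tail */
            } else {
                w = idle[0];                      /* take from the head */
                memmove(idle, idle + 1, (size_t)(n_idle - 1) * sizeof(int));
                n_idle--;
            }
            busy[i] = w;
            used[w] = 1;
        }
        /* All requests finish; workers rejoin at the tail. */
        for (i = 0; i < concurrent; i++)
            idle[n_idle++] = busy[i];
    }

    for (i = 0; i < workers; i++)
        count += used[i];
    return count;
}
```

With 25 workers and 10 requests per round, the FIFO pool cycles through all 25 workers, while the LIFO pool keeps reusing the same 10. That matches the hypothesis of roughly 10 active LWPs plus the 3 built-in ones, under this model's assumptions.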
Re: [proposal] apr_thread_setconcurrency()
Posted by Aaron Bannert <aa...@clove.org>.
On Mon, Sep 17, 2001 at 10:17:16AM -0700, Brian Pane wrote:
> >So that's 25 ThreadsPerChild + 3 builtin threads (door server, door
> >client, reaper) = 28, so yeah, it stabilized to the number of simultaneous
> >requests I expect to handle (aka the number of worker threads).
> >
> How were you handling 25 simultaneous requests with just
> 10 concurrent connections in ab?
Bad wording on my part... It stabilized at the number of worker threads
being used in the system.
Since the worker queue is FIFO, all the worker threads are used fairly
soon after they enter the queue. I'll be changing this to LIFO in the
near future (per Dean's suggestion) for possible cache hits, etc...
(My poor linux box doesn't push 25 simultaneous requests very well :)
-aaron
Re: [proposal] apr_thread_setconcurrency()
Posted by Brian Pane <bp...@pacbell.net>.
Aaron Bannert wrote:
>On Sun, Sep 16, 2001 at 07:59:19PM -0700, Aaron Bannert wrote:
>
>>On Sun, Sep 16, 2001 at 12:55:10AM -0700, Justin Erenkrantz wrote:
>>
>>>You also haven't mentioned how many LWPs it stabilized at after
>>>10 seconds? Did Solaris choose to add a LWP for each user thread?
>>>I have a feeling it wouldn't, but I may be wrong. -- justin
>>>
>>I'll follow up this reply with some real numbers.
>>
>
>Uniprocessor Solaris 8 (7/01) i86pc (Athlon)
>worker MPM
>ApacheBench with 10 concurrent requests to a very large shtml page.
>I'm getting around 150r/s average.
>
><IfModule worker.c>
>StartServers 1
>MaxClients 1
>MinSpareThreads 5
>MaxSpareThreads 75
>ThreadsPerChild 25
>MaxRequestsPerChild 0
></IfModule>
>
>The worker MPM has these userspace threads:
>
>main (signal handler) thread
>thread_starter
>1x listener_thread
>ThreadsPerChild number of worker_threads
>
>
>'top' is reporting 28 LWPs after hitting around 60,000 requests as fast as AB
>can go (I hit it for at least a few minutes).
>
>So that's 25 ThreadsPerChild + 3 builtin threads (door server, door
>client, reaper) = 28, so yeah, it stabilized to the number of simultaneous
>requests I expect to handle (aka the number of worker threads).
>
How were you handling 25 simultaneous requests with just
10 concurrent connections in ab?
--Brian
Re: [proposal] apr_thread_setconcurrency()
Posted by Aaron Bannert <aa...@clove.org>.
On Sun, Sep 16, 2001 at 07:59:19PM -0700, Aaron Bannert wrote:
> On Sun, Sep 16, 2001 at 12:55:10AM -0700, Justin Erenkrantz wrote:
> > You also haven't mentioned how many LWPs it stabilized at after
> > 10 seconds? Did Solaris choose to add a LWP for each user thread?
> > I have a feeling it wouldn't, but I may be wrong. -- justin
>
> I'll follow up this reply with some real numbers.
Uniprocessor Solaris 8 (7/01) i86pc (Athlon)
worker MPM
ApacheBench with 10 concurrent requests to a very large shtml page.
I'm getting around 150r/s average.
<IfModule worker.c>
StartServers 1
MaxClients 1
MinSpareThreads 5
MaxSpareThreads 75
ThreadsPerChild 25
MaxRequestsPerChild 0
</IfModule>
The worker MPM has these userspace threads:
main (signal handler) thread
thread_starter
1x listener_thread
ThreadsPerChild number of worker_threads
'top' is reporting 28 LWPs after hitting around 60,000 requests as fast as AB
can go (I hit it for at least a few minutes).
So that's 25 ThreadsPerChild + 3 builtin threads (door server, door
client, reaper) = 28, so yeah, it stabilized to the number of simultaneous
requests I expect to handle (aka the number of worker threads).
-aaron
Re: [proposal] apr_thread_setconcurrency()
Posted by Ian Holsman <ia...@cnet.com>.
On Sun, 2001-09-16 at 20:13, Justin Erenkrantz wrote:
> On Sun, Sep 16, 2001 at 07:59:19PM -0700, Aaron Bannert wrote:
> > I don't think it's a quirk of the thread library, I think it's
> > fully expected. For the sake of others, here's an excerpt from the
> > Solaris 8 pthread_setconcurrency(3THR) man page:
>
> In testlockperf, you are assuming that all of the threads have
> started and will compete for the locks. In a M:N implementation,
> this assumption is false. You end up executing in serial rather
> than in parallel. This only occurs because you never hit a
> user-scheduler entry point in testlockperf. In the case of a MPM,
> you will be hitting them left and right. =-)
>
> Therefore, you need to devise a strategy within testlockperf to
> ensure that all of the threads are ready to compete before
> continuing the test. The suggested sleep is one way - condition
> variables *may* be possible, but it isn't completely obvious to
> me how that would work. -- justin
>
> P.S. If you are running a site where you get 50,000 hits a minute,
> you shouldn't have MRPC at 10,000. I'd be curious to see what
> cnet runs with.
on our heaviest day (the bombing) we were getting ~7,500 HTML pages
a minute. assuming ~6 images per page we got ~50,000 hits a minute.
(on a single machine)
this wasn't a normal day, we don't normally do THAT much traffic.
we currently have Max Requests Per Child set at '512' on our 1.3
servers, mainly due to memory leaks.
..ian
--
Ian Holsman IanH@cnet.com
Performance Measurement & Analysis
CNET Networks - (415) 364-8608
Re: [proposal] apr_thread_setconcurrency()
Posted by Justin Erenkrantz <je...@ebuilt.com>.
On Sun, Sep 16, 2001 at 08:30:15PM -0700, Aaron Bannert wrote:
> Agreed, but instead of adding sleep we should:
> a) call pthread_setconcurrency()
> b) devise a more life-like test
> c) not do anything cause it's working fine
>
> testlockperf is really just trying to gauge the overhead from the
> mutex routines, and I think it does a very good job of that. The secondary
> purpose of testlockperf is to compare the old locking API to the new
> one.
Without enforcing the lock routines to be run in parallel, you aren't
testing the expected common case - therefore, it isn't a good test.
Yes, you could call pthread_setconcurrency(), but I think you are
going to misjudge the appropriate number to pass to it (as I think
there is no number that makes sense for all cases). If you really
want pthread_setconcurrency to equal the number of threads, you want
to enforce a bound thread implementation (which is different than
creating a thread as bound with a multiplexed thread implementation).
At this point, we should both shut up and get some numbers. -- justin
Re: [proposal] apr_thread_setconcurrency()
Posted by Aaron Bannert <aa...@clove.org>.
On Sun, Sep 16, 2001 at 08:13:25PM -0700, Justin Erenkrantz wrote:
> On Sun, Sep 16, 2001 at 07:59:19PM -0700, Aaron Bannert wrote:
> > I don't think it's a quirk of the thread library, I think it's
> > fully expected. For the sake of others, here's an excerpt from the
> > Solaris 8 pthread_setconcurrency(3THR) man page:
>
> In testlockperf, you are assuming that all of the threads have
> started and will compete for the locks. In a M:N implementation,
> this assumption is false. You end up executing in serial rather
> than in parallel. This only occurs because you never hit a
> user-scheduler entry point in testlockperf. In the case of a MPM,
> you will be hitting them left and right. =-)
>
> Therefore, you need to devise a strategy within testlockperf to
> ensure that all of the threads are ready to compete before
> continuing the test. The suggested sleep is one way - condition
> variables *may* be possible, but it isn't completely obvious to
> me how that would work. -- justin
Agreed, but instead of adding sleep we should:
a) call pthread_setconcurrency()
b) devise a more life-like test
c) not do anything cause it's working fine
testlockperf is really just trying to gauge the overhead from the
mutex routines, and I think it does a very good job of that. The secondary
purpose of testlockperf is to compare the old locking API to the new
one.
> P.S. If you are running a site where you get 50,000 hits a minute,
> you shouldn't have MRPC at 10,000. I'd be curious to see what
> cnet runs with.
You're not going to get 50,000 hits a minute on any box that only has
~32,000 ports and Maximum Segment Lifetime set to anything normal (like
2 minutes). My default Sol8 install can only take down 32k (non-keepalive)
hits in 4 minutes before all the sockets are sitting in TIME_WAIT.
-aaron
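The arithmetic behind that estimate can be written out. This sketch takes the figures in the message at face value and treats the 2-minute value as the TCP maximum segment lifetime (MSL), so that each closed connection holds its port in TIME_WAIT for 2 * MSL. The function name is hypothetical, and the model ignores keepalive and any tuning of the TIME_WAIT interval.

```c
#include <assert.h>

/* Upper bound on new non-keepalive connections per minute if every
 * connection ties up one port for 2 * MSL seconds in TIME_WAIT. */
long max_new_connections_per_minute(long usable_ports, long msl_seconds)
{
    long time_wait_seconds;

    assert(usable_ports > 0 && msl_seconds > 0);
    time_wait_seconds = 2 * msl_seconds;          /* TIME_WAIT lasts 2 * MSL */
    return usable_ports * 60 / time_wait_seconds; /* ports recycled per minute */
}
```

With 32,000 ports and an MSL of 120 seconds this gives 8,000 connections per minute, which is the same as 32k hits in 4 minutes.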
Re: [proposal] apr_thread_setconcurrency()
Posted by Justin Erenkrantz <je...@ebuilt.com>.
On Sun, Sep 16, 2001 at 07:59:19PM -0700, Aaron Bannert wrote:
> I don't think it's a quirk of the thread library, I think it's
> fully expected. For the sake of others, here's an excerpt from the
> Solaris 8 pthread_setconcurrency(3THR) man page:
In testlockperf, you are assuming that all of the threads have
started and will compete for the locks. In a M:N implementation,
this assumption is false. You end up executing in serial rather
than in parallel. This only occurs because you never hit a
user-scheduler entry point in testlockperf. In the case of a MPM,
you will be hitting them left and right. =-)
Therefore, you need to devise a strategy within testlockperf to
ensure that all of the threads are ready to compete before
continuing the test. The suggested sleep is one way - condition
variables *may* be possible, but it isn't completely obvious to
me how that would work. -- justin
P.S. If you are running a site where you get 50,000 hits a minute,
you shouldn't have MRPC at 10,000. I'd be curious to see what
cnet runs with.
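One way a condition variable could serve as the start gate is sketched below. It is an illustration under assumptions, not code from testlockperf: every thread counts itself in under a mutex, and the last arrival broadcasts so that all of them leave the gate together. All names are hypothetical.

```c
#define _XOPEN_SOURCE 700
#include <assert.h>
#include <pthread.h>

#define MAX_THREADS 32

static pthread_mutex_t gate_lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t gate_cond = PTHREAD_COND_INITIALIZER;
static int expected;   /* number of threads taking part */
static int arrived;    /* threads that have reached the gate */
static int released;   /* threads that have passed the gate */

static void *racer(void *arg)
{
    (void)arg;
    pthread_mutex_lock(&gate_lock);
    arrived++;
    if (arrived == expected)
        pthread_cond_broadcast(&gate_cond);   /* last arrival opens the gate */
    while (arrived < expected)                /* loop guards against spurious wakeups */
        pthread_cond_wait(&gate_cond, &gate_lock);
    released++;
    pthread_mutex_unlock(&gate_lock);
    /* The timed lock/unlock loop would start here. */
    return NULL;
}

/* Starts nthreads racers and returns how many passed the gate. */
int run_with_condvar_gate(int nthreads)
{
    pthread_t tid[MAX_THREADS];
    int i;

    assert(nthreads > 0 && nthreads <= MAX_THREADS);
    expected = nthreads;
    arrived = 0;
    released = 0;

    for (i = 0; i < nthreads; i++) {
        int rc = pthread_create(&tid[i], NULL, racer, NULL);
        assert(rc == 0);
    }
    for (i = 0; i < nthreads; i++)
        pthread_join(tid[i], NULL);
    return released;
}
```

Unlike a sleep, this does not depend on guessing how long thread startup takes: no thread proceeds until all of them have actually run far enough to reach the gate.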
Re: [proposal] apr_thread_setconcurrency()
Posted by Aaron Bannert <aa...@clove.org>.
On Sun, Sep 16, 2001 at 12:55:10AM -0700, Justin Erenkrantz wrote:
> I'm saying that it should never be used. Simple. You can't use
> that call properly in any real-world case - just like I don't think
> you should call sched_yield ever. You are attempting to solve a
> problem that is best solved somewhere else - the base operating
> system.
I aim to prove that there are cases where it is useful. I do not
think that sched_yield should be used, but that's a whole different
story (but I do think we should have a thread_yield for the sake
of netware and other totally userspace thread implementations
-- not to stir up the fire any more ;)
> The testlock case doesn't matter because it never hits any of the
> Solaris-defined entry points. This is a quirk in the OS and I see
> no reason to work around it. If you want to make testlock do the
> right thing with the Solaris LWP model, use a reader/writer lock
> to synchronize the starting of the threads. This way you guarantee
> that all threads are started before you start execution of the
> tight exclusive loop (which is something that testlock doesn't do
> now). You are assuming that the threads are created in parallel -
> nowhere is that ordering guaranteed.
I don't think it's a quirk of the thread library, I think it's
fully expected. For the sake of others, here's an excerpt from the
Solaris 8 pthread_setconcurrency(3THR) man page:
DESCRIPTION
Unbound threads in a process may or may not be required to
be simultaneously active. By default, the threads implemen-
tation ensures that a sufficient number of threads are
active so that the process can continue to make progress.
While this conserves system resources, it may not produce
the most effective level of concurrency.
The pthread_setconcurrency() function allows an application
to inform the threads implementation of its desired con-
currency level, new_level. The actual level of concurrency
provided by the implementation as a result of this function
call is unspecified.
...
Although that is a very vague description of the mechanics of this
call, it does make it clear that the initial settings may not
be desired in all cases.
> > In consideration of your statement here I spend some time reading
> > the Solaris 8 libpthread source. On that platform your statement
> > here is false. Calling pthread_setconcurrency (or thr_setconcurrency
> > for that matter) can only change the number of multiplexed LWPs in
> > two ways: either not at all, or by increasing the number. I see
> > no way that it acts as a ceiling.
>
> Yes, you are correct and I was wrong - I reread the Solaris Internals
> book on my flight back to LAX today. It isn't a ceiling. However,
> the case of creating too many LWPs is completely valid and is brought
> up many times in their discussion of LWPs versus a bound thread model.
> Kernel threads are very expensive in Solaris and part of the reason
> that it handles threads well is because it multiplexes the kernel
> threads efficiently. No other OS I have seen handles threads as
> gracefully as Solaris.
Creating too many LWPs may be a problem, and is something I intend
to look into. I do however feel this is something the application
writer is going to have to deal with case-by-case.
In my experiments with setconcurrency I have arrived at some
conclusions (on Solaris 8):
- setconcurrency(0) has no effect on the number of LWPs.
- setconcurrency(n) will create new LWPs if n > current_num_lwps;
otherwise it will have no effect on the number of LWPs.
- if you set it too high, performance will suffer
- if you set it too low, you will either not take advantage of other CPUs,
or you will not see it migrate the load to other CPUs until the "LWP
creation agent" decides it's time to do so.
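The first two observations amount to a "grow only" rule, which can be stated as a tiny model. This is not Solaris code, only a restatement of the behavior described above; the function name is hypothetical, and it assumes the attempt to create the extra LWPs succeeds.

```c
#include <assert.h>

/* Models the observed effect of setconcurrency(n) on the LWP count:
 * the count can rise to n but is never reduced, and n == 0 changes nothing. */
int modeled_lwps_after_hint(int current_lwps, int n)
{
    assert(current_lwps >= 0 && n >= 0);
    if (n > current_lwps)
        return n;              /* new LWPs are created up to n */
    return current_lwps;       /* otherwise the hint is ignored */
}
```

Under this model a hint can never act as a ceiling: a process already running 8 LWPs stays at 8 after a hint of 4.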
> I believe SUSv2 called it a "hint" for the general case. However, in
> this specific implementation (multiplexed kernel threads), it is not
> a hint. It is a request to have that many LWPs. If you disagree
> with that statement, please look at the code again.
I was very clear in my previous message, and I have restated it in the
above statement. I was refuting the comment you made saying it was a
"command" and not a "hint". It is indeed a hint, only that in the
case where you ask for more LWPs than are currently allocated, it
will *attempt* to create more. In _all other cases_ it will simply
ignore the number you give it. It is not a ceiling.
> I pointed out that number (simultaneous requests) is a completely
> bogus number to use when dealing with multiplexed kernel threads.
> This poor choice is why I don't think this call belongs in APR at all.
> If you would care to claim that the number of simultaneous requests is
> the correct number in the context of a multiplexed thread model for
> worker, I would be delighted to hear why - you haven't offered any
> proof as to its validity. I indicated why I thought that number was
> wrong. I'll repeat it again with a bit more of a technical
> explanation.
As I said at the beginning of this thread, I'd like to use this
call in more places than the worker MPM. I am not sure if this
will provide a benefit to the worker MPM, but if it does then
that is a good starting place.
> Creating all user threads as bound (what you are suggesting for
> worker by calling pthread_setconcurrency with that value) in a
> multiplexed thread model works against the thread model rather than
> with it - this indicates a clash in design. You want a bound thread
> library, but refuse to use a bound thread library.
It's actually worse than creating them as bound. In most cases a bound
thread has an early exit point to the system call in the userspace
implementation. Having a pool of LWPs available to a group of userspace
threads means that they have to be assigned. Bound means you get one
LWP forever.
> Ideally, most of worker MPM's time will be spent dealing with I/O, so
> there is no need to have spurious kernel threads when in such a usage
> pattern. Solaris has a number of safeguards that will ensure that any
> runnable thread (kernel or user) will run as quickly as it can and it
> will only create as many kernel threads as are actually dictated by
> the load (if there are really 8 threads ready to run, 8 execution
> contexts will be available).
>
> With "scheduler activations" (Solaris 2.6+), when a user thread is
> about to block and other user threads are waiting to execute, the
> running LWP will pass that unbound (but now blocked) thread off to
> an idle LWP (via doors). If no free LWPs are available (all LWPs
> are blocked or executing), a new LWP is spawned (via SIGWAITING)
> and the now-blocked unbound user thread is transferred.
>
> This blocked user thread will resume via what Solaris calls "user
> thread activation" - shared memory and a door call which indicates to
> the kernel thread when a user thread is ready for execution (i.e.
> needs the LWP active now because whatever blocked it has now been
> unblocked). So as soon as the message is sent, the kernel will
> reschedule the appropriate LWP.
>
> Okay, back to the original LWP that the user thread was on - it has
> time left on its original quantum because its user thread was about
> to end prematurely, it then searches for a waiting unbound thread to
> execute in the remainder of its time.
>
> In the common case of a user thread blocking with a free LWP already
> created, you have saved a kernel context switch (the running LWP
> sticks the user thread in an idle LWP by itself) - this is why this
> M:N implementation can be faster than bound threads. The context
> switch is free and the responsiveness is thus higher. This also
> causes it to create kernel threads as needed.
>
> The entire idea of a multiplexed kernel thread model (such as
> Solaris) is to minimize the number of actual kernel threads and
> increase responsiveness. You would be circumventing that
> decision by creating bound kernel threads that may not be
> actually required due to the actual execution pattern of the code.
> You will also decrease responsiveness because switching between
> threads now becomes a kernel issue rather than a cheap user-space
> issue (which is what Solaris wants to do by default). However,
> you do this in a library that was optimized for multiple
> user-space threads not bound threads.
>
> I believe if you really want a bound thread implementation, you should
> tell the OS you want it - not muck around with an indeterminate API to
> do so that directly circumvents the scheduling/balancing process.
I don't want a bound thread impl, or I would have done that with the
thread attribute at creation time. I want the threads to ramp up fast
and I want them to migrate to other CPUs quickly.
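For reference, requesting a bound thread through the creation attribute looks roughly like this with the POSIX API. It is a sketch: run_bound_thread and mark_ran are hypothetical names, and whether PTHREAD_SCOPE_SYSTEM maps to a dedicated LWP depends on the platform's thread model.

```c
#define _XOPEN_SOURCE 700
#include <assert.h>
#include <pthread.h>

/* Thread body: records that it ran. */
static void *mark_ran(void *arg)
{
    *(int *)arg = 1;
    return NULL;
}

/* Creates one thread with system contention scope (a "bound" thread),
 * waits for it, and returns 0 on success or an errno-style code. */
int run_bound_thread(int *ran)
{
    pthread_attr_t attr;
    pthread_t tid;
    int rc;

    rc = pthread_attr_init(&attr);
    if (rc != 0)
        return rc;

    /* System scope asks for the thread to be scheduled by the kernel directly. */
    rc = pthread_attr_setscope(&attr, PTHREAD_SCOPE_SYSTEM);
    if (rc == 0)
        rc = pthread_create(&tid, &attr, mark_ran, ran);
    pthread_attr_destroy(&attr);
    if (rc != 0)
        return rc;

    return pthread_join(tid, NULL);
}
```

This is a per-thread, permanent choice made at creation time, which is different from hinting at a concurrency level for a pool of unbound threads.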
> > There you go again with this "OS scheduler" thing that I've never heard
> > of. 10 seconds to stabilize is rather long when you consider I have
> > already served O(5000) requests.
>
> You are really attempting to make this a personal argument here by
> attacking me. I think this is completely uncalled for and
> inappropriate.
I apologize for the more snide comments made in my previous message.
They were perhaps inappropriate in this forum. I do however expect
this discussion to narrow in on the facts and come to a rational
conclusion instead of lingering on vague undefined concepts.
> 10 seconds isn't a long time for a server that will be up for months
> or years. And, as you said, you pulled that number (10 seconds) out
> of thin air. If you can substantiate it with real results, please
> provide them. I don't consider a case of a 10 second delay for the
> OS to properly balance itself with a particular thread model an issue.
> And, what is the impact of not having enough LWPs initially? Were
> you testing on a SMP or UP box? What was the type of CPU load that
> was being performed before it was balanced (usr, sys, or iowait)?
Unfortunately it may not be true that the server will be up for months or
years. In the best of cases we can hope for a MaxRequestsPerChild to be
infinite, but the reality is that 3rd party modules (and even httpd)
may leak memory. IIRC, the default MaxRequestsPerChild is 10000.
If it is taking me 5000 requests to reach a steady state, we are spending
half our time trying to ramp up before having to start all over again.
> You also haven't mentioned how many LWPs it stabilized at after
> 10 seconds? Did Solaris choose to add a LWP for each user thread?
> I have a feeling it wouldn't, but I may be wrong. -- justin
I'll follow up this reply with some real numbers.
-aaron
Re: [proposal] apr_thread_setconcurrency()
Posted by Justin Erenkrantz <je...@ebuilt.com>.
On Sat, Sep 15, 2001 at 04:43:39PM -0700, Aaron Bannert wrote:
> > If you create too many LWPs, you will lose a lot of optimizations
> > that are present in Solaris (i.e. handover of a mutex to another
> > thread in the same LWP - as discussed with bpane on dev@httpd
> > recently).
>
> Of course, and that is something the caller needs to take into consideration.
> I'm not forcing you to use it, I just think it needs to be available.
I'm saying that it should never be used. Simple. You can't use
that call properly in any real-world case - just like I don't think
you should call sched_yield ever. You are attempting to solve a
problem that is best solved somewhere else - the base operating
system.
The testlock case doesn't matter because it never hits any of the
Solaris-defined entry points. This is a quirk in the OS and I see
no reason to work around it. If you want to make testlock do the
right thing with the Solaris LWP model, use a reader/writer lock
to synchronize the starting of the threads. This way you guarantee
that all threads are started before you start execution of the
tight exclusive loop (which is something that testlock doesn't do
now). You are assuming that the threads are created in parallel -
nowhere is that ordering guaranteed.
> In consideration of your statement here I spent some time reading
> the Solaris 8 libpthread source. On that platform your statement
> here is false. Calling pthread_setconcurrency (or thr_setconcurrency
> for that matter) can only change the number of multiplexed LWPs in
> two ways: either not at all, or by increasing the number. I see
> no way that it acts as a ceiling.
Yes, you are correct and I was wrong - I reread the Solaris Internals
book on my flight back to LAX today. It isn't a ceiling. However,
the case of creating too many LWPs is completely valid and is brought
up many times in their discussion of LWPs versus a bound thread model.
Kernel threads are very expensive in Solaris and part of the reason
that it handles threads well is because it multiplexes the kernel
threads efficiently. No other OS I have seen handles threads as
gracefully as Solaris.
My guess is that in Solaris 9 they reworked the kernel thread API to
be much faster than before so that it achieves similar
creation/switching/destruction times to the user-space LWP threads.
If they did that, I believe that it then makes sense to switch to
bound threads by default. (I do need to double check that they have
switched to bound threads by default in Solaris 9.)
> > This is not a hint, but a
> > command. (Yes, the man page for Solaris says that it is a hint,
> > but it treats it as a command.)
>
> Sorry, but that's just BS, and I don't know where you get off making such
> bold unfounded statements. Please just go read the source, they match
> the man pages.
I believe SUSv2 called it a "hint" for the general case. However, in
this specific implementation (multiplexed kernel threads), it is not
a hint. It is a request to have that many LWPs. If you disagree
with that statement, please look at the code again.
> > - Let the programmer decide. Awfully bad choice. Who knows
> > how the system is setup? What are you optimizing for?
>
> This is the only choice I proposed, I don't know what the heck you are
> arguing about in these other things. Of course let the programmer decide,
> that's why it's an API!
>
> I just gave you an example where I would use it: in the worker MPM.
> In that case it would be the number of simultaneous requests I expect
> to serve.
I pointed out that number (simultaneous requests) is a completely
bogus number to use when dealing with multiplexed kernel threads.
This poor choice is why I don't think this call belongs in APR at all.
If you would care to claim that the number of simultaneous requests is
the correct number in the context of a multiplexed thread model for
worker, I would be delighted to hear why - you haven't offered any
proof as to its validity. I indicated why I thought that number was
wrong. I'll repeat it again with a bit more of a technical
explanation.
Creating all user threads as bound (what you are suggesting for
worker by calling pthread_setconcurrency with that value) in a
multiplexed thread model works against the thread model rather than
with it - this indicates a clash in design. You want a bound thread
library, but refuse to use a bound thread library.
Ideally, most of worker MPM's time will be spent dealing with I/O, so
there is no need to have spurious kernel threads when in such a usage
pattern. Solaris has a number of safeguards that will ensure that any
runnable thread (kernel or user) will run as quickly as it can and it
will only create as many kernel threads as are actually dictated by
the load (if there are really 8 threads ready to run, 8 execution
contexts will be available).
With "scheduler activations" (Solaris 2.6+), when a user thread is
about to block and other user threads are waiting to execute, the
running LWP will pass that unbound (but now blocked) thread off to
an idle LWP (via doors). If no free LWPs are available (all LWPs
are blocked or executing), a new LWP is spawned (via SIGWAITING)
and the now-blocked unbound user thread is transferred.
This blocked user thread will resume via what Solaris calls "user
thread activation" - shared memory and a door call which indicates to
the kernel thread when a user thread is ready for execution (i.e.
needs the LWP active now because whatever blocked it has now been
unblocked). So as soon as the message is sent, the kernel will
reschedule the appropriate LWP.
Okay, back to the original LWP that the user thread was on - it has
time left on its original quantum because its user thread was about
to end prematurely, it then searches for a waiting unbound thread to
execute in the remainder of its time.
In the common case of a user thread blocking with a free LWP already
created, you have saved a kernel context switch (the running LWP
sticks the user thread in an idle LWP by itself) - this is why this
M:N implementation can be faster than bound threads. The context
switch is free and the responsiveness is thus higher. This also
causes it to create kernel threads as needed.
The entire idea of a multiplexed kernel thread model (such as
Solaris) is to minimize the number of actual kernel threads and
increase responsiveness. You would be circumventing that
decision by creating bound kernel threads that may not be
actually required due to the actual execution pattern of the code.
You will also decrease responsiveness because switching between
threads now becomes a kernel issue rather than a cheap user-space
issue (which is what Solaris wants to do by default). However,
you do this in a library that was optimized for multiple
user-space threads not bound threads.
I believe if you really want a bound thread implementation, you should
tell the OS you want it - not muck around with an indeterminate API to
do so that directly circumvents the scheduling/balancing process.
> There you go again with this "OS scheduler" thing that I've never heard
> of. 10 seconds to stabilize is rather long when you consider I have
> already served O(5000) requests.
You are really attempting to make this a personal argument here by
attacking me. I think this is completely uncalled for and
inappropriate.
10 seconds isn't a long time for a server that will be up for months
or years. And, as you said, you pulled that number (10 seconds) out
of thin air. If you can substantiate it with real results, please
provide them. I don't consider a case of a 10 second delay for the
OS to properly balance itself with a particular thread model an issue.
And, what is the impact of not having enough LWPs initially? Were
you testing on a SMP or UP box? What was the type of CPU load that
was being performed before it was balanced (usr, sys, or iowait)?
You also haven't mentioned how many LWPs it stabilized at after
10 seconds. Did Solaris choose to add a LWP for each user thread?
I have a feeling it wouldn't, but I may be wrong. -- justin
Re: [proposal] apr_thread_setconcurrency()
Posted by Aaron Bannert <aa...@clove.org>.
On Fri, Sep 14, 2001 at 06:33:47PM -0700, Justin Erenkrantz wrote:
> On Fri, Sep 14, 2001 at 04:21:51PM -0700, Aaron Bannert wrote:
> > Why would this circumvent the OS scheduler at all? In all cases it
> > is a *hint*. Please be more precise.
> >
> > I think I showed you an example awhile ago where compute-bound threads
> > behave drastically different depending on the operating system. In
> > the case of solaris, a computationally intensive thread that makes no
> > system calls* will not automatically yield a mutex when entering/exiting
> > a critical section, unless pthread_setconcurrency() is called.
>
> That statement isn't necessarily correct. What actually happens
> is that the user scheduler in Solaris never gets executed because
> none of the entry points as defined by the OS (i.e. system calls)
> get executed to trigger the user scheduler's activity during the
> compute-bound function call. It isn't that it doesn't yield the
> mutex - it is that there is no other thread to yield to as the
> scheduler on Solaris gives the thread a chance to run *before*
> launching the next thread. This is a conscious decision on Sun's
> part when designing their scheduler for Solaris (up to but not
> including 9).
This may be a more accurate description, but I think we're talking
about the same thing here.
> If you create too many LWPs, you will lose a lot of optimizations
> that are present in Solaris (i.e. handover of a mutex to another
> thread in the same LWP - as discussed with bpane on dev@httpd
> recently).
Of course, and that is something the caller needs to take into consideration.
I'm not forcing you to use it, I just think it needs to be available.
> If you don't create enough LWPs, you may enter a
> condition where the scheduler refuses to balance the processes
> correctly (it also acts as a ceiling).
In consideration of your statement here I spent some time reading
the Solaris 8 libpthread source. On that platform your statement
here is false. Calling pthread_setconcurrency (or thr_setconcurrency
for that matter) can only change the number of multiplexed LWPs in
two ways: either not at all, or by increasing the number. I see
no way that it acts as a ceiling.
For the curious, the code in question begins at:
/usr/src/lib/libthread/common/thread.c:332
(pthread_* is built on top of thr_*)
> 0 lets the OS determine
> the concurrency (on Solaris).
Again, on Solaris 8, calling pthread_setconcurrency(0) has absolutely
no effect on the number of LWPs (which, I might add, is what it states
on the man page).
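For anyone who wants to check what a given platform does with the hint, a small probe along these lines may help. It is a sketch only: hint_and_read_back is a hypothetical name, and the value returned by pthread_getconcurrency() says nothing about how many LWPs exist. It only reports the level last requested, or 0 if none was.

```c
#define _XOPEN_SOURCE 600   /* exposes pthread_{set,get}concurrency on glibc */
#include <assert.h>
#include <errno.h>
#include <pthread.h>
#include <stddef.h>

/*
 * Hypothetical probe: pass a concurrency hint to the threads library and
 * read back the level it now reports.
 * Returns 0 on success or the error code from pthread_setconcurrency()
 * (EINVAL for a negative level, for example).
 */
int hint_and_read_back(int level, int *reported)
{
    int rv;

    assert(reported != NULL);

    rv = pthread_setconcurrency(level);
    if (rv != 0)
        return rv;

    /* 0 here means "no level requested; the implementation decides". */
    *reported = pthread_getconcurrency();
    return 0;
}
```

How many LWPs result from a non-zero hint is implementation-specific, so any measurement has to come from the platform's own tools rather than from this return value.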
> By setting a value, you are attempting to circumvent the OS
> scheduler. If you ask it to set the concurrency on Solaris, it
> *will* create enough LWPs to equal that concurrency (as you
> create threads to be paired with LWPs).
"I do not think that means what you think it means."
I still don't know what you mean by "circumvent the OS scheduler",
but the second statement is correct, and my point is that is exactly
what I want it to do.
> This is not a hint, but a
> command. (Yes, the man page for Solaris says that it is a hint,
> but it treats it as a command.)
Sorry, but that's just BS, and I don't know where you get off making such
bold unfounded statements. Please just go read the source, they match
the man pages.
> Talking about other OSes besides Solaris is moot because they don't
> implement a M*N scheduling strategy. With a bound thread
> implementation, pthread_setconcurrency is a no-op (what else can
> it do?). It can only be effective in the case of a LWP-like
> (multiplexing a kernel thread) scheduling strategy.
Great. This is what I said at the start of this thread. So do you have
a good reason to keep it out of APR or not?
> Furthermore, I think that any values that you may pass into
> pthread_setconcurrency are inherently wrong. What values will
> you use to set this? The number of threads? The number of CPUs?
> Let the programmer decide? Let the user decide? IMHO, all of these
> are bad choices:
[self-fulfilling answers omitted]
> - Let the programmer decide. Awfully bad choice. Who knows
> how the system is setup? What are you optimizing for?
This is the only choice I proposed, I don't know what the heck you are
arguing about in these other things. Of course let the programmer decide,
that's why it's an API!
I just gave you an example where I would use it: in the worker MPM.
In that case it would be the number of simultaneous requests I expect
to serve.
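To make the proposal concrete, here is a rough sketch of what such a wrapper could look like. Everything in it is an assumption for illustration: the function names, the int return type, and the HAVE_PTHREAD_SETCONCURRENCY macro (imagined as something configure would define) are hypothetical and are not taken from APR or from a submitted patch.

```c
/* HAVE_PTHREAD_SETCONCURRENCY is a hypothetical configure-time macro. */
#ifdef HAVE_PTHREAD_SETCONCURRENCY
#define _XOPEN_SOURCE 600   /* exposes pthread_setconcurrency on glibc */
#endif

#include <assert.h>

#ifdef HAVE_PTHREAD_SETCONCURRENCY
#include <pthread.h>
#endif

/*
 * Hypothetical wrapper: forward the hint where the platform has
 * pthread_setconcurrency(), otherwise do nothing and report success.
 */
int sketch_thread_setconcurrency(int new_level)
{
#ifdef HAVE_PTHREAD_SETCONCURRENCY
    return pthread_setconcurrency(new_level);
#else
    (void)new_level;    /* no-op on platforms without the call */
    return 0;
#endif
}

/*
 * Hypothetical caller in the spirit of the worker MPM example: hint the
 * number of simultaneous requests the process expects to serve.
 */
int sketch_hint_for_requests(int expected_requests)
{
    assert(expected_requests >= 0);  /* a negative count is a caller bug */
    return sketch_thread_setconcurrency(expected_requests);
}
```

Because the fallback is a no-op, callers on single-level thread libraries pay nothing for the call, which matches the "simple noop" behaviour described in the proposal.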
> So, what do I think the correct solution is? Let the OS decide
> (exactly what it does now). The OS has access to much better
> information to make these decisions (i.e. load averages, I/O wait,
> other processes, num CPUs, etc.). The goal of the OS is to balance
> competing processes. Circumventing the OS scheduler by forcing it
> to create too many or too few LWPs is the wrong thing.
Oh, just stop that. You can't keep saying "circumventing the OS scheduler"
when that doesn't mean anything! You surely don't mean "somehow getting
around the process scheduler", so just quit it! We're not hacking the
kernel here, we're using fully published POSIX APIs!
> The case of a compute-bound thread merely falls into a specific
> trap on a specific OS with a specific thread model. This case
> is typically evident in benchmarks not the real-world. Most
> applications will enter a system call at *some* point.
>
> > In a practical sense, when I was playing with the worker MPM I noticed
> > that under high load (maxing out the CPU) it took on the order of 10
> > seconds** for the number of LWPs to stabilize.
>
> I'll live with that - this is due to inherent OS scheduler
> characteristics. After 10 seconds, the system stabilizes - the
> OS has performed its job. Is there any evidence that this value
> that it stabilized at is incorrect? What formula would you have
> used to set that number? Any "hint" that we may give it may end
> up back-firing rather than helping.
There you go again with this "OS scheduler" thing that I've never heard
of. 10 seconds to stabilize is rather long when you consider I have
already served O(5000) requests.
> In fact, the best solution may be to provide a configure-time
> option to help the user select the "right" thread model on
> Solaris (i.e. /usr/lib/libthread.so or /usr/lib/lwp/libthread.so).
> You can recommend using the "alternative" thread model for
> certain types of compute-bound applications. (However, be careful
> on Solaris 9 as they are reversed.) -- justin
The configure-time option you're talking about is LDFLAGS. You can
also do it at runtime with LD_PRELOAD.
-aaron
Re: [proposal] apr_thread_setconcurrency()
Posted by Justin Erenkrantz <je...@ebuilt.com>.
On Fri, Sep 14, 2001 at 04:21:51PM -0700, Aaron Bannert wrote:
> Why would this circumvent the OS scheduler at all? In all cases it
> is a *hint*. Please be more precise.
>
> I think I showed you an example awhile ago where compute-bound threads
> behave drastically different depending on the operating system. In
> the case of solaris, a computationally intensive thread that makes no
> system calls* will not automatically yield a mutex when entering/exiting
> a critical section, unless pthread_setconcurrency() is called.
That statement isn't necessarily correct. What actually happens
is that the user scheduler in Solaris never gets executed because
none of the entry points as defined by the OS (i.e. system calls)
get executed to trigger the user scheduler's activity during the
compute-bound function call. It isn't that it doesn't yield the
mutex - it is that there is no other thread to yield to as the
scheduler on Solaris gives the thread a chance to run *before*
launching the next thread. This is a conscious decision on Sun's
part when designing their scheduler for Solaris (up to but not
including 9).
If you create too many LWPs, you will lose a lot of optimizations
that are present in Solaris (i.e. handover of a mutex to another
thread in the same LWP - as discussed with bpane on dev@httpd
recently). If you don't create enough LWPs, you may enter a
condition where the scheduler refuses to balance the processes
correctly (it also acts as a ceiling). 0 lets the OS determine
the concurrency (on Solaris).
By setting a value, you are attempting to circumvent the OS
scheduler. If you ask it to set the concurrency on Solaris, it
*will* create enough LWPs to equal that concurrency (as you
create threads to be paired with LWPs). This is not a hint, but a
command. (Yes, the man page for Solaris says that it is a hint,
but it treats it as a command.)
Talking about other OSes besides Solaris is moot because they don't
implement a M*N scheduling strategy. With a bound thread
implementation, pthread_setconcurrency is a no-op (what else can
it do?). It can only be effective in the case of a LWP-like
(multiplexing a kernel thread) scheduling strategy.
Furthermore, I think that any values that you may pass into
pthread_setconcurrency are inherently wrong. What values will
you use to set this? The number of threads? The number of CPUs?
Let the programmer decide? Let the user decide? IMHO, all of these
are bad choices:
- Use number of threads. When concerning ourselves with the
Solaris M*N scheduler, this is horrific because we have now
lost the optimizations and may have created too many LWPs. When you
use a bound thread library on Solaris, the overhead of the (now)
useless optimizations doesn't occur. So, if you want to use the number
of threads on Solaris, use the bound thread library instead of the
LWP thread library. This obviates the need for pthread_setconcurrency,
since by definition all threads are kernel threads.
- Use number of CPUs. How would you get this number? Also, it
is a bit of a red herring: it is not a good number because
your application may be sharing resources with other processes.
If you are primarily I/O-bound, you have just created too many
LWPs and have to incur their overhead because most of the time
the threads are going to be idle waiting for IO.
- Let the programmer decide. Awfully bad choice. Who knows
how the system is setup? What are you optimizing for?
- Let the user decide via a configuration option (like a MPM
directive). I don't think that we can expect the user to
fully understand the meaning of this value. More often than not,
they may set it to either of the wrong values described above.
So, what do I think the correct solution is? Let the OS decide
(exactly what it does now). The OS has access to much better
information to make these decisions (i.e. load averages, I/O wait,
other processes, num CPUs, etc.). The goal of the OS is to balance
competing processes. Circumventing the OS scheduler by forcing it
to create too many or too few LWPs is the wrong thing.
The case of a compute-bound thread merely falls into a specific
trap on a specific OS with a specific thread model. This case
is typically evident in benchmarks not the real-world. Most
applications will enter a system call at *some* point.
> In a practical sense, when I was playing with the worker MPM I noticed
> that under high load (maxing out the CPU) it took on the order of 10
> seconds** for the number of LWPs to stabilize.
I'll live with that - this is due to inherent OS scheduler
characteristics. After 10 seconds, the system stabilizes - the
OS has performed its job. Is there any evidence that this value
that it stabilized at is incorrect? What formula would you have
used to set that number? Any "hint" that we may give it may end
up back-firing rather than helping.
In fact, the best solution may be to provide a configure-time
option to help the user select the "right" thread model on
Solaris (i.e. /usr/lib/libthread.so or /usr/lib/lwp/libthread.so).
You can recommend using the "alternative" thread model for
certain types of compute-bound applications. (However, be careful
on Solaris 9 as they are reversed.) -- justin
Re: [proposal] apr_thread_setconcurrency()
Posted by Aaron Bannert <aa...@clove.org>.
On Fri, Sep 14, 2001 at 03:49:59PM -0700, Justin Erenkrantz wrote:
> On Fri, Sep 14, 2001 at 03:44:48PM -0700, Aaron Bannert wrote:
> > I'd like to propose we add a call that gives a hint to the OS as to
> > the level of concurrency we wish to have. This would mirror
> > pthread_setconcurrency(), and would be a simple call to that on
> > operating systems that have it available. On other platforms it
> > would be simple noop.
>
> The problem with this is that we are going to circumvent the OS
> scheduler which I think is a bad idea - unless we can show where the
> OS falls down on the job (except in the pedantic case of testthread).
> -- justin
Why would this circumvent the OS scheduler at all? In all cases it
is a *hint*. Please be more precise.
I think I showed you an example awhile ago where compute-bound threads
behave drastically different depending on the operating system. In
the case of solaris, a computationally intensive thread that makes no
system calls* will not automatically yield a mutex when entering/exiting
a critical section, unless pthread_setconcurrency() is called.
In a practical sense, when I was playing with the worker MPM I noticed
that under high load (maxing out the CPU) it took on the order of 10
seconds** for the number of LWPs to stabilize.
*Really, these are calls that check the userspace run-queue.
**A number I pulled out of my arse...it took awhile at least. Once it
stabilized, the system could handle the load with a few extra cycles to spare.
I'm not sure if it's significant for the worker MPM, but I can show cases
where it is significant.
-aaron
Re: [proposal] apr_thread_setconcurrency()
Posted by Justin Erenkrantz <je...@ebuilt.com>.
On Fri, Sep 14, 2001 at 03:44:48PM -0700, Aaron Bannert wrote:
> I'd like to propose we add a call that gives a hint to the OS as to
> the level of concurrency we wish to have. This would mirror
> pthread_setconcurrency(), and would be a simple call to that on
> operating systems that have it available. On other platforms it
> would be simple noop.
The problem with this is that we are going to circumvent the OS
scheduler which I think is a bad idea - unless we can show where the
OS falls down on the job (except in the pedantic case of testthread).
-- justin
Re: [proposal] apr_thread_setconcurrency()
Posted by Ryan Bloom <rb...@covalent.net>.
On Friday 14 September 2001 03:44 pm, Aaron Bannert wrote:
+1
Ryan
> I'd like to propose we add a call that gives a hint to the OS as to
> the level of concurrency we wish to have. This would mirror
> pthread_setconcurrency(), and would be a simple call to that on
> operating systems that have it available. On other platforms it
> would be simple noop.
>
> Give me some +1s and I'll submit a patch.
>
> -aaron
--
______________________________________________________________
Ryan Bloom rbb@apache.org
Covalent Technologies rbb@covalent.net
--------------------------------------------------------------
Re: Solaris 8 and 9 thread libraries was Re: [proposal] apr_thread_setconcurrency()
Posted by Brian Pane <bp...@pacbell.net>.
Justin Erenkrantz wrote:
>On Sun, Sep 16, 2001 at 04:12:58PM -0700, Justin Erenkrantz wrote:
>
>>Yup. More precisely /usr/lib/lwp/libthread.so is the "alternate"
>>version and /usr/lib/libthread.so is the "default" version. They
>>are binary compatible (as far as we care) - therefore the
>>LD_LIBRARY_PATH trick works. With Solaris 8 (first one to have
>>this alternate version), the default is to use LWPs and the
>>alternate is bound threads. AFAIK, Solaris 9 switches them.
>>
>
>% uname -srvm
>SunOS 5.9 Beta sun4u
>% ls -l /usr/lib/lwp/libthread.so.1 /usr/lib/libthread.so.1
>-rwxr-xr-x 1 root bin 129168 Jun 20 10:40 /usr/lib/libthread.so.1
>lrwxrwxrwx 1 root root 17 Aug 29 16:41 /usr/lib/lwp/libthread.so.1 -> ../libthread.so.1
>
>% uname -srvm
>SunOS 5.8 Generic_108529-09 i86pc
>jerenkrantz@boris% ls -l /usr/lib/lwp/libthread.so.1 /usr/lib/libthread.so.1
>-rwxr-xr-x 1 root bin 170724 Jan 24 2001 /usr/lib/libthread.so.1
>-rwxr-xr-x 1 root bin 108620 Feb 22 2001 /usr/lib/lwp/libthread.so.1
>
>As you can see, the alternate threading library on Solaris 9 just points
>to /usr/lib/libthread.so.1. Based on what I can see, LWPs are still
>present, but the performance characteristics and functions executed
>*look* like those of a bound thread implementation. I don't have access
>
That sounds like what I'd expect, based on earlier descriptions of how
Solaris 9 would have a single-layer thread model by default: presumably
they kept the LWP architecture (which seems rather an integral part of
the kernel's process management) and modified the thread and pthread libs
to make user-layer threads automatically bound to LWPs.
--Brian
Solaris 8 and 9 thread libraries was Re: [proposal] apr_thread_setconcurrency()
Posted by Justin Erenkrantz <je...@ebuilt.com>.
On Sun, Sep 16, 2001 at 04:12:58PM -0700, Justin Erenkrantz wrote:
> Yup. More precisely /usr/lib/lwp/libthread.so is the "alternate"
> version and /usr/lib/libthread.so is the "default" version. They
> are binary compatible (as far as we care) - therefore the
> LD_LIBRARY_PATH trick works. With Solaris 8 (first one to have
> this alternate version), the default is to use LWPs and the
> alternate is bound threads. AFAIK, Solaris 9 switches them.
% uname -srvm
SunOS 5.9 Beta sun4u
% ls -l /usr/lib/lwp/libthread.so.1 /usr/lib/libthread.so.1
-rwxr-xr-x 1 root bin 129168 Jun 20 10:40 /usr/lib/libthread.so.1
lrwxrwxrwx 1 root root 17 Aug 29 16:41 /usr/lib/lwp/libthread.so.1 -> ../libthread.so.1
% uname -srvm
SunOS 5.8 Generic_108529-09 i86pc
jerenkrantz@boris% ls -l /usr/lib/lwp/libthread.so.1 /usr/lib/libthread.so.1
-rwxr-xr-x 1 root bin 170724 Jan 24 2001 /usr/lib/libthread.so.1
-rwxr-xr-x 1 root bin 108620 Feb 22 2001 /usr/lib/lwp/libthread.so.1
As you can see, the alternate threading library on Solaris 9 just points
to /usr/lib/libthread.so.1. Based on what I can see, LWPs are still
present, but the performance characteristics and functions executed
*look* like those of a bound thread implementation. I don't have access
to the source code, so I have no clue what is going on. -- justin
Re: AIX M:N threads? was Re: [proposal] apr_thread_setconcurrency()
Posted by "Victor J. Orlikowski" <vj...@dulug.duke.edu>.
On Sunday, 16 Sep 2001, at 20:05:00,
Aaron Bannert wrote:
> I wouldn't be surprised if it did the same thing as Solaris on
> testlockperf. It remains to be seen, however, how the libs perform
> on Solaris/AIX under real-world scenarios (worker MPM) w/ and w/o
> setconcurrency, and on Uni and Multiprocessor machines.
Easy enough to test (w/out playing too much in the code) on AIX.
Check the AIXTHREAD_MNRATIO environment variable.
Victor
--
Victor J. Orlikowski | The Wall is Down, But the Threat Remains!
==================================================================
orlikowski@apache.org | vjo@dulug.duke.edu | vjo@us.ibm.com
Re: AIX M:N threads? was Re: [proposal] apr_thread_setconcurrency()
Posted by Aaron Bannert <aa...@clove.org>.
On Sun, Sep 16, 2001 at 07:44:58PM -0700, Justin Erenkrantz wrote:
> On Sun, Sep 16, 2001 at 07:11:41PM -0700, Aaron Bannert wrote:
> > The only platforms that I know about that have a two-level thread model
> > are AIX and Solaris. The single-level thread libs ignore setconcurrency
> > because every thread is what solaris calls a "bound thread", or a kernel
> > scheduled entity (it gets its own process slot). The only exceptions
> > to this rule are fully userspace thread libs, where setconcurrency is
> > inherently maximized at 1.
>
> Oh, crap, you're right. AIX has M:N threads by default in 4.3.1+.
> (Isn't it funny that IBM is adopting something that Sun is ditching?)
>
> Okay, so, what does setconcurrency do on AIX? How does testlockperf
> work on MP AIX boxes? I bet it'd do the same bad things as Solaris
> does. But, I know nothing about AIX. -- justin
I wouldn't be surprised if it did the same thing as Solaris on
testlockperf. It remains to be seen, however, how the libs perform
on Solaris/AIX under real-world scenarios (worker MPM) w/ and w/o
setconcurrency, and on Uni and Multiprocessor machines.
Whew that's a lot of variables...
Someone want to send me an AIX SMP box? ;)
-aaron
AIX M:N threads? was Re: [proposal] apr_thread_setconcurrency()
Posted by Justin Erenkrantz <je...@ebuilt.com>.
On Sun, Sep 16, 2001 at 07:11:41PM -0700, Aaron Bannert wrote:
> The only platforms that I know about that have a two-level thread model
> are AIX and Solaris. The single-level thread libs ignore setconcurrency
> because every thread is what solaris calls a "bound thread", or a kernel
> scheduled entity (it gets its own process slot). The only exceptions
> to this rule are fully userspace thread libs, where setconcurrency is
> inherently maximized at 1.
Oh, crap, you're right. AIX has M:N threads by default in 4.3.1+.
(Isn't it funny that IBM is adopting something that Sun is ditching?)
Okay, so, what does setconcurrency do on AIX? How does testlockperf
work on MP AIX boxes? I bet it'd do the same bad things as Solaris
does. But, I know nothing about AIX. -- justin
Re: [proposal] apr_thread_setconcurrency()
Posted by Aaron Bannert <aa...@clove.org>.
On Sun, Sep 16, 2001 at 04:12:58PM -0700, Justin Erenkrantz wrote:
> > But of course that case is not terribly relevant for something like
> > httpd-2.0 on a big SMP box, where the optimal case (of which there are
> > many dimensions) cannot be known to the underlying thread/LWP creation
> > agent. That is the key issue at hand here. We, as _users_ of this API
> > would like to maximize each of {requests/second, time/request, number of
> > simultaneous connections} where the LWP creation agent is just trying to
> > get the work done with the least amount of context switching. The dials it
> > has to play with are numerous, and so it must perform a delicate linear
> > programming task in an attempt to meet the same goals as the application
> > programmer. I don't claim that setconcurrency is the way to reduce the
> > number of variables in this equation, but I do suggest we may want to
> > take this into consideration when trying to make our threaded algorithms
> > work the way we expect them to.
>
> I just don't think it is going to get us what you want. I think
> the net result with setconcurrency on Solaris with LWPs is to
> circumvent its balancing algorithms so that it creates too many
> LWPs. I think this is the wrong way to attack this problem and
> goes against the design of their thread library. On all other
> platforms (and with bound thread impl on Solaris), setconcurrency
> is an ignored hint. -- justin
The only platforms that I know about that have a two-level thread model
are AIX and Solaris. The single-level thread libs ignore setconcurrency
because every thread is what solaris calls a "bound thread", or a kernel
scheduled entity (it gets its own process slot). The only exceptions
to this rule are fully userspace thread libs, where setconcurrency is
inherently maximized at 1.
-aaron
Re: [proposal] apr_thread_setconcurrency()
Posted by Justin Erenkrantz <je...@ebuilt.com>.
On Sun, Sep 16, 2001 at 03:29:37PM -0700, Aaron Bannert wrote:
> (just a question, when you say "bound thread impl" you mean
> /usr/lib/lwp/libpthread, right? and the "LWP version" is the default
> /usr/lib/libpthread. Let me know if I have this backward. Maybe
> "LWP version" isn't the best name for the two libs, since they both
> do LWPs.)
Yup. More precisely /usr/lib/lwp/libthread.so is the "alternate"
version and /usr/lib/libthread.so is the "default" version. They
are binary compatible (as far as we care) - therefore the
LD_LIBRARY_PATH trick works. With Solaris 8 (first one to have
this alternate version), the default is to use LWPs and the
alternate is bound threads. AFAIK, Solaris 9 switches them.
> I don't really see this as a scheduling quirk on Solaris. I think we
> would all agree that the same "work" is performed in each of the 4 tests,
> with or without sleeps, and with or without setconcurrency. It is also
> obvious that since the same "work" is performed in each case, that the
> obvious winner for this made-up performance test is the one that does
> the work the fastest; which happens to be the one that creates no new
> LWPs and therefore minimizes the number of kernel context switches.
What happens is that when the sleeps are not present, three of the
threads do not have a chance to execute the test function (they are
stuck in the libc thread_start function). They are monopolized by
the one thread that raced to the beginning and started working. The
sleep allows all of the threads to hit the lock acquire within
testlockperf (remember that the code that spawns the threads has the
lock so the threads can't acquire it - they go to sleep). Once the
spawning thread releases the lock, all of the threads now wakeup
and compete. Watch it with gdb and pstack on a MP box. =-)
> But of course that case is not terribly relevant for something like
> httpd-2.0 on a big SMP box, where the optimal case (of which there are
> many dimensions) cannot be known to the underlying thread/LWP creation
> agent. That is the key issue at hand here. We, as _users_ of this API
> would like to maximize each of {requests/second, time/request, number of
> simultaneous connections} where the LWP creation agent is just trying to
> get the work done with the least amount of context switching. The dials it
> has to play with are numerous, and so it must perform a delicate linear
> programming task in an attempt to meet the same goals as the application
> programmer. I don't claim that setconcurrency is the way to reduce the
> number of variables in this equation, but I do suggest we may want to
> take this into consideration when trying to make our threaded algorithms
> work the way we expect them to.
I just don't think it is going to get us what you want. I think
the net result with setconcurrency on Solaris with LWPs is to
circumvent its balancing algorithms so that it creates too many
LWPs. I think this is the wrong way to attack this problem and
goes against the design of their thread library. On all other
platforms (and with bound thread impl on Solaris), setconcurrency
is an ignored hint. -- justin
Re: [proposal] apr_thread_setconcurrency()
Posted by Aaron Bannert <aa...@clove.org>.
On Sun, Sep 16, 2001 at 02:02:39PM -0700, Justin Erenkrantz wrote:
> This is on an 8-way box, right? Those numbers look about right
> for the bound thread implementation. However, the LWP version
> still looks like it isn't doing the right thing.
(just a question, when you say "bound thread impl" you mean
/usr/lib/lwp/libpthread, right? and the "LWP version" is the default
/usr/lib/libpthread. Let me know if I have this backward. Maybe
"LWP version" isn't the best name for the two libs, since they both
do LWPs.)
> I think you probably need this patch for the LWP version.
> I'm not sure whether setconcurrency will produce the same
> effect - it might.
>
> I'm not going to commit this because I know it's the wrong thing
> (should be condition vars I think), but it solves the scheduling
> quirk on Solaris for now. -- justin
I don't really see this as a scheduling quirk on Solaris. I think we
would all agree that the same "work" is performed in each of the 4 tests,
with or without sleeps, and with or without setconcurrency. It is also
obvious that since the same "work" is performed in each case, that the
obvious winner for this made-up performance test is the one that does
the work the fastest; which happens to be the one that creates no new
LWPs and therefore minimizes the number of kernel context switches.
But of course that case is not terribly relevant for something like
httpd-2.0 on a big SMP box, where the optimal case (of which there are
many dimensions) cannot be known to the underlying thread/LWP creation
agent. That is the key issue at hand here. We, as _users_ of this API
would like to maximize each of {requests/second, time/request, number of
simultaneous connections} where the LWP creation agent is just trying to
get the work done with the least amount of context switching. The dials it
has to play with are numerous, and so it must perform a delicate linear
programming task in an attempt to meet the same goals as the application
programmer. I don't claim that setconcurrency is the way to reduce the
number of variables in this equation, but I do suggest we may want to
take this into consideration when trying to make our threaded algorithms
work the way we expect them to.
-aaron
Re: [proposal] apr_thread_setconcurrency()
Posted by Justin Erenkrantz <je...@ebuilt.com>.
On Sun, Sep 16, 2001 at 01:53:42PM -0700, Ian Holsman wrote:
> $ ./testlockperf
> APR Lock Performance Test
> ==============
>
> apr_lock(INTRAPROCESS, MUTEX) Lock Tests
> Initializing the apr_lock_t OK
> Starting all the threads OK
> microseconds: 9634489 usec
> apr_thread_mutex_t Tests
> Initializing the apr_thread_mutex_t OK
> Starting all the threads OK
> microseconds: 7333845 usec
> apr_lock(INTRAPROCESS, READWRITE) Lock Tests
> Initializing the apr_lock_t OK
> microseconds: 11365100 usec
> apr_thread_mutex_t Tests
> Initializing the apr_thread_mutex_t OK
> microseconds: 8443761 usec
>
> $ export LD_LIBRARY_PATH=/usr/lib/lwp/
>
> $ ./testlockperf
> APR Lock Performance Test
> ==============
>
> apr_lock(INTRAPROCESS, MUTEX) Lock Tests
> Initializing the apr_lock_t OK
> Starting all the threads OK
> microseconds: 25322674 usec
> apr_thread_mutex_t Tests
> Initializing the apr_thread_mutex_t OK
> Starting all the threads OK
> microseconds: 23590762 usec
> apr_lock(INTRAPROCESS, READWRITE) Lock Tests
> Initializing the apr_lock_t OK
> microseconds: 23106303 usec
> apr_thread_mutex_t Tests
> Initializing the apr_thread_mutex_t OK
> microseconds: 19515490 usec
This is on an 8-way box, right? Those numbers look about right
for the bound thread implementation. However, the LWP version
still looks like it isn't doing the right thing.
I think you probably need this patch for the LWP version.
I'm not sure whether setconcurrency will produce the same
effect - it might.
I'm not going to commit this because I know it's the wrong thing
(should be condition vars I think), but it solves the scheduling
quirk on Solaris for now. -- justin
Index: testlockperf.c
===================================================================
RCS file: /home/cvs/apr/test/testlockperf.c,v
retrieving revision 1.4
diff -u -r1.4 testlockperf.c
--- testlockperf.c 2001/09/16 17:15:39 1.4
+++ testlockperf.c 2001/09/16 20:47:37
@@ -176,6 +176,7 @@
     }
     printf("OK\n");
+    apr_sleep(10000);
     time_start = apr_time_now();
     apr_lock_release(inter_lock);
@@ -226,6 +227,7 @@
     }
     printf("OK\n");
+    apr_sleep(10000);
     time_start = apr_time_now();
     apr_thread_mutex_unlock(thread_lock);
@@ -275,6 +277,7 @@
         return s1;
     }
+    apr_sleep(10000);
     time_start = apr_time_now();
     apr_lock_release(inter_rwlock);
@@ -323,6 +326,7 @@
         return s1;
     }
+    apr_sleep(10000);
     time_start = apr_time_now();
     apr_thread_rwlock_unlock(thread_rwlock);
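For reference, a minimal sketch of the condition-variable approach mentioned
above, written against plain pthreads rather than APR so that it compiles on
its own. Every name in it (worker, run_gated_test, gate_open, ready_count) is
made up for illustration; none of it comes from testlockperf.c. The workers
check in and then wait at a gate, and the main thread opens the gate only
once all of them are parked, which is what the apr_sleep() calls only
approximate.

```c
#include <assert.h>
#include <pthread.h>

#define NUM_THREADS 4
#define ITERATIONS  10000

/* gate_lock guards ready_count and gate_open. */
static pthread_mutex_t gate_lock  = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  ready_cond = PTHREAD_COND_INITIALIZER;
static pthread_cond_t  gate_cond  = PTHREAD_COND_INITIALIZER;
static int ready_count = 0;   /* workers that have reached the gate */
static int gate_open   = 0;   /* set once the timed section may begin */

static pthread_mutex_t counter_lock = PTHREAD_MUTEX_INITIALIZER;
static long counter = 0;

static void *worker(void *arg)
{
    int i;
    (void)arg;

    /* Check in, then block until the main thread opens the gate. */
    pthread_mutex_lock(&gate_lock);
    ready_count++;
    pthread_cond_signal(&ready_cond);
    while (!gate_open) {
        pthread_cond_wait(&gate_cond, &gate_lock);
    }
    pthread_mutex_unlock(&gate_lock);

    for (i = 0; i < ITERATIONS; i++) {
        pthread_mutex_lock(&counter_lock);
        counter++;
        pthread_mutex_unlock(&counter_lock);
    }
    return NULL;
}

/* Hypothetical driver: runs the gated test once, returns the final counter. */
static long run_gated_test(void)
{
    pthread_t threads[NUM_THREADS];
    int i, rv;

    ready_count = 0;
    gate_open = 0;
    counter = 0;

    for (i = 0; i < NUM_THREADS; i++) {
        rv = pthread_create(&threads[i], NULL, worker, NULL);
        assert(rv == 0);
    }

    /* Wait until every worker is parked at the gate. */
    pthread_mutex_lock(&gate_lock);
    while (ready_count < NUM_THREADS) {
        pthread_cond_wait(&ready_cond, &gate_lock);
    }
    /* A timer would be started here, just before the release. */
    gate_open = 1;
    pthread_cond_broadcast(&gate_cond);
    pthread_mutex_unlock(&gate_lock);

    for (i = 0; i < NUM_THREADS; i++) {
        rv = pthread_join(threads[i], NULL);
        assert(rv == 0);
    }
    (void)rv;
    return counter;
}
```

Whether this removes the Solaris scheduling quirk is untested; it only
replaces the fixed delay with an explicit "everyone is ready" handshake.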
Re: [proposal] apr_thread_setconcurrency()
Posted by Ian Holsman <Ia...@cnet.com>.
On Sun, 2001-09-16 at 11:21, Justin Erenkrantz wrote:
> On Fri, Sep 14, 2001 at 08:16:03PM -0700, Ian Holsman wrote:
> > > Oh...
> > > I ran the testlockperf code on the 8-way box, with
> > > the pthread_setconcurrency calls commented out, and with
> > > the concurrency calls put in (setting them to 8).
> > > results are as follows
>
> Could you rerun this test with LD_LIBRARY_PATH set like:
>
> LD_LIBRARY_PATH=/usr/lib/lwp ./testlockperf
>
> What do you see? My results and comments:
>
> http://www.apache.org/~jerenkrantz/testlockperf.html
>
> Please read it if you have a chance. -- justin
(I commented out the setconcurrency calls)
$ ./testlockperf
APR Lock Performance Test
==============
apr_lock(INTRAPROCESS, MUTEX) Lock Tests
Initializing the apr_lock_t OK
Starting all the threads OK
microseconds: 9634489 usec
apr_thread_mutex_t Tests
Initializing the apr_thread_mutex_t OK
Starting all the threads OK
microseconds: 7333845 usec
apr_lock(INTRAPROCESS, READWRITE) Lock Tests
Initializing the apr_lock_t OK
microseconds: 11365100 usec
apr_thread_mutex_t Tests
Initializing the apr_thread_mutex_t OK
microseconds: 8443761 usec
$ export LD_LIBRARY_PATH=/usr/lib/lwp/
$ ./testlockperf
APR Lock Performance Test
==============
apr_lock(INTRAPROCESS, MUTEX) Lock Tests
Initializing the apr_lock_t OK
Starting all the threads OK
microseconds: 25322674 usec
apr_thread_mutex_t Tests
Initializing the apr_thread_mutex_t OK
Starting all the threads OK
microseconds: 23590762 usec
apr_lock(INTRAPROCESS, READWRITE) Lock Tests
Initializing the apr_lock_t OK
microseconds: 23106303 usec
apr_thread_mutex_t Tests
Initializing the apr_thread_mutex_t OK
microseconds: 19515490 usec
--
Ian Holsman
Performance Measurement & Analysis
CNET Networks - 415 364-8608
Re: [proposal] apr_thread_setconcurrency()
Posted by Justin Erenkrantz <je...@ebuilt.com>.
On Fri, Sep 14, 2001 at 08:16:03PM -0700, Ian Holsman wrote:
> > Oh...
> > I ran the testlockperf code on the 8-way box, with
> > the pthread_setconcurrency calls commented out, and with
> > the concurrency calls put in (setting them to 8).
> > results are as follows
Could you rerun this test with LD_LIBRARY_PATH set like:
LD_LIBRARY_PATH=/usr/lib/lwp ./testlockperf
What do you see? My results and comments:
http://www.apache.org/~jerenkrantz/testlockperf.html
Please read it if you have a chance. -- justin
Re: [proposal] apr_thread_setconcurrency()
Posted by Aaron Bannert <aa...@clove.org>.
On Fri, Sep 14, 2001 at 08:16:03PM -0700, Ian Holsman wrote:
> > +1 IF the number you set it to is a hint, and Solaris can change the
> > concurrency afterwards according to the load on the system/internal
> > guidelines.
This is how it appears to work according to the source. I'll try fooling
with worker and make sure it works the way I'm expecting.
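To make the proposal concrete, here is a rough sketch of what such a wrapper
could look like. It is not an existing APR function: the name
sketch_thread_setconcurrency, the helper reset_concurrency_hint, and the
HAVE_PTHREAD_SETCONCURRENCY macro (assumed to come from a configure check,
defined by hand here) are all placeholders.

```c
#ifndef _XOPEN_SOURCE
#define _XOPEN_SOURCE 600   /* exposes pthread_setconcurrency() on glibc */
#endif
#include <assert.h>
#include <pthread.h>

/* Assumption: a configure-time check would define this macro.
 * It is set by hand so the sketch compiles standalone. */
#define HAVE_PTHREAD_SETCONCURRENCY 1

/* Hypothetical wrapper: passes the hint through where the platform has
 * pthread_setconcurrency(), and is a no-op elsewhere.
 * Returns 0 on success or an errno-style value. */
static int sketch_thread_setconcurrency(int new_level)
{
#if HAVE_PTHREAD_SETCONCURRENCY
    return pthread_setconcurrency(new_level);
#else
    (void)new_level;
    return 0;
#endif
}

/* Level 0 hands the decision back to the threads implementation. */
static void reset_concurrency_hint(void)
{
    int rv = sketch_thread_setconcurrency(0);
    assert(rv == 0);
    (void)rv;
}
```

A caller on the 8-way box would pass 8. What the implementation does with a
non-zero level is unspecified, so the return value is the only thing worth
checking.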
> > I ran the testlockperf code on the 8-way box, with
> > the pthread_setconcurrency calls commented out, and with
> > the concurrency calls put in (setting them to 8).
> > results are as follows
> >
> > (without setconcurrency)
> > APR Lock Performance Test
> > ==============
> >
> > apr_lock(INTRAPROCESS, MUTEX) Lock Tests
> > microseconds: 9373710 usec
> > apr_thread_mutex_t Tests
> > microseconds: 7304314 usec
> > apr_lock(INTRAPROCESS, READWRITE) Lock Tests
> > microseconds: 11247506 usec
> > apr_thread_mutex_t Tests
> > microseconds: 8148914 usec
> >
> > (with pthread_setconcurrency(8) where you put the comments)
> > APR Lock Performance Test
> > ==============
> >
> > apr_lock(INTRAPROCESS, MUTEX) Lock Tests
> > microseconds: 20054346 usec
> > apr_thread_mutex_t Tests
> > microseconds: 16979410 usec
> > apr_lock(INTRAPROCESS, READWRITE) Lock Tests
> microseconds: 247538114 usec
> apr_thread_mutex_t Tests
> microseconds: 250328270 usec
This is a perfect example of what happens when you *don't* set
the concurrency level (on Solaris). What's really happening here
is that the threads are not being interleaved; instead, each one
runs its entire 1-million-iteration loop, mutex locking and unlocking
included, without any concurrency. Try this patch and you'll
see what I mean:
[This patch is for illustrative purposes only, not for CVS]
Index: testlockperf.c
===================================================================
RCS file: /home/cvspublic/apr/test/testlockperf.c,v
retrieving revision 1.2
diff -u -r1.2 testlockperf.c
--- testlockperf.c 2001/09/15 05:23:55 1.2
+++ testlockperf.c 2001/09/15 23:55:28
@@ -103,11 +103,13 @@
 {
     int i;
+    printf("thread %p started\n", thd);
     for (i = 0; i < MAX_COUNTER; i++) {
         apr_lock_acquire(inter_lock);
         mutex_counter++;
         apr_lock_release(inter_lock);
     }
+    printf("thread %p done\n", thd);
     return NULL;
 }
@@ -115,11 +117,13 @@
 {
     int i;
+    printf("thread %p started\n", thd);
     for (i = 0; i < MAX_COUNTER; i++) {
         apr_thread_mutex_lock(thread_lock);
         mutex_counter++;
         apr_thread_mutex_unlock(thread_lock);
     }
+    printf("thread %p done\n", thd);
     return NULL;
 }
@@ -127,11 +131,13 @@
 {
     int i;
+    printf("thread %p started\n", thd);
     for (i = 0; i < MAX_COUNTER; i++) {
         apr_lock_acquire_rw(inter_rwlock, APR_WRITER);
         mutex_counter++;
         apr_lock_release(inter_rwlock);
     }
+    printf("thread %p done\n", thd);
     return NULL;
 }
@@ -139,11 +145,13 @@
 {
     int i;
+    printf("thread %p started\n", thd);
     for (i = 0; i < MAX_COUNTER; i++) {
         apr_thread_rwlock_wrlock(thread_rwlock);
         mutex_counter++;
         apr_thread_rwlock_unlock(thread_rwlock);
     }
+    printf("thread %p done\n", thd);
     return NULL;
 }
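The started/done printfs show the serialization by eye. A rough way to put a
number on it is to count how often the lock changes owner, sketched below in
plain pthreads. All names (worker, count_handoffs, handoffs, last_owner) are
invented for this example and are not in testlockperf.c. If each thread runs
its whole loop back to back, the handoff count stays near the thread count;
if the threads really interleave, it climbs toward the total number of
acquisitions.

```c
#include <assert.h>
#include <pthread.h>

#define NUM_THREADS 4
#define ITERATIONS  100000

static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static long mutex_counter = 0;  /* total acquisitions, guarded by lock */
static long handoffs = 0;       /* times the lock changed owner */
static int  last_owner = -1;    /* id of the previous lock holder */

static void *worker(void *arg)
{
    int id = *(int *)arg;
    int i;

    for (i = 0; i < ITERATIONS; i++) {
        pthread_mutex_lock(&lock);
        mutex_counter++;
        if (last_owner != id) {
            /* Someone else held the lock last: count one handoff. */
            handoffs++;
            last_owner = id;
        }
        pthread_mutex_unlock(&lock);
    }
    return NULL;
}

/* Hypothetical driver: runs all workers and returns the handoff count. */
static long count_handoffs(void)
{
    pthread_t threads[NUM_THREADS];
    int ids[NUM_THREADS];
    int i, rv;

    mutex_counter = 0;
    handoffs = 0;
    last_owner = -1;

    for (i = 0; i < NUM_THREADS; i++) {
        ids[i] = i;
        rv = pthread_create(&threads[i], NULL, worker, &ids[i]);
        assert(rv == 0);
    }
    for (i = 0; i < NUM_THREADS; i++) {
        rv = pthread_join(threads[i], NULL);
        assert(rv == 0);
    }
    (void)rv;
    return handoffs;
}
```

The exact count depends on the scheduler and will vary from run to run; only
the order of magnitude is meaningful.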
-aaron
Re: [proposal] apr_thread_setconcurrency()
Posted by Ian Holsman <ia...@cnet.com>.
Ian Holsman wrote:
> Aaron Bannert wrote:
>
>> I'd like to propose we add a call that gives a hint to the OS as to
>> the level of concurrency we wish to have. This would mirror
>> pthread_setconcurrency(), and would be a simple call to that on
>> operating systems that have it available. On other platforms it
>> would be simple noop.
>>
>> Give me some +1s and I'll submit a patch.
>>
>> -aaron
>>
>
> +1 IF the number you set it to is a hint, and Solaris can change the
> concurrency afterwards according to the load on the system/internal
> guidelines.
>
> Oh...
> I ran the testlockperf code on the 8-way box, with
> the pthread_setconcurrency calls commented out, and with
> the concurrency calls put in (setting them to 8).
> results are as follows
>
> (without setconcurrency)
> APR Lock Performance Test
> ==============
>
> apr_lock(INTRAPROCESS, MUTEX) Lock Tests
> Initializing the apr_lock_t OK
> Starting all the threads OK
> microseconds: 9373710 usec
> apr_thread_mutex_t Tests
> Initializing the apr_thread_mutex_t OK
> Starting all the threads OK
> microseconds: 7304314 usec
> apr_lock(INTRAPROCESS, READWRITE) Lock Tests
> Initializing the apr_lock_t OK
> microseconds: 11247506 usec
> apr_thread_mutex_t Tests
> Initializing the apr_thread_mutex_t OK
>
> microseconds: 8148914 usec
>
> (with pthread_setconcurrency(8) where you put the comments)
> APR Lock Performance Test
> ==============
>
> apr_lock(INTRAPROCESS, MUTEX) Lock Tests
> Initializing the apr_lock_t OK
> Starting all the threads OK
> microseconds: 20054346 usec
> apr_thread_mutex_t Tests
> Initializing the apr_thread_mutex_t OK
> Starting all the threads OK
> microseconds: 16979410 usec
> apr_lock(INTRAPROCESS, READWRITE) Lock Tests
> Initializing the apr_lock_t OK
>
microseconds: 247538114 usec
apr_thread_mutex_t Tests
Initializing the apr_thread_mutex_t OK
microseconds: 250328270 usec
(I didn't wait long enough)
> --It just sits at this point....
> (CVS code is a couple of days old, if that makes a difference)
Re: [proposal] apr_thread_setconcurrency()
Posted by Ian Holsman <ia...@cnet.com>.
Aaron Bannert wrote:
> I'd like to propose we add a call that gives a hint to the OS as to
> the level of concurrency we wish to have. This would mirror
> pthread_setconcurrency(), and would be a simple call to that on
> operating systems that have it available. On other platforms it
> would be simple noop.
>
> Give me some +1s and I'll submit a patch.
>
> -aaron
>
+1 IF the number you set it to is a hint, and Solaris can change the
concurrency afterwards according to the load on the system/internal
guidelines.
Oh...
I ran the testlockperf code on the 8-way box, with
the pthread_setconcurrency calls commented out, and with
the concurrency calls put in (setting them to 8).
results are as follows
(without setconcurrency)
APR Lock Performance Test
==============
apr_lock(INTRAPROCESS, MUTEX) Lock Tests
Initializing the apr_lock_t OK
Starting all the threads OK
microseconds: 9373710 usec
apr_thread_mutex_t Tests
Initializing the apr_thread_mutex_t OK
Starting all the threads OK
microseconds: 7304314 usec
apr_lock(INTRAPROCESS, READWRITE) Lock Tests
Initializing the apr_lock_t OK
microseconds: 11247506 usec
apr_thread_mutex_t Tests
Initializing the apr_thread_mutex_t OK
microseconds: 8148914 usec
(with pthread_setconcurrency(8) where you put the comments)
APR Lock Performance Test
==============
apr_lock(INTRAPROCESS, MUTEX) Lock Tests
Initializing the apr_lock_t OK
Starting all the threads OK
microseconds: 20054346 usec
apr_thread_mutex_t Tests
Initializing the apr_thread_mutex_t OK
Starting all the threads OK
microseconds: 16979410 usec
apr_lock(INTRAPROCESS, READWRITE) Lock Tests
Initializing the apr_lock_t OK
--It just sits at this point....
(CVS code is a couple of days old, if that makes a difference)