Posted to dev@httpd.apache.org by gr...@apache.org on 2002/09/17 18:24:05 UTC
Re: custom linux kernel builds
Ian Holsman wrote:
>
> Hi Greg,
>
> we are about to start into the wild wild world of linux, and I was
> wondering if you have any hints on what patches you would go with for a
> custom kernel to get maximum performance.. stuff like ingo's O(1)
> scheduler and the like..
I'm glad you asked, since I've been looking at scalability issues with Linux
lately. Sorry for the long post - hit "delete" if you aren't interested.
We did some Linux benchmarking in a configuration similar to a reverse proxy
that takes disk file I/O and directory_walk out of the picture. We started with
the 2.0.40 worker MPM on Red Hat 7.2 with a 2.4.9 kernel on a 2-way SMP. We
tried numerous combinations of ThreadsPerChild and process limits such that
there were always a constant 1000 worker threads active in the server, maxed out
the CPUs, and ran oprofile 0.3. We got the best throughput at 200
threads/process. Then I took the oprofile sample counts for every
binary/library that used over 1% of the CPU, broke those down by function, and
then scaled the results to the throughput. The results should be proportional
to CPU cycles per request by function. Here are the heavy hitters:
      threads per process
    20      200      500    binary/library   function name
------   ------   ------    --------------   --------------------
230247    38334    66517    kernel           schedule
  1742    31336    75763    libpthread       __pthread_alt_unlock
  7447     6563     7341    libc             memcpy
  6661     6019     7109    libc             _IO_vfscanf
  5614     5388     5893    libc             strlen
 88825     5060    14296    kernel           stext_lock
  1276     4043     6210    kernel           do_select
  3994     3933     4239    kernel           tcp_sendmsg
  2761     3917     4071    libc             chunk_alloc
  4285     3606     3829    libapr           apr_palloc
disclaimer: the tests weren't rigidly controlled
The 5 and 6 digit numbers above are the most bothersome. Bill S and Jeff told
me about Ingo Molnar's O(1) scheduler patch. Reviewing the code before and
after this patch, I believe it will make a huge improvement in schedule()
cycles. The older scheduler spends a lot of time looping through the entire run
queue comparing "goodness" values in order to decide which task (process or
thread) is best to dispatch. That's gone with Ingo's O(1) patch. It waits until
a task loses its time slice to recompute an internal priority value, and picks
the highest priority ready task using just a few instructions. It turns out
that Red Hat 7.3 and Red Hat Advanced Server 2.1 already have this patch
included, so this one should be easy to solve. dunno about other distros.
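To make the difference concrete, here's a minimal sketch in C of the O(1) pick.
This is a toy model with hypothetical names, not the actual kernel code: one run
queue per priority level plus a bitmap of non-empty levels, so choosing the next
task is a find-first-set over the bitmap instead of a scan of every runnable
task.

```c
#include <string.h>

#define NPRIO 140                         /* priority levels in the O(1) scheduler */
#define BITS  (8 * sizeof(unsigned long))
#define WORDS ((NPRIO + BITS - 1) / BITS)

/* Toy model: only the bitmap of non-empty per-priority queues. */
struct runqueue {
    unsigned long bitmap[WORDS];
};

static void mark_runnable(struct runqueue *rq, int prio)
{
    rq->bitmap[prio / BITS] |= 1UL << (prio % BITS);
}

/* Lowest numeric priority = highest scheduling priority.  The cost here
 * depends only on the number of priority levels, not on how many tasks
 * are runnable -- that's the whole point of the patch. */
static int highest_prio(const struct runqueue *rq)
{
    unsigned long i, b;
    for (i = 0; i < WORDS; i++)
        if (rq->bitmap[i])
            for (b = 0; b < BITS; b++)
                if (rq->bitmap[i] & (1UL << b))
                    return (int)(i * BITS + b);
    return -1;                            /* nothing runnable */
}
```

With 1000 worker threads runnable, the old scheduler's goodness loop walks all
1000 entries per dispatch; the model above does a constant amount of work.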
__pthread_alt_unlock() loops through a list of threads that are blocked on a
mutex to find the thread with the highest priority. I don't know which mutex
this is; I'm guessing it's the one associated with worker's condition variable.
The ironic thing is that for httpd all the threads have the same priority AFAIK,
so these cycles don't do us much good. I'm not aware of a patch to improve
this, so I think our choices for scalability in the meantime are keeping
ThreadsPerChild very low or using prefork.
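Here's a sketch in C of the behavior described above, as a hypothetical model
rather than the actual linuxthreads source: every unlock walks the whole list of
blocked threads looking for the best priority, which is O(n) per wakeup even
when every thread has the same priority.

```c
#include <stddef.h>

struct waiter {
    int prio;
    struct waiter *next;
};

/* Model of the scan: full traversal of the blocked-thread list on every
 * unlock.  When all priorities are equal (the httpd case), the scan does
 * nothing useful -- the first waiter wins anyway. */
static struct waiter *pick_highest_prio(struct waiter *head)
{
    struct waiter *best = head, *w;
    for (w = head; w != NULL; w = w->next)
        if (w->prio > best->prio)
            best = w;
    return best;
}
```

With 200 threads per child blocked on one condition variable, that's up to 200
list nodes touched per wakeup, which lines up with the way the
__pthread_alt_unlock numbers grow with ThreadsPerChild in the table above.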
The stext_lock() cycles mean that we're getting contention on some kernel
spinlock. I don't know which lock yet. The scheduler uses spinlocks on SMPs,
and the stext_lock cycles above sort of track the scheduler cycles so I'm hoping
that might be it. There's a tool called lockmeter on SourceForge that can
provide spinlock usage statistics in case those cycles don't go away with the
O(1) scheduler patch.
Since we also had problems getting core dumps reliably with a threaded MPM in
addition to the pthread scalability issue, we decided to switch over to
prefork. That gives better SPECWeb99 results at the moment too. Then we
started hitting seg faults in pthread initialization code in child processes
during fork() when trying to start 2000 processes.
It turns out that dmesg had tons of "ldt allocation failure" messages.
linuxthreads uses the ldt on i386 to address its structure representing the
current thread. Since apr is linked with threads enabled by default on Linux,
each child is assigned a 64K ldt out of kernel memory, with only one of its 8K
entries (the one for thread 0) actually used. 64K may not seem like much these
days, but on one machine we had 900M of RAM (according to free) and were trying
for 10,000 concurrent connections, which works out to a budget of 90K of RAM per
process with prefork.
Configuring apr with --disable-threads makes the ldts a non-issue, but that
raises a concern for reliability of binaries when there are 3rd party modules
such as Vignette which are threaded. It's pretty easy to patch the kernel to
reduce the size of the ldts from the i386 architected max of 8K entries each.
That reduces the maximum number of threads per process (which might not be a
bad thing for httpd at the moment), and of course there will be a lot of users
unwilling to rebuild the Linux kernel. With either --disable-threads in apr or
the ldts limited to 256 entries in the kernel, it's no problem starting 10,000
child processes. You can also give the kernel a bigger chunk of RAM, but I
decided not to take away memory from user space on the box with 900M.
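The back-of-the-envelope arithmetic for the numbers above, as a small C helper
(the 8-byte descriptor size and 8192-entry maximum are the i386 architectural
values; the function name is made up):

```c
/* Kernel memory consumed by full-size i386 ldts for nprocs processes. */
static unsigned long ldt_kernel_bytes(unsigned long nprocs)
{
    const unsigned long entry_bytes = 8;     /* i386 segment descriptor size */
    const unsigned long max_entries = 8192;  /* architected ldt maximum      */
    return nprocs * entry_bytes * max_entries;
}
```

8 bytes x 8192 entries is 64K per process, so at 10,000 prefork children the
ldts alone would want roughly 625M of kernel memory, which is why they don't fit
comfortably on the 900M box.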
I've heard rumors of a patch that makes coredumps more reliable with threads,
but don't know any details. If that pans out, maybe the answer for scalability +
reliability + no custom user builds is to go with worker with a small
ThreadsPerChild number.
Greg
Re: custom linux kernel builds
Posted by gr...@apache.org.
Brian Pane wrote:
>
> gregames@apache.org wrote:
>
> >__pthread_alt_unlock() loops through a list of threads that are blocked on a
> >mutex to find the thread with the highest priority. I don't know which mutex
> >this is; I'm guessing it's the one associated with worker's condition variable.
> The leader/follower MPM might solve this problem. It doesn't share one
> big condition variable among all the idle workers; instead, each worker
> has its own condition var.
sounds promising.
Also, I wonder if it will reduce context switching noticeably. That might help
with schedule() cycles. I guess it depends on how frequently worker's listener
thread blocks/unblocks in accept() when I pound it.
> If you have time to try the 2.0.41 version
> of leader/follower in your test environment, I'd be really interested
> in hearing whether it fixes the __pthread_alt_unlock() bottleneck.
I won't be able to try it near term, but hope to before too long. It will
probably be on a uniprocessor, but the problem ought to show up there as well.
Greg
Re: custom linux kernel builds
Posted by Brian Pane <br...@apache.org>.
gregames@apache.org wrote:
>__pthread_alt_unlock() loops through a list of threads that are blocked on a
>mutex to find the thread with the highest priority. I don't know which mutex
>this is; I'm guessing it's the one associated with worker's condition variable.
>The ironic thing is that for httpd all the threads have the same priority AFAIK,
>so these cycles don't do us much good. I'm not aware of a patch to improve
>this, so I think our choices for scalability in the mean time are keeping
>ThreadsPerChild
>very low or prefork.
>
The leader/follower MPM might solve this problem. It doesn't share one
big condition variable among all the idle workers; instead, each worker
has its own condition var. If you have time to try the 2.0.41 version
of leader/follower in your test environment, I'd be really interested
in hearing whether it fixes the __pthread_alt_unlock() bottleneck.
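A minimal pthreads sketch of that per-worker hand-off (hypothetical names, not
the actual MPM code): each worker waits on its own condition variable, so waking
one worker never has to scan a shared list of blocked threads.

```c
#include <pthread.h>

struct worker {
    pthread_mutex_t lock;
    pthread_cond_t  cv;
    int             has_work;
};

/* Worker side: sleep until this worker, specifically, is handed work. */
static void worker_wait(struct worker *w)
{
    pthread_mutex_lock(&w->lock);
    while (!w->has_work)                  /* guard against spurious wakeups */
        pthread_cond_wait(&w->cv, &w->lock);
    w->has_work = 0;
    pthread_mutex_unlock(&w->lock);
}

/* Listener/leader side: signal one known thread on its private condvar,
 * so the unlock path has at most one waiter to consider. */
static void worker_wake(struct worker *w)
{
    pthread_mutex_lock(&w->lock);
    w->has_work = 1;
    pthread_cond_signal(&w->cv);
    pthread_mutex_unlock(&w->lock);
}
```

If the __pthread_alt_unlock cost really is the waiter-list scan, keeping each
condvar's waiter count at one should make that scan trivial.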
(Note: It needs to be built with "./configure --enable-nonportable-atomics=yes"
for best results on x86.)
Brian