Posted to dev@httpd.apache.org by Aaron Bannert <aa...@ebuilt.com> on 2001/07/30 22:42:07 UTC

lock benchmarks on Solaris 8/sparc uniprocessor

Here are some benchmarks I performed on a Uniprocessor UltraSparc machine
running Solaris 8. The benchmarking code is the code that W. Richard
Stevens used in his UNIX Network Programming, Vol. 2: Interprocess
Communication, Second Edition (see Appendix A, pp. 463-466). I invite everyone
to perform these tests on their platforms in various configurations
(I *really* want to run these tests on a big 8-way sun box :)

Note that *nothing* in these tests will run faster in parallel, so
the single concurrency case will be optimal. This is good because it
means these tests maximally reflect the performance of the underlying
synchronization mechanisms, and are minimally skewed by the ability of
the machine to do the basic operation that we are serializing.

The numbers are all in seconds. Each test was performed 3 times and
averaged. The tests themselves consist of a number of concurrent workers
(threads or processes), each of which contends for a mutex.  Once the
mutex is acquired the active thread simply increments a counter and
unlocks. When the counter reaches 1 million, the process prints the time
delta and exits.
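
For reference, the inner loop of each worker looks roughly like the
following minimal sketch (this is not Stevens' actual incr code; the
pthread_mutex variant is shown, and error handling is omitted):

/* Minimal sketch of the benchmark pattern: nthreads workers contend
 * for one mutex, and the run ends once the shared counter reaches
 * 1 million increments. Compile with -lpthread (and -lrt on Solaris). */
#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define MAXCOUNT   1000000L
#define MAXTHREADS 64

static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static long counter = 0;

static void *worker(void *arg)
{
    for (;;) {
        pthread_mutex_lock(&lock);
        if (counter >= MAXCOUNT) {          /* done: let everyone exit */
            pthread_mutex_unlock(&lock);
            return NULL;
        }
        counter++;
        pthread_mutex_unlock(&lock);
    }
}

int main(int argc, char **argv)
{
    int i, nthreads = (argc > 1) ? atoi(argv[1]) : 1;
    pthread_t tid[MAXTHREADS];
    struct timespec start, stop;

    if (nthreads > MAXTHREADS)
        nthreads = MAXTHREADS;
    clock_gettime(CLOCK_REALTIME, &start);
    for (i = 0; i < nthreads; i++)
        pthread_create(&tid[i], NULL, worker, NULL);
    for (i = 0; i < nthreads; i++)
        pthread_join(tid[i], NULL);
    clock_gettime(CLOCK_REALTIME, &stop);
    printf("%.1f sec\n", (stop.tv_sec - start.tv_sec)
           + (stop.tv_nsec - start.tv_nsec) / 1e9);
    return 0;
}

Swapping the pthread_mutex_lock/unlock pair for the corresponding
sem_wait/sem_post, semop(), or fcntl() calls yields the other rows below.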

Multithreaded Results (aka PROCESS_PRIVATE)
-------------------------------------------------------------------------
Lock Mechanism            Concurrency      Total time (sec)
==============            ===========      ================
pthread_mutex             1                0.4
pthread_mutex             2                0.7
pthread_mutex             3                1.1
pthread_mutex             4                1.5
pthread_mutex             5                1.8

pthread_rwlock            1                0.9
pthread_rwlock            2                1.9
pthread_rwlock            3                3.1
pthread_rwlock            4                4.5
pthread_rwlock            5                8.4

posix memory-based sem.   1                2.7
posix memory-based sem.   2                5.4
posix memory-based sem.   3                8.1
posix memory-based sem.   4                10.8
posix memory-based sem.   5                13.5

posix named sem.          1                7.5
posix named sem.          2                15.1
posix named sem.          3                22.7
posix named sem.          4                30.6
posix named sem.          5                38.5

SysV sem.                 1                4.0
SysV sem.                 2                8.6
SysV sem.                 3                12.5
SysV sem.                 4                16.5
SysV sem.                 5                21.0

SysV sem. w/ UNDO         1                4.7
SysV sem. w/ UNDO         2                9.5
SysV sem. w/ UNDO         3                14.5
SysV sem. w/ UNDO         4                19.1
SysV sem. w/ UNDO         5                23.8

fcntl()                   1                15.4
[thread concurrency greater than 1 on Solaris is not possible, since fcntl()
 can only lock between processes, not between threads in the same process.
 See below for the multiprocess fcntl() results.]


Multiprocess Results (aka PROCESS_SHARED)
-------------------------------------------------------------------------
Lock Mechanism            Concurrency      Total time (sec)
==============            ===========      ================
pthread_mutex             1                0.4
pthread_mutex             2                0.8
pthread_mutex             3                1.1
pthread_mutex             4                1.4
pthread_mutex             5                1.8

pthread_rwlock            1                0.8
pthread_rwlock            2                1.5
pthread_rwlock            3                2.6
pthread_rwlock            4                4.3
pthread_rwlock            5                6.2

posix memory-based sem.   1                7.4
posix memory-based sem.   2                14.9
posix memory-based sem.   3                22.6
posix memory-based sem.   4                29.6
posix memory-based sem.   5                37.2

posix named sem.          1                7.7
posix named sem.          2                14.9
posix named sem.          3                22.4
posix named sem.          4                29.9
posix named sem.          5                37.4

SysV sem.                 1                4.1
SysV sem.                 2                8.4
SysV sem.                 3                12.0
SysV sem.                 4                16.1
SysV sem.                 5                20.3

SysV sem. w/ UNDO         1                5.0
SysV sem. w/ UNDO         2                9.8
SysV sem. w/ UNDO         3                14.4
SysV sem. w/ UNDO         4                19.3
SysV sem. w/ UNDO         5                23.7

fcntl()                   1                15.4
fcntl()                   2                40.6
fcntl()                   3                61.2
fcntl()                   4                89.0
fcntl()                   5                118.8
[Note: the lock file used here was in the /tmp directory. Lock files
 on a non-RAM-based filesystem were significantly slower, and lock
 files on an NFS partition were even worse than that.]
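
For reference, the fcntl() numbers above come from serializing on a lock
file, so every lock and unlock is a system call through the filesystem
layer (which is why the backing filesystem matters so much). A minimal
sketch of that lock/unlock pair, with error handling omitted, looks like
this:

/* Minimal sketch of fcntl() record locking used as a cross-process
 * mutex: take and release a write lock covering the whole lock file.
 * Error handling omitted for brevity. */
#include <fcntl.h>
#include <unistd.h>

static int lockfd;    /* an fd open()ed on e.g. /tmp/bench.lock */

static void my_lock(void)
{
    struct flock fl;
    fl.l_type = F_WRLCK;          /* exclusive write lock */
    fl.l_whence = SEEK_SET;
    fl.l_start = 0;
    fl.l_len = 0;                 /* 0 = lock to end of file */
    fcntl(lockfd, F_SETLKW, &fl); /* F_SETLKW blocks until granted */
}

static void my_unlock(void)
{
    struct flock fl;
    fl.l_type = F_UNLCK;
    fl.l_whence = SEEK_SET;
    fl.l_start = 0;
    fl.l_len = 0;
    fcntl(lockfd, F_SETLKW, &fl);
}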


Commentary:
---------------------
From the perspective of APR, choosing the correct underlying lock
mechanism can be very difficult. Trying to match a general-use
mutual exclusion mechanism to a particular platform with a particular
configuration may involve too many variables to deal with at build-time
(or even run-time). I'm not making any assertions here about which locking
mechanisms we should or should not be using, but I think we should gather
some more data and revisit this problem.

When we look at this merely from the perspective of solving the
accept() mutex problem in httpd, we have fewer variables to deal with
(CROSS_PROCESS vs. LOCKALL), but the essence of the problem still
remains. The above results don't reflect other versions of Solaris,
nor do they reflect what happens on a multiprocessor machine. My
hope is that this will give us something to chew on for a while.

-aaron


Re: lock benchmarks on Solaris 8/sparc uniprocessor

Posted by Ian Holsman <ia...@cnet.com>.
Aaron Bannert wrote:

>On Tue, Jul 31, 2001 at 02:32:48PM -0700, Aaron Bannert wrote:
>
>>Would it be prudent for APR to provide a shared-memory implementation of
>>posix mutexes? It seems to me that we don't have to rely on PROCESS_SHARED
>>being available on a particular platform if we handle our own shared
>>memory allocation. Are there any known caveats to this type of an
>>implementation?
>>
>
>Er, I'm smoking crack here or something. Of course we're already doing
>it this way, I just didn't notice before. *smack*
>
>Are there any differences between that and using a SysV shmem
>implementation? I'm a relative newbie when it comes to how portable
>subsystems like this are.
>
>-aaron
>
If you could implement a Solaris-specific set of apr_shmem_* functions,
the shared-process locking would make use of them (i.e., replace 'mm').

..Ian


Re: lock benchmarks on Solaris 8/sparc uniprocessor

Posted by Aaron Bannert <aa...@ebuilt.com>.
On Tue, Jul 31, 2001 at 02:32:48PM -0700, Aaron Bannert wrote:
> Would it be prudent for APR to provide a shared-memory implementation of
> posix mutexes? It seems to me that we don't have to rely on PROCESS_SHARED
> being available on a particular platform if we handle our own shared
> memory allocation. Are there any known caveats to this type of an
> implementation?

Er, I'm smoking crack here or something. Of course we're already doing
it this way, I just didn't notice before. *smack*

Are there any differences between that and using a SysV shmem
implementation? I'm a relative newbie when it comes to how portable
subsystems like this are.

-aaron


Re: lock benchmarks on Solaris 8/sparc uniprocessor

Posted by Aaron Bannert <aa...@ebuilt.com>.
I've run some more tests with much higher concurrency (so far only on
my uniprocessor solaris 8/sparc machine, but from preliminary results
from Ian's 8-way sun box things only get worse with more CPUs). I've
tried to match the usage pattern of each of our major MPMs, described
here:

- prefork: one listener/worker per process, many processes.

- threaded: multiple listeners/workers per process, many processes but
            far fewer than prefork.

- worker: single listener, multiple workers per process, similar number of
          processes to threaded.


I ran these three tests, each with 50 concurrent {threads,processes} that
each contend for a lock, increment a counter, and unlock; exiting after
the counter has reached 1 million:

pthread_mutex across threads:        18.5 sec
    -- applicable to threaded and worker

pthread_mutex across processes:      18.0 sec
    -- applicable to threaded, prefork, and worker

fcntl() across processes:            2790.2 sec (46.5 minutes!!)
    -- applicable to threaded, prefork, and worker


My interpretation of this is that the overhead incurred on acquiring
and releasing a lock 1 million times is somewhere around 2 orders of
magnitude more on fcntl() than the overhead for the same operation using
a posix mutex.  At first thought this may seem like an extreme case, but
given a high request-load there will be on the order of n LWPs waiting
on the same accept lock in both of the prefork and threaded MPMs (where
n is the number of processes * workers/process).

Given these results, it is clear to me that we should attempt to use
posix mutexes whenever possible (even more so on large n-way machines,
as fcntl()'s superlinear growth seems to steepen with each new
processor). This may only be true for Solaris (8/sparc), but I think
that in order to properly evaluate other platforms we'll need to run
similar tests.

Would it be prudent for APR to provide a shared-memory implementation of
posix mutexes? It seems to me that we don't have to rely on PROCESS_SHARED
being available on a particular platform if we handle our own shared
memory allocation. Are there any known caveats to this type of an
implementation?
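
For concreteness, the approach in question looks roughly like the sketch
below: place the mutex in memory mapped MAP_SHARED before fork() and
initialize it with the PTHREAD_PROCESS_SHARED attribute. This is only an
illustration of the technique, not APR's actual implementation, and it
assumes the platform supports _POSIX_THREAD_PROCESS_SHARED:

/* Sketch: a pthread mutex shared across fork() via anonymous shared
 * memory (older systems may need an mmap of /dev/zero instead of
 * MAP_ANON). Error handling omitted for brevity. */
#include <pthread.h>
#include <sys/mman.h>
#include <unistd.h>

int main(void)
{
    pthread_mutexattr_t attr;
    pthread_mutex_t *mutex;

    /* One shared mapping holding the mutex, visible to future children. */
    mutex = mmap(NULL, sizeof(*mutex), PROT_READ | PROT_WRITE,
                 MAP_SHARED | MAP_ANON, -1, 0);

    pthread_mutexattr_init(&attr);
    pthread_mutexattr_setpshared(&attr, PTHREAD_PROCESS_SHARED);
    pthread_mutex_init(mutex, &attr);
    pthread_mutexattr_destroy(&attr);

    if (fork() == 0) {            /* child sees the same mutex */
        pthread_mutex_lock(mutex);
        /* ... critical section ... */
        pthread_mutex_unlock(mutex);
        _exit(0);
    }
    pthread_mutex_lock(mutex);
    /* ... critical section ... */
    pthread_mutex_unlock(mutex);
    return 0;
}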

-aaron


Re: lock benchmarks on Solaris 8/sparc uniprocessor

Posted by Aaron Bannert <aa...@ebuilt.com>.
On Mon, Jul 30, 2001 at 01:42:07PM -0700, Aaron Bannert wrote:
> Here are some benchmarks I performed on a Uniprocessor UltraSparc machine
[...]

For the curious, the code can be downloaded from:
http://www.kohala.com/start/unpv22e/unpv22e.tar.gz

(which is a link from the main book page at:
http://www.kohala.com/start/unpv22e/unpv22e.html )


The test I ran is bench/incr.sh, just prettified for the mailing list :)

-aaron

