You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@apr.apache.org by "William A. Rowe, Jr." <wr...@rowe-clan.net> on 2003/07/30 18:21:21 UTC

[PATCH] Thread Locks and SMP boxes

Attached is a patch that appears to be as close as thread-safe
(but altogether not reentrant) as I can accomplish.

The existing unix/thread_mutex.c code is neither thread-safe
nor reentrant when using the 'nested' (default) thread mutexes.
The bug is most often expressed on SMP boxes, where they
are so massively parallel that the subtle flaws of modifying the
mutex object AFTER we have released the lock become very
apparent.

Examples include a number of error (45) Deadlock Avoided
alerts using Apache2/mod_ssl on Solaris, where there are both
internal thread locks and global locks which contain a nested
thread lock.

Many more eyes on this code would be welcomed, I expect to
discover that we repeat this mistake elsewhere in the mutex
code...

Bill

Re: [PATCH] Thread Locks and SMP boxes

Posted by "William A. Rowe, Jr." <wr...@apache.org>.
At 06:07 PM 7/30/2003, Joe Orton wrote:
>On Wed, Jul 30, 2003 at 11:21:21AM -0500, William Rowe wrote:
>> 
>> The existing unix/thread_mutex.c code is neither thread-safe
>> nor reentrant when using the 'nested' (default) thread mutexes.
>
>Why do you say "nested (default)"? The pools code is the only place I
>can find which actually uses nested mutexes. By default they are not
>nested.

You are correct - it's a platform choice, and Win32 chooses the nested
as a default because critical sections are much faster.  I made 
a misassumption on this point.


>> The bug is most often expressed on SMP boxes, where they
>> are so massively parallel that the subtle flaws of modifying the
>> mutex object AFTER we have released the lock become very
>> apparent.
>>
>> Examples include a number of error (45) Deadlock Avoided
>> alerts using Apache2/mod_ssl on Solaris, where there are both
>> internal thread locks and global locks which contain a nested
>> thread lock.
>
>This code looks like it is making lots of dubious assumptions:
>
>- pthread_t can be a structure, that's why the memset is used I presume,
>assinging 0 to a pthread_t definitely wins no prizes

We are assigning the pthread_t to the result of apr_os_thread_current()
so I presumed it's intregal enough to accept a 0/NULL reset.  memset
is obviously less atomic than such a reset, using many more cycles.

>- who says a memset-to-zero'ed pthread_t isn't a valid pthread_t?  No
>reason why it couldn't be.  If it is *not* a valid pthread_t, then
>passing it to pthread_equal gives undefined behaviour anyway!

This point *does* scare me, and it suggests some sort of requirement
for an APR_THREAD_ID_NONE that's guarenteed to be not-a-thread.

>- let's hope and pray that pthread_equal() is an atomic operation

Good point, and point taken.

>> Many more eyes on this code would be welcomed, I expect to
>> discover that we repeat this mistake elsewhere in the mutex
>> code...
>
>Like Aaron I'm not sure why the comments about reentrancy are relevant:
>you can't use pthread mutexes (and hence apr mutexes :) from a signal
>handler, anyone doing that is SOL.

So we should spell that out as a prerequisite.

>You've made lots of changes in the patch:
>
>1. changing some incr++ to ++incr which makes no differences

The result is add-and-return, as opposed to evalute and increment.
The result will be optimized away since it's unused as an rvalue,
but there always is a difference.

Evaluting some bytecode might be amusing.

>2. re-ordered the mutex_unlock code so that the ownership changes occur
>inside the critical section.

This is the *critical* point of the patch, and one that I would commit first
and prior to any other "legibility" or other insignificant aspects of the patch.

>3. added some "protection" against races
>
>1 and 3 look bogus; 2 looks like the important change here.  The only
>effect of your (3) changes would seem to be to hide races if there still
>are any?

No, the next step in 3 is to actually provide a mechanism for throwing
an exception when those cases are encountered.

>(2) does introduce a semantic change in that the caller cannot rely on
>pthread_mutex_unlock() having defined behaviour if called from a thread
>which does not own the mutex lock, but I expect that's no big deal.

If we are crossing ownership due to this or other races or reentrancy,
this is a significant change, but it should probably become a fatal exception.

>It's worth noting that most modern POSIX implementations offer
>"recursive" mutexes in which pthread_mutex_lock/unlock do all this work
>internally.

Which should be detected with autoconf.  As I invited, more eyes would
be much appreciated.

Bill


Re: [PATCH] Thread Locks and SMP boxes

Posted by Wan-Teh Chang <wt...@netscape.com>.
Joe Orton wrote:

 > This code looks like it is making lots of dubious assumptions:
 >
 > - pthread_t can be a structure, that's why the memset is used I presume,
 > assinging 0 to a pthread_t definitely wins no prizes
 > - who says a memset-to-zero'ed pthread_t isn't a valid pthread_t?  No
 > reason why it couldn't be.  If it is *not* a valid pthread_t, then
 > passing it to pthread_equal gives undefined behaviour anyway!

Agreed.  In some implementations pthread_t is an integer
index, and 0 could be a valid pthread_t (just like 0 is a
valid file descriptor on Unix).  So you can't assume that
0 is an invalid pthread_t.

Wan-Teh


Re: [PATCH] Thread Locks and SMP boxes

Posted by Joe Orton <jo...@manyfish.co.uk>.
On Wed, Jul 30, 2003 at 11:21:21AM -0500, William Rowe wrote:
> Attached is a patch that appears to be as close as thread-safe
> (but altogether not reentrant) as I can accomplish.
> 
> The existing unix/thread_mutex.c code is neither thread-safe
> nor reentrant when using the 'nested' (default) thread mutexes.

Why do you say "nested (default)"? The pools code is the only place I
can find which actually uses nested mutexes. By default they are not
nested.
 
> The bug is most often expressed on SMP boxes, where they
> are so massively parallel that the subtle flaws of modifying the
> mutex object AFTER we have released the lock become very
> apparent.
>
> Examples include a number of error (45) Deadlock Avoided
> alerts using Apache2/mod_ssl on Solaris, where there are both
> internal thread locks and global locks which contain a nested
> thread lock.

This code looks like it is making lots of dubious assumptions:

- pthread_t can be a structure, that's why the memset is used I presume,
assinging 0 to a pthread_t definitely wins no prizes
- who says a memset-to-zero'ed pthread_t isn't a valid pthread_t?  No
reason why it couldn't be.  If it is *not* a valid pthread_t, then
passing it to pthread_equal gives undefined behaviour anyway!
- let's hope and pray that pthread_equal() is an atomic operation

> Many more eyes on this code would be welcomed, I expect to
> discover that we repeat this mistake elsewhere in the mutex
> code...

Like Aaron I'm not sure why the comments about reentrancy are relevant:
you can't use pthread mutexes (and hence apr mutexes :) from a signal
handler, anyone doing that is SOL.

You've made lots of changes in the patch:

1. changing some incr++ to ++incr which makes no differences
2. re-ordered the mutex_unlock code so that the ownership changes occur
inside the critical section.
3. added some "protection" against races

1 and 3 look bogus; 2 looks like the important change here.  The only
effect of your (3) changes would seem to be to hide races if there still
are any?

(2) does introduce a semantic change in that the caller cannot rely on
pthread_mutex_unlock() having defined behaviour if called from a thread
which does not own the mutex lock, but I expect that's no big deal.

It's worth noting that most modern POSIX implementations offer
"recursive" mutexes in which pthread_mutex_lock/unlock do all this work
internally.

joe

Re: [PATCH] Thread Locks and SMP boxes

Posted by "William A. Rowe, Jr." <wr...@apache.org>.
At 06:35 PM 7/30/2003, Joe Orton wrote:
>On Wed, Jul 30, 2003 at 01:34:41PM -0500, William Rowe wrote:
>...
>> Because the new patch protects the uninitalization of the mutex while
>> the lock is still held, the only failure scenario that remains is;
>> 
>> 1. thread is interrupted (e.g. signal handler) in between the unsetting
>>    of the ownership (and decrement of the refcount) and actually releasing
>>    the mutex.  The interrupt handler attempts to perform a nested lock
>>    and deadlocks because the ownership has already been reset, but
>>    the lock is not yet released.
>
>Why are you worried about signal handlers using mutexes? Is httpd doing
>this somewhere?

No.  I'm simply pedantic and a paranoid.  Actually, I really get ticked off
by the attitiude expressed by a few individuals that "if the Apache HTTP 
Web Server doesn't do this, APR doesn't need [to deal with] this feature."

Hopefully, at some point, APR will have some applicability beyond HTTP.
In the meantime, it's not unreasonable to deal with the issues that others
might have to confront.

Since the few responses are positive <--> antagonistic, I'll commit both 
the simple and obvious point (2. that you pointed out in an earlier mail)
and the decorative and subtle (1. & 3.) in two separate commits early
tomorrow.

Bill 


Re: [PATCH] Thread Locks and SMP boxes

Posted by Joe Orton <jo...@manyfish.co.uk>.
On Wed, Jul 30, 2003 at 01:34:41PM -0500, William Rowe wrote:
...
> Because the new patch protects the uninitalization of the mutex while
> the lock is still held, the only failure scenario that remains is;
> 
> 1. thread is interrupted (e.g. signal handler) in between the unsetting
>    of the ownership (and decrement of the refcount) and actually releasing
>    the mutex.  The interrupt handler attempts to perform a nested lock
>    and deadlocks because the ownership has already been reset, but
>    the lock is not yet released.

Why are you worried about signal handlers using mutexes? Is httpd doing
this somewhere?

Re: [PATCH] Thread Locks and SMP boxes

Posted by Aaron Bannert <aa...@clove.org>.
Comments below...


On Wednesday, July 30, 2003, at 09:21  AM, William A. Rowe, Jr. wrote:

> Attached is a patch that appears to be as close as thread-safe
> (but altogether not reentrant) as I can accomplish.
>
> The existing unix/thread_mutex.c code is neither thread-safe
> nor reentrant when using the 'nested' (default) thread mutexes.
> The bug is most often expressed on SMP boxes, where they
> are so massively parallel that the subtle flaws of modifying the
> mutex object AFTER we have released the lock become very
> apparent.
>
> Examples include a number of error (45) Deadlock Avoided
> alerts using Apache2/mod_ssl on Solaris, where there are both
> internal thread locks and global locks which contain a nested
> thread lock.
>
> Many more eyes on this code would be welcomed, I expect to
> discover that we repeat this mistake elsewhere in the mutex
> code...

> Index: srclib/apr/thread_mutex.c
> ===================================================================
> RCS file: /home/cvs/apr/locks/unix/thread_mutex.c,v
> retrieving revision 1.18
> diff -u -r1.18 thread_mutex.c
> --- srclib/apr/thread_mutex.c	8 Jun 2003 13:42:20 -0000	1.18
> +++ srclib/apr/thread_mutex.c	30 Jul 2003 16:09:41 -0000
> @@ -108,8 +108,11 @@
>      apr_status_t rv;
>
>      if (mutex->nested) {
> +        /*
> +         * WARNING: although threadsafe, this test is NOT reentrant
> +         */

I don't know what you mean by reentrant here, could you explain?

>          if (apr_os_thread_equal(mutex->owner, 
> apr_os_thread_current())) {
> -            mutex->owner_ref++;
> +            ++mutex->owner_ref;

Does this make a difference? (I think an incr is not threadsafe any way 
we
look at it, but I could be wrong (It's really a load/incr/store, 
right?).)

>              return APR_SUCCESS;
>          }
>
> @@ -121,8 +124,13 @@
>              return rv;
>          }
>
> +        if (++mutex->owner_ref != 1) {
> +            /* The owner_ref should be zero when the lock is not held,
> +             * if owner_ref was non-zero we have a mutex reference 
> bug.
> +             */
> +            mutex->owner_ref = 1;
> +        }
>          mutex->owner = apr_os_thread_current();
> -        mutex->owner_ref = 1;
>          return rv;
>      }
>      else {
> @@ -141,8 +149,11 @@
>      apr_status_t rv;
>
>      if (mutex->nested) {
> +        /*
> +         * WARNING: although threadsafe, this test is NOT reentrant
> +         */
>          if (apr_os_thread_equal(mutex->owner, 
> apr_os_thread_current())) {
> -            mutex->owner_ref++;
> +            ++mutex->owner_ref;
>              return APR_SUCCESS;
>          }
>
> @@ -154,8 +165,13 @@
>              return (rv == EBUSY) ? APR_EBUSY : rv;
>          }
>
> +        if (++mutex->owner_ref != 1) {
> +            /* The owner_ref should be zero when the lock is not held,
> +             * if owner_ref was non-zero we have a mutex reference 
> bug.
> +             */
> +            mutex->owner_ref = 1;
> +        }
>          mutex->owner = apr_os_thread_current();
> -        mutex->owner_ref = 1;
>      }
>      else {
>          rv = pthread_mutex_trylock(&mutex->mutex);
> @@ -176,31 +192,30 @@
>
>      if (mutex->nested) {
>          if (apr_os_thread_equal(mutex->owner, 
> apr_os_thread_current())) {
> -            mutex->owner_ref--;
> -            if (mutex->owner_ref > 0)
> +            /*
> +             * WARNING: although threadsafe, this test is NOT 
> reentrant
> +             */
> +            if (--mutex->owner_ref > 0)
>                  return APR_SUCCESS;
>          }
> -        status = pthread_mutex_unlock(&mutex->mutex);
> -        if (status) {
> -#ifdef PTHREAD_SETS_ERRNO
> -            status = errno;
> -#endif
> -            return status;
> -        }
> +        mutex->owner = 0;
>
> -        memset(&mutex->owner, 0, sizeof mutex->owner);
> -        mutex->owner_ref = 0;
> -        return status;
> +        if (mutex->owner_ref < 0) {
> +            /* The owner_ref should be zero when the lock is finally 
> released,
> +             * if owner_ref is non-zero we have a mutex reference bug.
> +             */
> +            mutex->owner_ref = 0;
> +        }
>      }
> -    else {
> -        status = pthread_mutex_unlock(&mutex->mutex);
> +
> +    status = pthread_mutex_unlock(&mutex->mutex);
>  #ifdef PTHREAD_SETS_ERRNO
> -        if (status) {
> -            status = errno;
> -        }
> -#endif
> -        return status;
> +    if (status) {
> +        status = errno;
>      }
> +#endif
> +
> +    return status;
>  }
>
>  APR_DECLARE(apr_status_t) apr_thread_mutex_destroy(apr_thread_mutex_t 
> *mutex)


This last chunk seems to be the meat of at least one of the problems.
Still, even though this change is good in that we only mess with
the mutex->owner stuff when the lock is held, we aren't protecting
mutex->owner in other places (eg. in the places above that you pointed 
out)

Do we really need nested mutexes? The correct solution might be to add 
another
internal pthread_mutex that only gets used when we want nested mutexes,
and only gets locked when we want to read or modify those mutex->owner
variables. This might make things a little slower than they would be
now (only slightly), but it would be correct if nested mutexes are
really a necessary feature.

-aaron


Re: [PATCH] Thread Locks and SMP boxes

Posted by "William A. Rowe, Jr." <wr...@apache.org>.
Explaining the failure pattern;

Once a second thread blocked against the old pthread_unlock
statement, and a second thread finally released that lock, one
of two failure conditions occurred...

1. the original owner had an implicit yield timeslice to the new
   acquirer of the mutex.  That 2nd thread which obtained the
   pthread_mutex_lock would set it's ownership and initialize
   the refcount to one.

   When the original thread regained it's timeslice, it would UNSET
   the new threads ownership and refcount so the mutex appeared
   unowned.

   When the new thread attempted a nested thread lock, it wouldn't
   recognize the mutex owner, so it would deadlock.

2. On a massively parallel (SMP) box, the original thread releasing
   the mutex would not yeild.  The original and new threads would
   both race to unset and set the ownership, respectively.  This created
   a somewhat different race pattern.

Note the use of memset(&mutex, 0, sizeof mutex) further skewed the
behavior by using a very expensive call to unset what is usually a simple
pointer or int.

Because the new patch protects the uninitalization of the mutex while
the lock is still held, the only failure scenario that remains is;

1. thread is interrupted (e.g. signal handler) in between the unsetting
   of the ownership (and decrement of the refcount) and actually releasing
   the mutex.  The interrupt handler attempts to perform a nested lock
   and deadlocks because the ownership has already been reset, but
   the lock is not yet released.

This one remaining failure case is far more unlikely than our currently
possible host of issues.  I don't see a simple workaround to avoid
this last failure case.

Bill