You are viewing a plain text version of this content. The canonical link for it is here.

Posted to dev@apr.apache.org by Philip Martin <ph...@codematters.co.uk> on 2002/11/17 00:36:20 UTC

apr_proc_mutex is broken

Hello

The following issue was raised in the Subversion issue tracker

http://subversion.tigris.org/issues/show_bug.cgi?id=996

"While tracking down Bug 995, I've found out that apr's testglobalmutex
 and testprocmutex are failing for me when running in SMP mode on a
 Dual P3.  The same tests do not fail when running in Single Processor
 mode or running on my Dual Athlon in SMP mode.  I am not certain if
 this related to Bug 995, but it does appear that httpd is going into a
 deadlock and the subversion cmdline client eventually times out.  I've
 also noticed that apache leaks semaphores when this happens (from
 ipcs)."

I'm not really sure why the submitter chose to report it to
Subversion, and not (as far as I can tell) to APR, but I took a look
at the proc_mutex code and there does appear to be a bug.

The documentation for apr_proc_mutex_create states that a proc mutex
is intended to synchronize processes.  Consider

APR_DECLARE(apr_status_t) apr_proc_mutex_lock(apr_proc_mutex_t *mutex)
{
    apr_status_t rv;

#if APR_HAS_THREADS
    if (apr_os_thread_equal(mutex->owner, apr_os_thread_current())) {
        mutex->owner_ref++;
        return APR_SUCCESS;
    }
#endif

    if ((rv = mutex->meth->acquire(mutex)) != APR_SUCCESS) {
        return rv;
    }

#if APR_HAS_THREADS
    mutex->owner = apr_os_thread_current();
    mutex->owner_ref = 1;
#endif

    return APR_SUCCESS;
}

The first problem is the line
    if (apr_os_thread_equal(mutex->owner, apr_os_thread_current())) {
where there is access to the shared data mutex->owner without any sort
of synchronization.  Now mutex->owner may not be an atomic type, in
which case a totally bogus value could be obtained, and if it is an
atomic type the unsynchronized access is still pointless.

However the real problem is that this is supposed to be a *process*
lock, and yet it is comparing *thread* IDs.  Thread IDs are distinct
within a process, but there is no guarantee that they are distinct
across mutiple processes.  Comparing thread IDs from two separate
processes is undefined behaviour when using POSIX threads.  There is
at least one common platform (GNU glibc threads) where thread IDs are
duplicated.  When I run APR's testprocmutex test on a 2-way SMP Linux
box it regularly fails (that it doesn't always fail is, I suspect,
because it is not a particularly good test and so the processes often
complete without any mutex contention).

-- 
Philip Martin

Re: apr_proc_mutex is broken

Posted by Aaron Bannert <aa...@clove.org>.

On Mon, Nov 18, 2002 at 08:20:11PM -0500, Jeff Trawick wrote:
> 2)  tweak make_child() in the test program to trap errors from
>     apr_proc_mutex_lock() and apr_proc_mutex_unlock() 
> 
> I bet you are hitting errors with the mutex, but they're not being
> reported right now.  And of course if the mutex doesn't work right the
> counter in shared memory is no longer interesting either.

Been working on adding in error checking today, I'll add what I have
as well as catch the errors from mutex locks and unlocks in soonish.

-aaron

Re: [PATCH] Re: apr_proc_mutex is broken

Posted by Jeff Trawick <tr...@attglobal.net>.

<rb...@rkbloom.net> writes:

> If you don't want the child process to destroy the mutex, then you should
> kill that cleanup in the child process.  That is why we have the
> apr_pool_cleanup_kill API.

That would seem to be the answer to my "Maybe I'm missing something
obvious" statement in a different post :)

-- 
Jeff Trawick | trawick@attglobal.net
Born in Roswell... married an alien...

Re: [PATCH] Re: apr_proc_mutex is broken

Posted by rb...@rkbloom.net.

On 19 Nov 2002, Philip Martin wrote:

> <rb...@rkbloom.net> writes:
>
> > Follow my logic please.
>
> [snip explanation]
>
> I see, thanks.
>
> > The real fix for this problem is to fix the test code as I have stated
> > above so that it is a valid example of how to write APR code, and to fix
> > the library so that we register cleanups with a public function, and so
> > that apr_proc_mutex_child_init does the right thing in all cases.  I will
> > do the work later tonight (assuming my wife doesn't go into labor before
> > then) unless somebody beats me to it.
>
> I think the test should also be changed to better exercise the mutex
> behaviour, i.e. replace (*x)++ with "read, wait, increment, write".
> Had the test been written like this originally it would not have
> passed erroneously.  It may also help when/if someone ports APR to a
> new platform.

One of my projects is to re-implement the test suite using CuTest
(currently in the test directory).  I am about half way through.  Once I
have all of the current tests actually compiling and reporting useful
information, I will begin to crete more tests that really exercise APR.

The original goal for the APR test suite was that implementors could run
the suite and be sure that they had implemented the correct logic on their
platforms.  Unfortunately, the test suite has never really matured, so now
we have to play catch up.  :-(

Any help that people can give writing tests and porting the test suite is
always appreciated.  I am likely to re-write the CuTest internals soon,
because I am unhappy about how it does some stuff.  But, the API should
remain basically the same (although the names will most likely change).

Ryan

Re: [PATCH] Re: apr_proc_mutex is broken

Posted by Philip Martin <ph...@codematters.co.uk>.

<rb...@rkbloom.net> writes:

> Follow my logic please.

[snip explanation]

I see, thanks.

> The real fix for this problem is to fix the test code as I have stated
> above so that it is a valid example of how to write APR code, and to fix
> the library so that we register cleanups with a public function, and so
> that apr_proc_mutex_child_init does the right thing in all cases.  I will
> do the work later tonight (assuming my wife doesn't go into labor before
> then) unless somebody beats me to it.

I think the test should also be changed to better exercise the mutex
behaviour, i.e. replace (*x)++ with "read, wait, increment, write".
Had the test been written like this originally it would not have
passed erroneously.  It may also help when/if someone ports APR to a
new platform.

-- 
Philip Martin

Re: [PATCH] Re: apr_proc_mutex is broken

Posted by Aaron Bannert <aa...@clove.org>.

On Tuesday, November 19, 2002, at 01:51  PM, <rb...@rkbloom.net> wrote:
>
> 	apr_proc_mutex_child_init()
> 		Depending on mutex type, the cleanup should be killed or
> not.  In general, file-based mutex will be killed, but semaphores will
> not.
>
I don't understand this, why does the app treat the different types
differently, or are you saying that the cleanup killing does or
doesn't happen depending on the implementation?

> 2)  We aren't calling apr_proc_mutex_child_init, which is what makes 
> the
> mutex available to the child process.  We were relying on inheritance 
> to
> get the mutex to the child process.  If we had been calling this 
> function
> (and the function was fixed to kill the cleanup if appropriate), then 
> we
> would have masked the previous bug.

Child init on almost all of the proc mutex implementations is a no-op 
[on
unix, where this bug exists].

> At the end of the day, the test was bogus, and it wasn't good APR code,
> which is what was causing the problem.  There is one change that 
> should be
> made to the underlying APR code to help hide this from the user, but 
> the
> real problem is in the test.

I don't think anyone disagrees that there are problems in the test, but
I do strongly believe that the interface is also broken.

> But the latter is MUCH worse when it comes to locks.  You only have so
> many semaphores available to you.  A resource leak can be a real PITA 
> to
> track down, and I suspect that this general bug is biting Apache in a
> couple of places because we do leak semaphores.  I currently have 12
> semaphores in use by Apache.  When I stop the currently running 
> server, I
> have 9 in use (all by the web server).
>
> Resource leaks suck, and they are just as hard to diagnose as doing
> something too often.

This is not a valid reason, since these kinds of errors are due to
bugs in the application, NOT due to bad APR interfaces or 
implementations.

I don't think we should complicate the heck out of APR just to hold
the hands of poor app authors.

> The real fix for this problem is to fix the test code as I have stated
> above so that it is a valid example of how to write APR code, and to 
> fix
> the library so that we register cleanups with a public function, and so
> that apr_proc_mutex_child_init does the right thing in all cases.  I 
> will
> do the work later tonight (assuming my wife doesn't go into labor 
> before
> then) unless somebody beats me to it.

I'd much rather just have APR not do any automatic cleanups on any of
the mutex types, and to leave it up to the app author to decide if they
want to register a cleanup or would rather just call the destroy()
methods themselves.

-aaron

p.s. Give Kelly my best. :)

Re: [PATCH] Re: apr_proc_mutex is broken

Posted by rb...@rkbloom.net.

On 19 Nov 2002, Philip Martin wrote:

> <rb...@rkbloom.net> writes:
>
> >  As for the app needing to know about every use of
> > fork/apr_proc_fork in every library, that isn't true.  The app needs to
> > know about every use of apr_terminate in every library.  It is the
> > apr_terminate call that is causing the problem, not the actual fork()
> > call.
>
> APR states that apr_terminate must always be called, and suggests
> installing apr_terminate as an atexit handler.  The only APR
> application I've worked on (Subversion) does this.  This means every
> fork-and-exit is a potential problem.
>
> Suppose I writing a library that uses proc_mutex.  I do not know
> whether the application has installed apr_terminate as an atexit
> handler.

If you have a library that is calling apr_terminate in a process that you
didn't create, then you most likely have a completely different bug.  :-)
The apr_terminate function is going to call apr_pool_destroy(global_pool).
However, global_pool is a static variable, which means that you can't just
destroy it without causing massive side-effects.  To combat this, we have
a simple counter in apr_initialize/apr_terminate.  For every call to
apr_initialize, we expect a call to apr_terminate.  If one is called
without the other, things won't work properly.

If you have a library that is calling apr_initialize, then you can safely
call apr_terminate and not worry about deleting a mutex that somebody else
still needs.  However, if you have a library that is calling apr_terminate
without calling apr_initialize, then you have a bug.

> >  However, if the app removes the proc_mutex cleanup before the new
> > process is spawned, it won't have to worry about this, because the child
> > process won't have a cleanup to kill.
>
> So in practice some (lots? most? all?) libraries/applications will
> kill it.  Why bother registering it?  Let the user choose to register
> such a cleanup if required.

No, you are missing the point.  In some instances, the mutex cleanup will
be killed, but in others it won't.  There are two factors to consider when
this decision is made.  What is the underlying mutex mechanism and what
is the mutex being used for.  Regardless, the cleanup should be
registered.  In fact, the problem that you are seeing has nothing to do
with the cleanup being registered, that is a red herring.  The problem is
with the cleanup not being unregisted properly.

Follow my logic please.

In the parent process:
	apr_initialize()
		Sets static counter to 1.

	apr_proc_mutex_create()
		registers the cleanup for you.  At this point, you can
safely drop this mutex on the floor, and things will continue to work
forever.

	apr_terminate()
		destroys the global_pool, mutex is destroyed cleanly.

In the child process:
	apr_initialize()
		increments the counter by 1, now it is 2.

	apr_proc_mutex_child_init()
		Depending on mutex type, the cleanup should be killed or
not.  In general, file-based mutex will be killed, but semaphores will
not.

	apr_terminate()
		counter is decremented, setting it to 1.  _NO_ pool is
destroyed, thus no mutex is deleted.

Now, the test program had two problems.
1)  The child processes were calling apr_terminate, but they weren't
calling apr_initialize.  This left our static counter out of sync, which
is what caused the bug.

2)  We aren't calling apr_proc_mutex_child_init, which is what makes the
mutex available to the child process.  We were relying on inheritance to
get the mutex to the child process.  If we had been calling this function
(and the function was fixed to kill the cleanup if appropriate), then we
would have masked the previous bug.

At the end of the day, the test was bogus, and it wasn't good APR code,
which is what was causing the problem.  There is one change that should be
made to the underlying APR code to help hide this from the user, but the
real problem is in the test.

> There are two possible errors:
>
> - The application can fail to kill the cleanup, in which case a
>   problem may occur if apr_terminate runs.  Whether this is a problem
>   depends on what the child processes do, if they only exit in
>   exceptional circumstances the bug may not be readily apparent.
>
> - Or the application can fail to destroy the mutex, a fairly
>   conventional resource deallocation bug.
>
> I spent some time debugging a Subversion pool cleanup handler that
> erroneously deleted a temporary file when forking a child process, I
> know I would prefer the latter!

But the latter is MUCH worse when it comes to locks.  You only have so
many semaphores available to you.  A resource leak can be a real PITA to
track down, and I suspect that this general bug is biting Apache in a
couple of places because we do leak semaphores.  I currently have 12
semaphores in use by Apache.  When I stop the currently running server, I
have 9 in use (all by the web server).

Resource leaks suck, and they are just as hard to diagnose as doing
something too often.

The real fix for this problem is to fix the test code as I have stated
above so that it is a valid example of how to write APR code, and to fix
the library so that we register cleanups with a public function, and so
that apr_proc_mutex_child_init does the right thing in all cases.  I will
do the work later tonight (assuming my wife doesn't go into labor before
then) unless somebody beats me to it.

Ryan

Re: [PATCH] Re: apr_proc_mutex is broken

Posted by Philip Martin <ph...@codematters.co.uk>.

<rb...@rkbloom.net> writes:

>  As for the app needing to know about every use of
> fork/apr_proc_fork in every library, that isn't true.  The app needs to
> know about every use of apr_terminate in every library.  It is the
> apr_terminate call that is causing the problem, not the actual fork()
> call.

APR states that apr_terminate must always be called, and suggests
installing apr_terminate as an atexit handler.  The only APR
application I've worked on (Subversion) does this.  This means every
fork-and-exit is a potential problem.

Suppose I writing a library that uses proc_mutex.  I do not know
whether the application has installed apr_terminate as an atexit
handler.  

>  However, if the app removes the proc_mutex cleanup before the new
> process is spawned, it won't have to worry about this, because the child
> process won't have a cleanup to kill.

So in practice some (lots? most? all?) libraries/applications will
kill it.  Why bother registering it?  Let the user choose to register
such a cleanup if required.

There are two possible errors:

- The application can fail to kill the cleanup, in which case a
  problem may occur if apr_terminate runs.  Whether this is a problem
  depends on what the child processes do, if they only exit in
  exceptional circumstances the bug may not be readily apparent.

- Or the application can fail to destroy the mutex, a fairly
  conventional resource deallocation bug.

I spent some time debugging a Subversion pool cleanup handler that
erroneously deleted a temporary file when forking a child process, I
know I would prefer the latter!

-- 
Philip Martin

Re: [PATCH] Re: apr_proc_mutex is broken

Posted by rb...@rkbloom.net.

On Tue, 19 Nov 2002, Aaron Bannert wrote:

> On Tuesday, November 19, 2002, at 09:49  PM, <rb...@rkbloom.net> wrote:
> > By allocating the mutex in the pool they have already stated that the
> > pool
> > should control the mutex's scope.  That is the meaning of allocating a
> > varibale inside a pool.  This is why pools are passed to ALL functions.
> >
> > BTW, you can't leave this up to the app.  The abstraction layer makes
> > it
> > too complex to have the app take care of this.  I thought at first that
> > the app should just kill the cleanup, but that is too hard.  If you
> > make
> > the app register the cleanup, you will have code like:
> >
> > if (mutex->type == (fcntl || flock)) {
> >     register_cleanup
> > else
> >     no-op
>
> Why is that? Perhaps that means that apr_proc_mutex_t is the wrong
> abstraction for these types.

It is because like all abstractions, this one isn't perfect.  However, it
is the best we have been able to find.  That is OK, but you have to
realize it.  Read this for more explanations about why abstractions leak:
http://www.joelonsoftware.com/articles/LeakyAbstractions.html.

The underlying technical reason is that some mutexes need to be cleaned up
in the child while others do not.  That is fine, but hide that from the
user.  The way to do that is with the logic changes that I have described
in detail already.

> > If you read ALL of those functions, they are all:
> >
> >     if ((rv = proc_mutex_(.*)_cleanup(mutex)) == APR_SUCCESS) {
> >         apr_pool_cleanup_kill(mutex->pool, mutex, (.*)_sysv_cleanup);
> >         return APR_SUCCESS;
> >     }
> >     return rv;
> >
> > which means that the cleanup as it is registered today is equivalent to
> > apr_proc_mutex_destroy.  The functions that they call are equivalent.
> > I
> > am making the change now to use a public function for the cleanup,
> > because
> > anything else is completely bogus.
>
> So there will have to be some internal changes (no critical interface
> changes)
> to make this work. Fine, but what will this break anything else?

Yes, and the internal changes are simple.  I have them mostly done, but I
got tired.  I'll try to finish them today sometime.  No promises, it may
take a few days.

> Let's talk about some use cases:
>
> 1) parent owns semaphore
>     children created and destroyed, after which parent destroys semaphore

This is the most common case, and thus the case that we default to.

> 2) parent creates semaphore and delegates ownership to a child

This is done by having the parent kill the cleanup (which is why the
cleanup must be done with a public function).  This is essentially
identical to what happens when you have _any_ resource that the parent
creates and then delegates to the child.  The parent must kill the
cleanup, and the child must register one.

> 3) one process creates semaphore, another non-related process attaches
> to
>     the semaphore, a third takes responsibility for destroying it
> ...

Same as case 2, except the child to register the cleanup is a different
process.

> To me it seems the only way to support all of these use cases would be
> to allow the app to explicitly create and destroy their own semaphores
> whenever they want without any magic cleanup handlers happen
> unexpectedly.

No.  This is just how APR works, and it is how it has always worked.  By
default, we install cleanup handlers, because that ensures that we don't
leak resources.  If APR didn't register the cleanup, then EVERY APR call
in Apache would look like:

apr_sometype_create(.....);
apr_pool_cleanup_register(....);

And, APR would be annoying to use, because the programmer would have to
really think about resources.  APR gives the most common case for free,
the one where the allocator owns the resource.  If you want to change
that, we have a rich pool API that allows you to do so.  However, in order
for that to work, APR must use the pool API properly, and it isn't in the
mutex code.

> > If you read my messages from earlier today, I have already answered
> > all of
> > the questions that you posted in each of the 5 messages you just sent.
> > But one more time:
> >
> > The parent process allocates the mutex out of whatever pool should
> > control
> > the scope.
>
> This is not in dispute. In testprocmutex, the mutex was created in the
> global
> pool, and since the parent global pool lives longer than the children
> (We call waitpid to synchronize the children) it was expected that the
> mutex would live until the end. This was a false assumption.

Aaron, you specifically stated in the message I was responding to that the
pool shouldn't control the scope.  That is the only reason I went into all
of this again.  Please, don't say that it isn't in dispute when you are
the one disputing it.

> > The child processesmust call apr_proc_mutex_child_init.  That function
> > defines the scope of the mutex for the child process.  If the mutex
> > type
> > requires a cleanup in the child process, then one is registered.
>
> Again, not in dispute. This still needs to be added, btw.

I know.

> > As for the bug that was hit.  Programs _MUST_ have an equal number of
> > apr_initialize and apr_terminate calls.  If it doesn't, then bugs like
> > this will happen.
>
> I'm glad we worked through and found this to be the problem, since it
> has shown me how broken this part of APR is. Could we solve this by
> having apr_proc_create() increment the global counter? Are there
> resources that will not be cleaned up in children if we always
> call ap_initialize() in child processes (eg. when a process dies
> on Windows, will it close all its files and sockets?)?

I am really sick of hearing you say "how broken ______ is."  Whenever you
find a part of the code that you didn't help to design/write, you complain
that it is horribly broken.  It isn't.  The apr_initialize/apr_terminate
code was added for a reason.  They need to balance each other out.  We
could have the apr_proc_(fork|create) calls call apr_initialize, but that
wouldn't solve this problem, because there is still fork(), which we have
no control over.  It would be much better to just document this problem
clearly.

Unless you can find a way to do this cleanly, hiding it behind the
apr_proc_foo calls is just going to make matters worse.

Ryan

Re: [PATCH] Re: apr_proc_mutex is broken

Posted by Aaron Bannert <aa...@clove.org>.

On Tuesday, November 19, 2002, at 09:49  PM, <rb...@rkbloom.net> wrote:
> By allocating the mutex in the pool they have already stated that the 
> pool
> should control the mutex's scope.  That is the meaning of allocating a
> varibale inside a pool.  This is why pools are passed to ALL functions.
>
> BTW, you can't leave this up to the app.  The abstraction layer makes 
> it
> too complex to have the app take care of this.  I thought at first that
> the app should just kill the cleanup, but that is too hard.  If you 
> make
> the app register the cleanup, you will have code like:
>
> if (mutex->type == (fcntl || flock)) {
>     register_cleanup
> else
>     no-op

Why is that? Perhaps that means that apr_proc_mutex_t is the wrong
abstraction for these types.

> If you read ALL of those functions, they are all:
>
>     if ((rv = proc_mutex_(.*)_cleanup(mutex)) == APR_SUCCESS) {
>         apr_pool_cleanup_kill(mutex->pool, mutex, (.*)_sysv_cleanup);
>         return APR_SUCCESS;
>     }
>     return rv;
>
> which means that the cleanup as it is registered today is equivalent to
> apr_proc_mutex_destroy.  The functions that they call are equivalent.  
> I
> am making the change now to use a public function for the cleanup, 
> because
> anything else is completely bogus.

So there will have to be some internal changes (no critical interface 
changes)
to make this work. Fine, but what will this break anything else?

Let's talk about some use cases:

1) parent owns semaphore
    children created and destroyed, after which parent destroys semaphore
2) parent creates semaphore and delegates ownership to a child
3) one process creates semaphore, another non-related process attaches 
to
    the semaphore, a third takes responsibility for destroying it
...

To me it seems the only way to support all of these use cases would be
to allow the app to explicitly create and destroy their own semaphores
whenever they want without any magic cleanup handlers happen 
unexpectedly.

> If you read my messages from earlier today, I have already answered 
> all of
> the questions that you posted in each of the 5 messages you just sent.
> But one more time:
>
> The parent process allocates the mutex out of whatever pool should 
> control
> the scope.

This is not in dispute. In testprocmutex, the mutex was created in the 
global
pool, and since the parent global pool lives longer than the children
(We call waitpid to synchronize the children) it was expected that the
mutex would live until the end. This was a false assumption.

> The child processesmust call apr_proc_mutex_child_init.  That function
> defines the scope of the mutex for the child process.  If the mutex 
> type
> requires a cleanup in the child process, then one is registered.

Again, not in dispute. This still needs to be added, btw.

> As for the bug that was hit.  Programs _MUST_ have an equal number of
> apr_initialize and apr_terminate calls.  If it doesn't, then bugs like
> this will happen.

I'm glad we worked through and found this to be the problem, since it
has shown me how broken this part of APR is. Could we solve this by
having apr_proc_create() increment the global counter? Are there
resources that will not be cleaned up in children if we always
call ap_initialize() in child processes (eg. when a process dies
on Windows, will it close all its files and sockets?)?

-aaron

Re: [PATCH] Re: apr_proc_mutex is broken

Posted by rb...@rkbloom.net.

On Tue, 19 Nov 2002, Aaron Bannert wrote:

> On Tuesday, November 19, 2002, at 12:34  PM, <rb...@rkbloom.net> wrote:
>
> >
> > On 19 Nov 2002, Philip Martin wrote:
> >
> >> <rb...@rkbloom.net> writes:
> >>
> >>> If you don't want the child process to destroy the mutex, then you
> >>> should
> >>> kill that cleanup in the child process.  That is why we have the
> >>> apr_pool_cleanup_kill API.
> >>
> >> Are you suggesting the application should do this, or the library?
> >
> > The app would have to do this.  The easiest way to do this, is to
> > change
> > APR so that all cleanups are registered as apr_proc_mutex_destroy.
> > Then,
> > the app just needs to call
> >
> >     apr_pool_cleanup_kill(pool, mutex, apr_proc_mutex_destroy);
> >
>
> This is silly, if you do not register any cleanups and let the app
> decide
> what it wants to do. The app can register a cleanup in a pool if they
> would like that pool to control the mutex's scope.

By allocating the mutex in the pool they have already stated that the pool
should control the mutex's scope.  That is the meaning of allocating a
varibale inside a pool.  This is why pools are passed to ALL functions.

BTW, you can't leave this up to the app.  The abstraction layer makes it
too complex to have the app take care of this.  I thought at first that
the app should just kill the cleanup, but that is too hard.  If you make
the app register the cleanup, you will have code like:

if (mutex->type == (fcntl || flock)) {
    register_cleanup
else
    no-op

The problem is that some lock types need to be cleaned in the child
process while others don't.

> > No new API required.  In fact, I actually consider it a VERY big bug
> > that
> > apps can't currently kill the cleanup.  To the best of my knowledge,
> > this
> > is the only APR sub-system where the APR developer doesn't have that
> > kind
> > of control.
>
> The cleanup function is private to the mutex implementation, it *can't*
> be made public. (There are many cleanup routines -- autoconf decides
> which
> one gets used.)

Aaron, I'm sorry, but you just aren't paying attention to the code.  For
all intents and purposes the cleanup function _is_ already public.  The
name of the function (as I've been saying all day), is
apr_proc_mutex_destroy.  As proof, read the code:

apr_proc_mutex_destroy calls mutex->meth->destroy

mutex->meth->destroy is always proc_mutex_(\.*)_destroy, where .* is one
of:
	proc_pthread, fcntl, flock, sysv.

If you read ALL of those functions, they are all:

    if ((rv = proc_mutex_(.*)_cleanup(mutex)) == APR_SUCCESS) {
        apr_pool_cleanup_kill(mutex->pool, mutex, (.*)_sysv_cleanup);
        return APR_SUCCESS;
    }
    return rv;

which means that the cleanup as it is registered today is equivalent to
apr_proc_mutex_destroy.  The functions that they call are equivalent.  I
am making the change now to use a public function for the cleanup, because
anything else is completely bogus.

If you read my messages from earlier today, I have already answered all of
the questions that you posted in each of the 5 messages you just sent.
But one more time:

The parent process allocates the mutex out of whatever pool should control
the scope.

The child processesmust call apr_proc_mutex_child_init.  That function
defines the scope of the mutex for the child process.  If the mutex type
requires a cleanup in the child process, then one is registered.

As for the bug that was hit.  Programs _MUST_ have an equal number of
apr_initialize and apr_terminate calls.  If it doesn't, then bugs like
this will happen.

Ryan

Re: [PATCH] Re: apr_proc_mutex is broken

Posted by Aaron Bannert <aa...@clove.org>.

On Tuesday, November 19, 2002, at 12:34  PM, <rb...@rkbloom.net> wrote:

>
> On 19 Nov 2002, Philip Martin wrote:
>
>> <rb...@rkbloom.net> writes:
>>
>>> If you don't want the child process to destroy the mutex, then you 
>>> should
>>> kill that cleanup in the child process.  That is why we have the
>>> apr_pool_cleanup_kill API.
>>
>> Are you suggesting the application should do this, or the library?
>
> The app would have to do this.  The easiest way to do this, is to 
> change
> APR so that all cleanups are registered as apr_proc_mutex_destroy.  
> Then,
> the app just needs to call
>
>     apr_pool_cleanup_kill(pool, mutex, apr_proc_mutex_destroy);
>

This is silly, if you do not register any cleanups and let the app 
decide
what it wants to do. The app can register a cleanup in a pool if they
would like that pool to control the mutex's scope.

> No new API required.  In fact, I actually consider it a VERY big bug 
> that
> apps can't currently kill the cleanup.  To the best of my knowledge, 
> this
> is the only APR sub-system where the APR developer doesn't have that 
> kind
> of control.

The cleanup function is private to the mutex implementation, it *can't*
be made public. (There are many cleanup routines -- autoconf decides 
which
one gets used.)

-aaron

Re: [PATCH] Re: apr_proc_mutex is broken

Posted by rb...@rkbloom.net.

On 19 Nov 2002, Philip Martin wrote:

> <rb...@rkbloom.net> writes:
>
> > If you don't want the child process to destroy the mutex, then you should
> > kill that cleanup in the child process.  That is why we have the
> > apr_pool_cleanup_kill API.
>
> Are you suggesting the application should do this, or the library?

The app would have to do this.  The easiest way to do this, is to change
APR so that all cleanups are registered as apr_proc_mutex_destroy.  Then,
the app just needs to call

    apr_pool_cleanup_kill(pool, mutex, apr_proc_mutex_destroy);

No new API required.  In fact, I actually consider it a VERY big bug that
apps can't currently kill the cleanup.  To the best of my knowledge, this
is the only APR sub-system where the APR developer doesn't have that kind
of control.

> Having the library do it doesn't sound very robust, when would you do
> it?  At apr_proc_fork time?  What about a third party library that
> doesn't use APR, calls fork directly and later runs atexit handlers?

Yes, I agree doing it in the library is incredibly non-robust.

> The application cannot do it at present because it has no access to
> the cleanup handler.  So this would mean adding a function to the
> proc_mutex interface, apr_proc_mutex_cleanup_kill perhaps?  We would
> need apr_global_mutex_cleanup_kill as well.  Even then, an application
> using a proc_mutex has to know about every use of fork/apr_proc_fork
> in every library.

See above, no new functions are needed, APR just needs to be modified to
use the external APIs that are already provided when it registers a
cleanup.  As for the app needing to know about every use of
fork/apr_proc_fork in every library, that isn't true.  The app needs to
know about every use of apr_terminate in every library.  It is the
apr_terminate call that is causing the problem, not the actual fork()
call.  However, if the app removes the proc_mutex cleanup before the new
process is spawned, it won't have to worry about this, because the child
process won't have a cleanup to kill.

The only case that is really hard to deal with here, is when the parent
process wants to keep the cleanup and the child process wants to kill it.
This is why we created the inherit APIs a while ago, and the locking
system should be migrated to use it.

> I think the current pool cleanup handler is a mistake.  The handler
> should cleanup only those resources local to the process, leaving the
> proc_mutex in a working state.  It should be the application's
> responsibility to call (or fail to call :) apr_proc_mutex_destroy at
> the appropriate place.

The hard part is figuring out which resource is local to the process.
Some locks need to be deleted in each child, some can't be.  The other
option to solve this, is to actually use the apr_proc_mutex_child_init
function correctly.  If the mutex type can't be deleted in the child, then
apr_proc_mutex_child_init should remove that cleanup.

Regardless, the lock code needs to use the correct cleanup function, which
is the public function, so that the APR developer _can_ remove the cleanup
if they want to.

Ryan

Re: [PATCH] Re: apr_proc_mutex is broken

Posted by Aaron Bannert <aa...@clove.org>.

On Tuesday, November 19, 2002, at 12:02  PM, Philip Martin wrote:
> I think the current pool cleanup handler is a mistake.  The handler
> should cleanup only those resources local to the process, leaving the
> proc_mutex in a working state.

This is how it is now, only that when we use fork(), all the cleanups
in the parent are magically inherited in the children. We should simply
just not register cleanups for mutexes, as you say here:

>   It should be the application's
> responsibility to call (or fail to call :) apr_proc_mutex_destroy at
> the appropriate place.

+1

-aaron

Re: [PATCH] Re: apr_proc_mutex is broken

Posted by Philip Martin <ph...@codematters.co.uk>.

<rb...@rkbloom.net> writes:

> If you don't want the child process to destroy the mutex, then you should
> kill that cleanup in the child process.  That is why we have the
> apr_pool_cleanup_kill API.

Are you suggesting the application should do this, or the library?

Having the library do it doesn't sound very robust, when would you do
it?  At apr_proc_fork time?  What about a third party library that
doesn't use APR, calls fork directly and later runs atexit handlers?

The application cannot do it at present because it has no access to
the cleanup handler.  So this would mean adding a function to the
proc_mutex interface, apr_proc_mutex_cleanup_kill perhaps?  We would
need apr_global_mutex_cleanup_kill as well.  Even then, an application
using a proc_mutex has to know about every use of fork/apr_proc_fork
in every library.

I think the current pool cleanup handler is a mistake.  The handler
should cleanup only those resources local to the process, leaving the
proc_mutex in a working state.  It should be the application's
responsibility to call (or fail to call :) apr_proc_mutex_destroy at
the appropriate place.

-- 
Philip Martin

Re: [PATCH] Re: apr_proc_mutex is broken

Posted by Aaron Bannert <aa...@clove.org>.

On Tuesday, November 19, 2002, at 10:46  AM, <rb...@rkbloom.net> wrote:
>
> If you don't want the child process to destroy the mutex, then you 
> should
> kill that cleanup in the child process.  That is why we have the
> apr_pool_cleanup_kill API.
>
Naw, we just shouldn't automatically destroy locks in a cleanup. 
Destroying
all mutexes should just become the responsibility of the application.

-aaron

Re: [PATCH] Re: apr_proc_mutex is broken

Posted by rb...@rkbloom.net.

On 19 Nov 2002, Jeff Trawick wrote:

> Philip Martin <ph...@codematters.co.uk> writes:
>
> > Jeff Trawick <tr...@attglobal.net> writes:
> >
> > > > > Huh?  I don't understand this.  The child process still destroys the
> > > > > semaphore.
> > > >
> > > > What code is causing the child process to destroy the semaphore?  That
> > > > isn't happening for me.  Only the parent is doing the semctl(IPC_RMID)
> > > > with my patch.
> > >
> > > add "on my system" at the end of the last sentence...
> >
> > Oops, my mistake, I failed to apply the complete patch.  It works here
> > too.
> >
> > I find disallowing apr_terminate an ugly solution :-(  I suppose it's
> > valid for the testsuite problem, but it doesn't look good for
> > applications in general.
>
> Yeah, this apparent restriction on apr_terminate()/pool cleanup is
> unexpected and ugly.

It is also a very bad idea.  :-)  Remember that apr_terminate/pool cleanup
is also going to perform other operations that are essential to properly
cleaning up the child process.

If you don't want the child process to destroy the mutex, then you should
kill that cleanup in the child process.  That is why we have the
apr_pool_cleanup_kill API.

Ryan

Re: [PATCH] Re: apr_proc_mutex is broken

Posted by Jeff Trawick <tr...@attglobal.net>.

Philip Martin <ph...@codematters.co.uk> writes:

> Jeff Trawick <tr...@attglobal.net> writes:
> 
> > > > Huh?  I don't understand this.  The child process still destroys the
> > > > semaphore.
> > > 
> > > What code is causing the child process to destroy the semaphore?  That
> > > isn't happening for me.  Only the parent is doing the semctl(IPC_RMID)
> > > with my patch.
> > 
> > add "on my system" at the end of the last sentence...
> 
> Oops, my mistake, I failed to apply the complete patch.  It works here
> too.
> 
> I find disallowing apr_terminate an ugly solution :-(  I suppose it's
> valid for the testsuite problem, but it doesn't look good for
> applications in general.

Yeah, this apparent restriction on apr_terminate()/pool cleanup is
unexpected and ugly.

-- 
Jeff Trawick | trawick@attglobal.net
Born in Roswell... married an alien...

Re: [PATCH] Re: apr_proc_mutex is broken

Posted by Philip Martin <ph...@codematters.co.uk>.

Jeff Trawick <tr...@attglobal.net> writes:

> > > Huh?  I don't understand this.  The child process still destroys the
> > > semaphore.
> > 
> > What code is causing the child process to destroy the semaphore?  That
> > isn't happening for me.  Only the parent is doing the semctl(IPC_RMID)
> > with my patch.
> 
> add "on my system" at the end of the last sentence...

Oops, my mistake, I failed to apply the complete patch.  It works here
too.

I find disallowing apr_terminate an ugly solution :-(  I suppose it's
valid for the testsuite problem, but it doesn't look good for
applications in general.

-- 
Philip Martin

Re: [PATCH] Re: apr_proc_mutex is broken

Posted by Jeff Trawick <tr...@attglobal.net>.

Jeff Trawick <tr...@attglobal.net> writes:

> Philip Martin <ph...@codematters.co.uk> writes:
> 
> > Jeff Trawick <tr...@attglobal.net> writes:
> > 
> > > I contemplated relatively-complex exit sequences so that children
> > > didn't exit until the test was over, but life is short, and these test
> > > programs should be short too.
> > > 
> > > Besides adding crude error checking (crude better than none) for some
> > > critical APR calls, this patch ensures that apr_terminate() is not
> > > called in the child processes.
> > 
> > Huh?  I don't understand this.  The child process still destroys the
> > semaphore.
> 
> What code is causing the child process to destroy the semaphore?  That
> isn't happening for me.  Only the parent is doing the semctl(IPC_RMID)
> with my patch.

add "on my system" at the end of the last sentence...

-- 
Jeff Trawick | trawick@attglobal.net
Born in Roswell... married an alien...

Re: [PATCH] Re: apr_proc_mutex is broken

Posted by Jeff Trawick <tr...@attglobal.net>.

Philip Martin <ph...@codematters.co.uk> writes:

> Jeff Trawick <tr...@attglobal.net> writes:
> 
> > I contemplated relatively-complex exit sequences so that children
> > didn't exit until the test was over, but life is short, and these test
> > programs should be short too.
> > 
> > Besides adding crude error checking (crude better than none) for some
> > critical APR calls, this patch ensures that apr_terminate() is not
> > called in the child processes.
> 
> Huh?  I don't understand this.  The child process still destroys the
> semaphore.

What code is causing the child process to destroy the semaphore?  That
isn't happening for me.  Only the parent is doing the semctl(IPC_RMID)
with my patch.

-- 
Jeff Trawick | trawick@attglobal.net
Born in Roswell... married an alien...

Re: [PATCH] Re: apr_proc_mutex is broken

Posted by Philip Martin <ph...@codematters.co.uk>.

Jeff Trawick <tr...@attglobal.net> writes:

> I contemplated relatively-complex exit sequences so that children
> didn't exit until the test was over, but life is short, and these test
> programs should be short too.
> 
> Besides adding crude error checking (crude better than none) for some
> critical APR calls, this patch ensures that apr_terminate() is not
> called in the child processes.

Huh?  I don't understand this.  The child process still destroys the
semaphore.  When a child exits there is nothing that guarantees that
the other children have finished.  It depends on the OS scheduler,
unless the children implement some explicit synchronisation.

Unsurprisingly, I get

$ ./testprocmutex  
APR Proc Mutex Test
==============

Exclusive lock test
    Initializing the lock                                   OK
    Starting all of the processes                           lt-testprocmutex: testprocmutex.c:93: make_child: Assertion `apr_proc_mutex_unlock(proc_lock) == 0' failed.
lt-testprocmutex: testprocmutex.c:93: make_child: Assertion `apr_proc_mutex_unlock(proc_lock) == 0' failed.
OK
lt-testprocmutex: testprocmutex.c:93: make_child: Assertion `apr_proc_mutex_unlock(proc_lock) == 0' failed.
    Waiting for processes to exit                           OK
Locks don't appear to work!  x = 4003 instead of 16000

> In reality it would be nice to set up a proc mutex such that it is
> only destroyed up in the parent.  I suspect that this is reasonable
> default behavior, in which case no API change would be necessary.

It's more than "nice"!  Without some change to the current behaviour
it is not possible for processes to exit unless they implement some
sort of synchronisation.

> This could be implemented by storing a pid_t in the internal mutex
> representation and comparing that with getpid() in the cleanup before
> deciding whether to tell the kernel to destroy the mutex.  The exact
> details would vary between mutex types.  For file-based mutexes, there
> is no destroy syscall, and the existing logic to close the file still
> needs to be performed regardless of parent or child.

Of course the test should check return values, but that's not enough,
even if the pool cleanup is fixed.  The test still doesn't really test
that the proc_mutex implementation provides any mutex behaviour.
Given that the old test often passed even after the mutex had been
destroyed, it is likely that the test would pass with a proc_mutex
implementation consisting of stubs that return APR_SUCCESS, or on a
platform where the underlying semaphore is broken.  That's not much of
a test (or my expectations of the test suite are too high).

> Index: testprocmutex.c
> ===================================================================
> RCS file: /home/cvs/apr/test/testprocmutex.c,v
> retrieving revision 1.10
> diff -u -r1.10 testprocmutex.c
> --- testprocmutex.c	9 Apr 2002 06:45:06 -0000	1.10
> +++ testprocmutex.c	19 Nov 2002 13:28:10 -0000
> @@ -60,6 +60,7 @@
>  #include "apr_general.h"
>  #include "apr_getopt.h"
>  #include "errno.h"
> +#include <assert.h>
>  #include <stdio.h>
>  #include <stdlib.h>
>  #include "test_apr.h"
> @@ -70,7 +71,7 @@
>  
>  apr_proc_mutex_t *proc_lock;
>  apr_pool_t *pool;
> -int *x;
> +int *x, *y;
>  
>  static int make_child(apr_proc_t **proc, apr_pool_t *p)
>  {
> @@ -82,14 +83,14 @@
>  
>      if (apr_proc_fork(*proc, p) == APR_INCHILD) {
>          while (1) {
> -            apr_proc_mutex_lock(proc_lock); 
> +            assert(apr_proc_mutex_lock(proc_lock) == APR_SUCCESS);
>              if (i == MAX_ITER) {
> -                apr_proc_mutex_unlock(proc_lock); 
> +                assert(apr_proc_mutex_unlock(proc_lock) == APR_SUCCESS);
>                  exit(1);
>              }
>              i++;
>              (*x)++;

This line is the problem, (*x)++ is too fast to be a good test of the
mutex, it's simply too few instructions for there to be a much chance
of the OS scheduling another process (unless one happens to be using
an SMP machine).  The sequence "read, wait, increment, write" is a
much better test as the wait requires the other processes to block.
The test doesn't have to run the loop thousands of times, half a dozen
would do, but it does have to hold the mutex for a "significant"
length of time.

> -            apr_proc_mutex_unlock(proc_lock); 
> -            apr_proc_mutex_unlock(proc_lock); 
> +            assert(apr_proc_mutex_unlock(proc_lock) == APR_SUCCESS);
>          }
>          exit(1);
>      }

-- 
Philip Martin

Re: [PATCH] Re: apr_proc_mutex is broken

Posted by rb...@rkbloom.net.

On Tue, 19 Nov 2002 rbb@rkbloom.net wrote:

> On Tue, 19 Nov 2002, Justin Erenkrantz wrote:
>
> > --On Tuesday, November 19, 2002 8:39 AM -0500 Jeff Trawick
> > <tr...@attglobal.net> wrote:
> >
> > > I contemplated relatively-complex exit sequences so that children
> > > didn't exit until the test was over, but life is short, and these
> > > test programs should be short too.
> >
> > Okay, here are the problems that we identified yesterday.
> > Unfortunately, we didn't have time to indicate to the list what our
> > resolution is.  Hence, you guys are somewhat out of the loop and I
> > think are going off in the wrong direction here.

BTW, it is absolutely impossible for us to be "out of the loop" in this
instance.  Yes, it is great that there is a conference and that you guys
are getting face-time.  That doesn't mean that it is valid to decide what
the correct solution is without bringing it to the list.  If you think
that we are going in the wrong direction, then please tell us WHY we are
going the wrong direction.  I personally read this paragraph and basically
started reading the rest of the message with the intent of proving you
wrong.

The reality is that your solution isn't that far away from my solution.
But, mine came with a very exacting explanation of what the test was doing
wrong.  If you would please explain your logic, we can have a technical
discussion about the bug now.

Ryan

Re: [PATCH] Re: apr_proc_mutex is broken

Posted by rb...@rkbloom.net.

On Tue, 19 Nov 2002, Justin Erenkrantz wrote:

> --On Tuesday, November 19, 2002 8:39 AM -0500 Jeff Trawick
> <tr...@attglobal.net> wrote:
>
> > I contemplated relatively-complex exit sequences so that children
> > didn't exit until the test was over, but life is short, and these
> > test programs should be short too.
>
> Okay, here are the problems that we identified yesterday.
> Unfortunately, we didn't have time to indicate to the list what our
> resolution is.  Hence, you guys are somewhat out of the loop and I
> think are going off in the wrong direction here.
>
> 1) testprocmutex isn't calling apr_proc_mutex_child_init
>    (note that httpd does)
> 2) SysVSem's proc mutex isn't doing anything on child_init
> 3) In the SysVSem cleanup, semctl isn't properly identifying errors
> 4) All children are running the cleanups where only the parent should
>
> What we said we should do is that the child_init call should remove
> the cleanup from the children (since it can do the cleanup_kill as
> the SysV cleanup is a static call within proc_mutex.c).  The proc
> mutex cleanups should only be done by the parent not each child.

I don't care if the code _can_ do the cleanup with a static function.  It
SHOULDN'T do the cleanup with a static function.  By using a static
function, you have taken a great deal of flexibility away from the APR
developer, because they can't remove the cleanup anymore.  That is
completely bogus.

There is absolutely no way to say that the proc mutex cleanups should only
be done by the parent, not the child, and even if there was, it would be
wrong.

There are two types of proc mutexes, file-backed and semaphore.  As I
explained in detail, file-backed mutexes need to be cleaned up in both
parent and child, while semaphores should only be cleaned in the parent.

> I believe Aaron's initial intent of his reply to Philip was that he
> couldn't reproduce it, but that he (and the rest of the Las Vegas
> gang) did know exactly what to do to fix this correctly.  -- justin

Well, you have missed the most important fix, which is to fix the test
case.  As I explained in detail earlier today, the test program is bogus.
If the test program had been correct, this bug would not have happened.

I posted the exact fix for this earlier today.  It is three fold, and it
is the only fix that has been posted that will fix this problem correctly.

1)  apr_proc_mutex_child_init needs to do the right thing for all mutex
types.  File-backed mutexes leave it alone, other mutexes get the cleanup
removed.

2)  All mutex code must use apr_*_mutex_destroy when registering pool
cleanups.  In fact, throughout the APR code, the pool cleanups must be
registered with public functions.

3)  The test case must be fixed so that it doesn't call apr_terminate
without calling apr_initialize.

If you see an error in my logic, please tell me.  But do not tell me that
a bunch of you got together off-line and decided how to fix this.
According to what you are saying above, your fix is not correct, and it
doesn't solve the problem.  See above for details.

Ryan

Re: [PATCH] Re: apr_proc_mutex is broken

Posted by Justin Erenkrantz <je...@apache.org>.

--On Tuesday, November 19, 2002 8:39 AM -0500 Jeff Trawick 
<tr...@attglobal.net> wrote:

> I contemplated relatively-complex exit sequences so that children
> didn't exit until the test was over, but life is short, and these
> test programs should be short too.

Okay, here are the problems that we identified yesterday. 
Unfortunately, we didn't have time to indicate to the list what our 
resolution is.  Hence, you guys are somewhat out of the loop and I 
think are going off in the wrong direction here.

1) testprocmutex isn't calling apr_proc_mutex_child_init
   (note that httpd does)
2) SysVSem's proc mutex isn't doing anything on child_init
3) In the SysVSem cleanup, semctl isn't properly identifying errors
4) All children are running the cleanups where only the parent should

What we said we should do is that the child_init call should remove 
the cleanup from the children (since it can do the cleanup_kill as 
the SysV cleanup is a static call within proc_mutex.c).  The proc 
mutex cleanups should only be done by the parent not each child.

I believe Aaron's initial intent of his reply to Philip was that he 
couldn't reproduce it, but that he (and the rest of the Las Vegas 
gang) did know exactly what to do to fix this correctly.  -- justin

Re: [PATCH] Re: apr_proc_mutex is broken

Posted by rb...@rkbloom.net.

On 19 Nov 2002, Jeff Trawick wrote:

> <rb...@rkbloom.net> writes:
>
> > This would be a bad design change IMNSHO.  The library has no business
> > deciding when a mutex is destroyed, that is the role of the application
> > that created the mutex.  It isn't too hard to imagine a situation where a
> > mutex is created in the parent process, but the child processes are the
> > only ones that know anything at all about when it is safe to destroy it.
> > The app must be the thing that decides when to close the mutex.
>
> Perhaps I'm missing something obvious, but the app doesn't seem to
> have much control over when the mutex is destroyed.  Either it calls
> apr_terminate() or not, which isn't a very nice way to control it.

You already responded to my apr_pool_cleanup_kill post, so this question
is moot now.   :-)

> An app could do what Apache does and allocate the mutex from a pool
> which is never cleaned up (app doesn't call apr_terminate() but is
> careful to clean up all pools except for the one with the mutex in
> it).

Which leaves my reason for even responding to this note.  If this is
actually what Apache is doing, then Apache has a huge bug.  Apr_terminate
shouldn't have any concept of a pool that shouldn't be destroyed, all it
is doing, is destroying the global_pool, which should be the parent of all
pools.  Apache should definately switch to having the child processes call
apr_terminate() correctly.  For that matter, Apache child processes should
also probably call apr_initialize().  Having a pool that never gets
cleaned up is asking for somebody to misuse the pool and to have resources
leak.

Ryan

Re: [PATCH] Re: apr_proc_mutex is broken

Posted by Jeff Trawick <tr...@attglobal.net>.

<rb...@rkbloom.net> writes:

> This would be a bad design change IMNSHO.  The library has no business
> deciding when a mutex is destroyed, that is the role of the application
> that created the mutex.  It isn't too hard to imagine a situation where a
> mutex is created in the parent process, but the child processes are the
> only ones that know anything at all about when it is safe to destroy it.
> The app must be the thing that decides when to close the mutex.

Perhaps I'm missing something obvious, but the app doesn't seem to
have much control over when the mutex is destroyed.  Either it calls
apr_terminate() or not, which isn't a very nice way to control it.

An app could do what Apache does and allocate the mutex from a pool
which is never cleaned up (app doesn't call apr_terminate() but is
careful to clean up all pools except for the one with the mutex in
it).
-- 
Jeff Trawick | trawick@attglobal.net
Born in Roswell... married an alien...

Re: [PATCH] Re: apr_proc_mutex is broken

Posted by Aaron Bannert <aa...@clove.org>.

So are you saying that we should not ever automatically destroy
locks in cleanups?

-aaron


On Tuesday, November 19, 2002, at 08:59  AM, <rb...@rkbloom.net> wrote:
> This would be a bad design change IMNSHO.  The library has no business
> deciding when a mutex is destroyed, that is the role of the application
> that created the mutex.  It isn't too hard to imagine a situation 
> where a
> mutex is created in the parent process, but the child processes are the
> only ones that know anything at all about when it is safe to destroy 
> it.
> The app must be the thing that decides when to close the mutex.
>
> Ryan

Re: [PATCH] Re: apr_proc_mutex is broken

Posted by rb...@rkbloom.net.

On 19 Nov 2002, Jeff Trawick wrote:

> I contemplated relatively-complex exit sequences so that children
> didn't exit until the test was over, but life is short, and these test
> programs should be short too.
>
> Besides adding crude error checking (crude better than none) for some
> critical APR calls, this patch ensures that apr_terminate() is not
> called in the child processes.
>
> In reality it would be nice to set up a proc mutex such that it is
> only destroyed up in the parent.  I suspect that this is reasonable
> default behavior, in which case no API change would be necessary.
> This could be implemented by storing a pid_t in the internal mutex
> representation and comparing that with getpid() in the cleanup before
> deciding whether to tell the kernel to destroy the mutex.  The exact
> details would vary between mutex types.  For file-based mutexes, there
> is no destroy syscall, and the existing logic to close the file still
> needs to be performed regardless of parent or child.

This would be a bad design change IMNSHO.  The library has no business
deciding when a mutex is destroyed, that is the role of the application
that created the mutex.  It isn't too hard to imagine a situation where a
mutex is created in the parent process, but the child processes are the
only ones that know anything at all about when it is safe to destroy it.
The app must be the thing that decides when to close the mutex.

Ryan

Re: [PATCH] Re: apr_proc_mutex is broken

Posted by rb...@rkbloom.net.

This patch also highlights another problem with our current test suite.
Please, stop putting assert() statements in the test suite.  Asserts are
not a valid way to determine if a test was successful, because they stop
all other tests from running.

Ryan

>  static int make_child(apr_proc_t **proc, apr_pool_t *p)
>  {
> @@ -82,14 +83,14 @@
>
>      if (apr_proc_fork(*proc, p) == APR_INCHILD) {
>          while (1) {
> -            apr_proc_mutex_lock(proc_lock);
> +            assert(apr_proc_mutex_lock(proc_lock) == APR_SUCCESS);
>              if (i == MAX_ITER) {
> -                apr_proc_mutex_unlock(proc_lock);
> +                assert(apr_proc_mutex_unlock(proc_lock) == APR_SUCCESS);
>                  exit(1);
>              }
>              i++;
>              (*x)++;
> -            apr_proc_mutex_unlock(proc_lock);
> +            assert(apr_proc_mutex_unlock(proc_lock) == APR_SUCCESS);
>          }
>          exit(1);
>      }
> @@ -155,7 +156,6 @@
>      printf("APR Proc Mutex Test\n==============\n\n");
>
>      apr_initialize();
> -    atexit(apr_terminate);
>
>      if (apr_pool_create(&pool, NULL) != APR_SUCCESS)
>          exit(-1);
> @@ -178,15 +178,22 @@
>          exit(-1);
>      }
>
> -    apr_shm_create(&shm, sizeof(int), shmname, pool);
> +    assert(apr_shm_create(&shm, 2 * sizeof(int), shmname, pool) == APR_SUCCESS);
>      x = apr_shm_baseaddr_get(shm);
> +    y = x + 1;
>
>      if ((rv = test_exclusive(lockname)) != APR_SUCCESS) {
>          fprintf(stderr,"Exclusive Lock test failed : [%d] %s\n",
>                  rv, apr_strerror(rv, (char*)errmsg, 200));
>          exit(-2);
>      }
> -
> +
> +    /* if the child processes run apr_terminate() when they exit, the
> +     * mutex will be destroyed and remaining children will be left with
> +     * a useless mutex
> +     */
> +    apr_terminate();
> +
>      return 0;
>  }
>
>
> --
> Jeff Trawick | trawick@attglobal.net
> Born in Roswell... married an alien...
>

[PATCH] Re: apr_proc_mutex is broken

Posted by Jeff Trawick <tr...@attglobal.net>.

I contemplated relatively-complex exit sequences so that children
didn't exit until the test was over, but life is short, and these test
programs should be short too.

Besides adding crude error checking (crude better than none) for some
critical APR calls, this patch ensures that apr_terminate() is not
called in the child processes.

In reality it would be nice to set up a proc mutex such that it is
only destroyed up in the parent.  I suspect that this is reasonable
default behavior, in which case no API change would be necessary.
This could be implemented by storing a pid_t in the internal mutex
representation and comparing that with getpid() in the cleanup before
deciding whether to tell the kernel to destroy the mutex.  The exact
details would vary between mutex types.  For file-based mutexes, there
is no destroy syscall, and the existing logic to close the file still
needs to be performed regardless of parent or child.

Index: testprocmutex.c
===================================================================
RCS file: /home/cvs/apr/test/testprocmutex.c,v
retrieving revision 1.10
diff -u -r1.10 testprocmutex.c
--- testprocmutex.c	9 Apr 2002 06:45:06 -0000	1.10
+++ testprocmutex.c	19 Nov 2002 13:28:10 -0000
@@ -60,6 +60,7 @@
 #include "apr_general.h"
 #include "apr_getopt.h"
 #include "errno.h"
+#include <assert.h>
 #include <stdio.h>
 #include <stdlib.h>
 #include "test_apr.h"
@@ -70,7 +71,7 @@
 
 apr_proc_mutex_t *proc_lock;
 apr_pool_t *pool;
-int *x;
+int *x, *y;
 
 static int make_child(apr_proc_t **proc, apr_pool_t *p)
 {
@@ -82,14 +83,14 @@
 
     if (apr_proc_fork(*proc, p) == APR_INCHILD) {
         while (1) {
-            apr_proc_mutex_lock(proc_lock); 
+            assert(apr_proc_mutex_lock(proc_lock) == APR_SUCCESS);
             if (i == MAX_ITER) {
-                apr_proc_mutex_unlock(proc_lock); 
+                assert(apr_proc_mutex_unlock(proc_lock) == APR_SUCCESS);
                 exit(1);
             }
             i++;
             (*x)++;
-            apr_proc_mutex_unlock(proc_lock); 
+            assert(apr_proc_mutex_unlock(proc_lock) == APR_SUCCESS);
         }
         exit(1);
     }
@@ -155,7 +156,6 @@
     printf("APR Proc Mutex Test\n==============\n\n");
         
     apr_initialize();
-    atexit(apr_terminate);
 
     if (apr_pool_create(&pool, NULL) != APR_SUCCESS)
         exit(-1);
@@ -178,15 +178,22 @@
         exit(-1);
     }
 
-    apr_shm_create(&shm, sizeof(int), shmname, pool);
+    assert(apr_shm_create(&shm, 2 * sizeof(int), shmname, pool) == APR_SUCCESS);
     x = apr_shm_baseaddr_get(shm);
+    y = x + 1;
 
     if ((rv = test_exclusive(lockname)) != APR_SUCCESS) {
         fprintf(stderr,"Exclusive Lock test failed : [%d] %s\n",
                 rv, apr_strerror(rv, (char*)errmsg, 200));
         exit(-2);
     }
-    
+  
+    /* if the child processes run apr_terminate() when they exit, the
+     * mutex will be destroyed and remaining children will be left with
+     * a useless mutex
+     */ 
+    apr_terminate(); 
+
     return 0;
 }
 

-- 
Jeff Trawick | trawick@attglobal.net
Born in Roswell... married an alien...

Re: apr_proc_mutex is broken

Posted by Jeff Trawick <tr...@attglobal.net>.

Philip Martin <ph...@codematters.co.uk> writes:

> Jeff Trawick <tr...@attglobal.net> writes:
> 
> > I think the problem is in the test programs (e.g., testprocmutex).  As
> > soon as one child hits the specified number of iterations, the child
> > will exit and something like this will happen:
> > 
> >   #0  proc_mutex_sysv_cleanup (mutex_=0x804e608) at proc_mutex.c:205
> >   #1  0x4002dfa1 in run_cleanups (c=0x804e660) at apr_pools.c:1973
> >   #2  0x4002d59e in apr_pool_destroy (pool=0x804e4f8) at apr_pools.c:755
> >   #3  0x4002d58a in apr_pool_destroy (pool=0x804a4e8) at apr_pools.c:752
> >   #4  0x4002d24a in apr_pool_terminate () at apr_pools.c:585
> >   #5  0x4002a45d in apr_terminate () at start.c:117
> > 
> > And of course as soon as semctl(IPC_RMID) is done, the lock is broken.
> 
> No question in my mind.  The problem is that Aaron cannot reproduce
> it...

not a problem at all that Aaron can't hit the problem :)  it is
clearly broken just by inspection 

even with my setup, which never showed testprocmutex failing, strace
shows the mutex getting cleaned up by one process while other
processes are still trying to use them... and we know that broken
mutexes *may* lead to the sort of bogosity that testprocmutex checks
for, but the counter doesn't automatically get out of sync just because
the mutexes aren't used, so testprocmutex will often *not* realize the
snafu...

[pid 21884] semop(4915210, 0x400315b2, 1 <unfinished ...>
[pid 21881] semctl(4915210, 0, 0x100 /* SEM_??? */, 0xbffff598 <unfinished ...>
[pid 21884] <... semop resumed> )       = 0
[pid 21883] <... semop resumed> )       = -1 EIDRM (Identifier removed)
[pid 21881] <... semctl resumed> )      = 0
[pid 21884] semop(4915210, 0x400315ac, 1 <unfinished ...>
[pid 21882] <... semop resumed> )       = -1 EIDRM (Identifier removed)
[pid 21881] shmctl(11698257, 0x100 /* SHM_??? */, 0 <unfinished ...>
[pid 21884] <... semop resumed> )       = -1 EINVAL (Invalid argument)

-- 
Jeff Trawick | trawick@attglobal.net
Born in Roswell... married an alien...

Re: apr_proc_mutex is broken

Posted by Philip Martin <ph...@codematters.co.uk>.

Jeff Trawick <tr...@attglobal.net> writes:

> I think the problem is in the test programs (e.g., testprocmutex).  As
> soon as one child hits the specified number of iterations, the child
> will exit and something like this will happen:
> 
>   #0  proc_mutex_sysv_cleanup (mutex_=0x804e608) at proc_mutex.c:205
>   #1  0x4002dfa1 in run_cleanups (c=0x804e660) at apr_pools.c:1973
>   #2  0x4002d59e in apr_pool_destroy (pool=0x804e4f8) at apr_pools.c:755
>   #3  0x4002d58a in apr_pool_destroy (pool=0x804a4e8) at apr_pools.c:752
>   #4  0x4002d24a in apr_pool_terminate () at apr_pools.c:585
>   #5  0x4002a45d in apr_terminate () at start.c:117
> 
> And of course as soon as semctl(IPC_RMID) is done, the lock is broken.

No question in my mind.  The problem is that Aaron cannot reproduce
it...

-- 
Philip Martin

Re: apr_proc_mutex is broken

Posted by Jeff Trawick <tr...@attglobal.net>.

Philip Martin <ph...@codematters.co.uk> writes:

> I'm using the following patch to the testsuite.  It increases the
> amount of contention for the shared data, and causes the tests to fail
> reliably on my SMP workstation, using ReiserFS, ext2 and tmpfs, as
> well as on my non-SMP laptop using ext3.  Aaron is having problems
> reproducing the bug, perhaps someone else could try.
> 
> Index: test/testglobalmutex.c
> ===================================================================
> RCS file: /home/cvspublic/apr/test/testglobalmutex.c,v
> retrieving revision 1.3
> diff -u -r1.3 testglobalmutex.c
> --- test/testglobalmutex.c	9 Apr 2002 06:45:06 -0000	1.3
> +++ test/testglobalmutex.c	19 Nov 2002 00:48:46 -0000
> @@ -65,7 +65,7 @@
>  #include "test_apr.h"
>  
>  
> -#define MAX_ITER 4000
> +#define MAX_ITER 40
>  #define MAX_COUNTER (MAX_ITER * 4)
>  
>  apr_global_mutex_t *global_lock;
> @@ -82,13 +82,16 @@
>  
>      if (apr_proc_fork(*proc, p) == APR_INCHILD) {
>          while (1) {
> +            int x_t;
>              apr_global_mutex_lock(global_lock); 
>              if (i == MAX_ITER) {
>                  apr_global_mutex_unlock(global_lock); 
>                  exit(1);
>              }
>              i++;
> -            (*x)++;
> +            x_t = *x;
> +            apr_sleep(1);
> +            *x = x_t + 1;
>              apr_global_mutex_unlock(global_lock); 
>          }
>          exit(1);
> Index: test/testprocmutex.c
> ===================================================================
> RCS file: /home/cvspublic/apr/test/testprocmutex.c,v
> retrieving revision 1.10
> diff -u -r1.10 testprocmutex.c
> --- test/testprocmutex.c	9 Apr 2002 06:45:06 -0000	1.10
> +++ test/testprocmutex.c	19 Nov 2002 00:48:46 -0000
> @@ -65,7 +65,7 @@
>  #include "test_apr.h"
>  
>  
> -#define MAX_ITER 4000
> +#define MAX_ITER 40
>  #define MAX_COUNTER (MAX_ITER * 4)
>  
>  apr_proc_mutex_t *proc_lock;
> @@ -82,13 +82,16 @@
>  
>      if (apr_proc_fork(*proc, p) == APR_INCHILD) {
>          while (1) {
> +            int x_t;
>              apr_proc_mutex_lock(proc_lock); 
>              if (i == MAX_ITER) {
>                  apr_proc_mutex_unlock(proc_lock); 
>                  exit(1);
>              }
>              i++;
> -            (*x)++;
> +            x_t = *x;
> +            apr_sleep(1);
> +            *x = x_t + 1;
>              apr_proc_mutex_unlock(proc_lock); 
>          }
>          exit(1);
> 
> 
> Disabling the pool cleanup with this patch causes the tests to pass.
> This is obviously not a solution to the problem, it's simply a
> demonstration of the cause of the problem.

I think the problem is in the test programs (e.g., testprocmutex).  As
soon as one child hits the specified number of iterations, the child
will exit and something like this will happen:

  #0  proc_mutex_sysv_cleanup (mutex_=0x804e608) at proc_mutex.c:205
  #1  0x4002dfa1 in run_cleanups (c=0x804e660) at apr_pools.c:1973
  #2  0x4002d59e in apr_pool_destroy (pool=0x804e4f8) at apr_pools.c:755
  #3  0x4002d58a in apr_pool_destroy (pool=0x804a4e8) at apr_pools.c:752
  #4  0x4002d24a in apr_pool_terminate () at apr_pools.c:585
  #5  0x4002a45d in apr_terminate () at start.c:117

And of course as soon as semctl(IPC_RMID) is done, the lock is broken.

Here is IMHO what you should try:

1)  restore proper cleanup code
2)  tweak make_child() in the test program to trap errors from
    apr_proc_mutex_lock() and apr_proc_mutex_unlock() 

I bet you are hitting errors with the mutex, but they're not being
reported right now.  And of course if the mutex doesn't work right the
counter in shared memory is no longer interesting either.

-- 
Jeff Trawick | trawick@attglobal.net
Born in Roswell... married an alien...

Re: apr_proc_mutex is broken

Posted by Philip Martin <ph...@codematters.co.uk>.

Aaron Bannert <aa...@clove.org> writes:

> On Tue, Nov 19, 2002 at 12:34:17AM +0000, Philip Martin wrote:
> > Aaron Bannert <aa...@clove.org> writes:
> > 
> > > On Mon, Nov 18, 2002 at 09:25:08PM +0000, Philip Martin wrote:
> > > > Philip Martin <ph...@codematters.co.uk> writes:
> > > > 
> > > > [snip patch]
> > > > 
> > > > > That should increase contention for the shared data, even a non-SMP
> > > > > machine might show the bug.  The original test sometimes passes on my
> > > > > machine, with the above patch it always fails
> > > 
> > > What filesystem are you using?
> > 
> > ReiserFS, until now :)  Just tried ext2 and tmpfs and it fails there
> > as well.
> 
> Are you sure? I'm having a terrible time reproducing this. I 've been adding
> in error catching code all day and I'm seeing the problem you mentioned
> for when the shm file already exists, but if it doesn't exist I can't
> reproduce it.
> 
> I wouldn't at all be surprised if reiser has a problem with sysv shm on
> SMP boxes under that kernel. FWIW I've tried it on nfs and tmpfs... *shrug*
> 
> [we should probably bring this back to the dev@apr list, feel free to reply
> to this to the list :) ]

[dev@apr CC'd]

I'm using the following patch to the testsuite.  It increases the
amount of contention for the shared data, and causes the tests to fail
reliably on my SMP workstation, using ReiserFS, ext2 and tmpfs, as
well as on my non-SMP laptop using ext3.  Aaron is having problems
reproducing the bug, perhaps someone else could try.

Index: test/testglobalmutex.c
===================================================================
RCS file: /home/cvspublic/apr/test/testglobalmutex.c,v
retrieving revision 1.3
diff -u -r1.3 testglobalmutex.c
--- test/testglobalmutex.c	9 Apr 2002 06:45:06 -0000	1.3
+++ test/testglobalmutex.c	19 Nov 2002 00:48:46 -0000
@@ -65,7 +65,7 @@
 #include "test_apr.h"
 
 
-#define MAX_ITER 4000
+#define MAX_ITER 40
 #define MAX_COUNTER (MAX_ITER * 4)
 
 apr_global_mutex_t *global_lock;
@@ -82,13 +82,16 @@
 
     if (apr_proc_fork(*proc, p) == APR_INCHILD) {
         while (1) {
+            int x_t;
             apr_global_mutex_lock(global_lock); 
             if (i == MAX_ITER) {
                 apr_global_mutex_unlock(global_lock); 
                 exit(1);
             }
             i++;
-            (*x)++;
+            x_t = *x;
+            apr_sleep(1);
+            *x = x_t + 1;
             apr_global_mutex_unlock(global_lock); 
         }
         exit(1);
Index: test/testprocmutex.c
===================================================================
RCS file: /home/cvspublic/apr/test/testprocmutex.c,v
retrieving revision 1.10
diff -u -r1.10 testprocmutex.c
--- test/testprocmutex.c	9 Apr 2002 06:45:06 -0000	1.10
+++ test/testprocmutex.c	19 Nov 2002 00:48:46 -0000
@@ -65,7 +65,7 @@
 #include "test_apr.h"
 
 
-#define MAX_ITER 4000
+#define MAX_ITER 40
 #define MAX_COUNTER (MAX_ITER * 4)
 
 apr_proc_mutex_t *proc_lock;
@@ -82,13 +82,16 @@
 
     if (apr_proc_fork(*proc, p) == APR_INCHILD) {
         while (1) {
+            int x_t;
             apr_proc_mutex_lock(proc_lock); 
             if (i == MAX_ITER) {
                 apr_proc_mutex_unlock(proc_lock); 
                 exit(1);
             }
             i++;
-            (*x)++;
+            x_t = *x;
+            apr_sleep(1);
+            *x = x_t + 1;
             apr_proc_mutex_unlock(proc_lock); 
         }
         exit(1);


Disabling the pool cleanup with this patch causes the tests to pass.
This is obviously not a solution to the problem, it's simply a
demonstration of the cause of the problem.

Index: locks/unix/proc_mutex.c
===================================================================
RCS file: /home/cvspublic/apr/locks/unix/proc_mutex.c,v
retrieving revision 1.20
diff -u -r1.20 proc_mutex.c
--- locks/unix/proc_mutex.c	18 Nov 2002 01:59:03 -0000	1.20
+++ locks/unix/proc_mutex.c	19 Nov 2002 00:54:44 -0000
@@ -205,10 +205,12 @@
     apr_proc_mutex_t *mutex=mutex_;
     union semun ick;
     
+#if 0
     if (mutex->interproc->filedes != -1) {
         ick.val = 0;
         semctl(mutex->interproc->filedes, 0, IPC_RMID, ick);
     }
+#endif
     return APR_SUCCESS;
 }    
 

-- 
Philip Martin

Re: apr_proc_mutex is broken

Posted by Philip Martin <ph...@codematters.co.uk>.

Philip Martin <ph...@codematters.co.uk> writes:

> > I just ripped out that code, want to give this another try on your
> > SMP Linux box?
> 
> It still fails.
> 
> As I said, I don't know much about semaphores, but I think the problem
> is something to do with process exit.

The pool cleanup handler

static apr_status_t proc_mutex_sysv_cleanup(void *mutex_)
{
    apr_proc_mutex_t *mutex=mutex_;
    union semun ick;
    
    if (mutex->interproc->filedes != -1) {
        ick.val = 0;
        semctl(mutex->interproc->filedes, 0, IPC_RMID, ick);
    }
    return APR_SUCCESS;
}    

runs when each subprocess in the test exits.  So the first subprocess
to exit will destroy the semaphore, even though the other processes
are using it.

-- 
Philip Martin

Re: apr_proc_mutex is broken

Posted by Philip Martin <ph...@codematters.co.uk>.

Aaron Bannert <aa...@clove.org> writes:

> On Sun, Nov 17, 2002 at 12:25:46PM +0000, Philip Martin wrote:
> >   $ ./testprocmutex 
> >   APR Proc Mutex Test
> >   ==============
> > 
> >   Exclusive lock test
> >       Initializing the lock                                   OK
> >       Starting all of the processes                           OK
> >       Waiting for processes to exit                           OK
> >   Locks don't appear to work!  x = 15998 instead of 16000
> > 
> > The test is multi-process, but the processes are single threaded.  I
> > guess the problem lies somewhere in the semaphores, but I don't have
> > any experience of using those.
> 
> I just ripped out that code, want to give this another try on your
> SMP Linux box?

It still fails.

As I said, I don't know much about semaphores, but I think the problem
is something to do with process exit.  I changed proc_mutex.c as
follows

Index: locks/unix/proc_mutex.c
===================================================================
RCS file: /home/cvspublic/apr/locks/unix/proc_mutex.c,v
retrieving revision 1.20
diff -u -r1.20 proc_mutex.c
--- locks/unix/proc_mutex.c     18 Nov 2002 01:59:03 -0000      1.20
+++ locks/unix/proc_mutex.c     18 Nov 2002 16:19:25 -0000
@@ -892,6 +892,8 @@
 
 APR_DECLARE(apr_status_t) apr_proc_mutex_unlock(apr_proc_mutex_t *mutex)
 {
+    if (!mutex->curr_locked)
+        abort();
     return mutex->meth->release(mutex);
 }

and now the test fails like so

$ ./testprocmutex 
APR Proc Mutex Test
==============

Exclusive lock test
    Initializing the lock                                   OK
    Starting all of the processes                           OK
    Waiting for processes to exit                           OK
Locks don't appear to work!  x = 4003 instead of 16000


The test now fails nearly every time and the "x" value is much
lower.  Next I changed the test code as follows

Index: testprocmutex.c
===================================================================
RCS file: /home/cvspublic/apr/test/testprocmutex.c,v
retrieving revision 1.10
diff -u -r1.10 testprocmutex.c
--- testprocmutex.c     9 Apr 2002 06:45:06 -0000       1.10
+++ testprocmutex.c     18 Nov 2002 16:23:00 -0000
@@ -85,7 +85,8 @@
             apr_proc_mutex_lock(proc_lock); 
             if (i == MAX_ITER) {
                 apr_proc_mutex_unlock(proc_lock); 
-                exit(1);
+                apr_sleep (1000000);
+                exit(0);
             }
             i++;
             (*x)++;

and now the test always passes.  I believe this is because the loops
all reach MAX_ITER before any of the subprocesses exit, and so any
damage to the semaphore doesn't affect the "x" value.

-- 
Philip Martin

Re: apr_proc_mutex is broken

Posted by Aaron Bannert <aa...@clove.org>.

On Sun, Nov 17, 2002 at 12:25:46PM +0000, Philip Martin wrote:
>   $ ./testprocmutex 
>   APR Proc Mutex Test
>   ==============
> 
>   Exclusive lock test
>       Initializing the lock                                   OK
>       Starting all of the processes                           OK
>       Waiting for processes to exit                           OK
>   Locks don't appear to work!  x = 15998 instead of 16000
> 
> The test is multi-process, but the processes are single threaded.  I
> guess the problem lies somewhere in the semaphores, but I don't have
> any experience of using those.

I just ripped out that code, want to give this another try on your
SMP Linux box?

-aaron

Re: apr_proc_mutex is broken

Posted by Philip Martin <ph...@codematters.co.uk>.

Philip Martin <ph...@codematters.co.uk> writes:

> The first problem is the line
>     if (apr_os_thread_equal(mutex->owner, apr_os_thread_current())) {
> where there is access to the shared data mutex->owner without any sort
> of synchronization.  Now mutex->owner may not be an atomic type, in
> which case a totally bogus value could be obtained, and if it is an
> atomic type the unsynchronized access is still pointless.
> 
> However the real problem is that this is supposed to be a *process*
> lock, and yet it is comparing *thread* IDs.  Thread IDs are distinct
> within a process, but there is no guarantee that they are distinct
> across mutiple processes.  Comparing thread IDs from two separate
> processes is undefined behaviour when using POSIX threads.  There is
> at least one common platform (GNU glibc threads) where thread IDs are
> duplicated.  When I run APR's testprocmutex test on a 2-way SMP Linux
> box it regularly fails (that it doesn't always fail is, I suspect,
> because it is not a particularly good test and so the processes often
> complete without any mutex contention).

I've looked at the proc mutex code again, and things are less clear.

I see now that apr_proc_mutex_create and apr_proc_mutex_unlock both
set the mutex->owner field to zero to indicate an invalid mutex.  This
is not valid for a POSIX thread system because a) zero may be a valid
thread ID, and b) passing an "invented" thread ID to pthread_equal is
undefined behaviour.

However it may well work on a Linux glibc 2.2.5 system where I believe
pthread_t is an unsigned long and zero is not used as a thread ID.  It
also means that my complaint about comparing thread IDs from different
processes does not apply (I was assuming mutex->owner was initialized
to the thread ID of the thread that created the proc mutex, as that is
the only available valid thread ID).

Despite the problems with the proc mutex code I cannot identify one
which would cause the test failure I am seeing.  I'm running the test
on an SMP (dual P3) Linux machine and it fails about one run in three.
Here is the output of a typical failure

  $ ./testprocmutex 
  APR Proc Mutex Test
  ==============

  Exclusive lock test
      Initializing the lock                                   OK
      Starting all of the processes                           OK
      Waiting for processes to exit                           OK
  Locks don't appear to work!  x = 15998 instead of 16000

The test is multi-process, but the processes are single threaded.  I
guess the problem lies somewhere in the semaphores, but I don't have
any experience of using those.

-- 
Philip Martin

Re: apr_proc_mutex is broken

Posted by Aaron Bannert <aa...@clove.org>.

On Sun, Nov 17, 2002 at 06:06:20PM -0800, Aaron Bannert wrote:
> On Sat, Nov 16, 2002 at 06:05:05PM -0800, Brian Pane wrote:
> > Intraprocess mutexes have the same problem.  Because mutex->owner
> > is of type apr_os_thread_t, there's no guarantee that it's small
> > enough to be read or written atomically.
> 
> apr_thread_mutex_t will have a problem if mutex->nested can not be
> atomically loaded. I can't come up with a good way support nested
> mutexes w/o introducing another mutex though, any ideas?

I take that back, you are correct, mutex->owner will be the problem.

-aaron

Re: apr_proc_mutex is broken

Posted by Aaron Bannert <aa...@clove.org>.

On Sat, Nov 16, 2002 at 06:05:05PM -0800, Brian Pane wrote:
> Intraprocess mutexes have the same problem.  Because mutex->owner
> is of type apr_os_thread_t, there's no guarantee that it's small
> enough to be read or written atomically.

apr_thread_mutex_t will have a problem if mutex->nested can not be
atomically loaded. I can't come up with a good way support nested
mutexes w/o introducing another mutex though, any ideas?

-aaron

Re: apr_proc_mutex is broken

Posted by Brian Pane <br...@cnet.com>.

On Sat, 2002-11-16 at 17:39, Aaron Bannert wrote:
> Looks like a bug to me. Any reason we need nested locks on cross-process
> mutexes? I'm tempted to just rip out that part and see what breaks.

Intraprocess mutexes have the same problem.  Because mutex->owner
is of type apr_os_thread_t, there's no guarantee that it's small
enough to be read or written atomically.

Brian

Re: apr_proc_mutex is broken

Posted by Aaron Bannert <aa...@clove.org>.

Looks like a bug to me. Any reason we need nested locks on cross-process
mutexes? I'm tempted to just rip out that part and see what breaks.

-aaron


On Sat, Nov 16, 2002 at 11:36:20PM +0000, Philip Martin wrote:
> Hello
> 
> The following issue was raised in the Subversion issue tracker
> 
> http://subversion.tigris.org/issues/show_bug.cgi?id=996
> 
> "While tracking down Bug 995, I've found out that apr's testglobalmutex
>  and testprocmutex are failing for me when running in SMP mode on a
>  Dual P3.  The same tests do not fail when running in Single Processor
>  mode or running on my Dual Athlon in SMP mode.  I am not certain if
>  this related to Bug 995, but it does appear that httpd is going into a
>  deadlock and the subversion cmdline client eventually times out.  I've
>  also noticed that apache leaks semaphores when this happens (from
>  ipcs)."
> 
> I'm not really sure why the submitter chose to report it to
> Subversion, and not (as far as I can tell) to APR, but I took a look
> at the proc_mutex code and there does appear to be a bug.
> 
> The documentation for apr_proc_mutex_create states that a proc mutex
> is intended to synchronize processes.  Consider
> 
> APR_DECLARE(apr_status_t) apr_proc_mutex_lock(apr_proc_mutex_t *mutex)
> {
>     apr_status_t rv;
> 
> #if APR_HAS_THREADS
>     if (apr_os_thread_equal(mutex->owner, apr_os_thread_current())) {
>         mutex->owner_ref++;
>         return APR_SUCCESS;
>     }
> #endif
> 
>     if ((rv = mutex->meth->acquire(mutex)) != APR_SUCCESS) {
>         return rv;
>     }
> 
> #if APR_HAS_THREADS
>     mutex->owner = apr_os_thread_current();
>     mutex->owner_ref = 1;
> #endif
> 
>     return APR_SUCCESS;
> }
> 
> The first problem is the line
>     if (apr_os_thread_equal(mutex->owner, apr_os_thread_current())) {
> where there is access to the shared data mutex->owner without any sort
> of synchronization.  Now mutex->owner may not be an atomic type, in
> which case a totally bogus value could be obtained, and if it is an
> atomic type the unsynchronized access is still pointless.
> 
> However the real problem is that this is supposed to be a *process*
> lock, and yet it is comparing *thread* IDs.  Thread IDs are distinct
> within a process, but there is no guarantee that they are distinct
> across mutiple processes.  Comparing thread IDs from two separate
> processes is undefined behaviour when using POSIX threads.  There is
> at least one common platform (GNU glibc threads) where thread IDs are
> duplicated.  When I run APR's testprocmutex test on a 2-way SMP Linux
> box it regularly fails (that it doesn't always fail is, I suspect,
> because it is not a particularly good test and so the processes often
> complete without any mutex contention).
> 
> -- 
> Philip Martin