You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@avalon.apache.org by "Peter M. Goldstein" <pe...@yahoo.com> on 2002/10/17 23:45:39 UTC

Nasty Synchronization Bug in DefaultPool

Hello All,

I'm one of the committers for the James project.  In the course of some
recent changes, we've run into a rather nasty bug in the DefaultPool
implementation.  I'd like to describe the bug, and a couple of the facts
surrounding the bug, and then get some ideas as to which of several
options constitute the best fix.

Here's the story.  Up until now James has run using a timeout mechanism
that severely hampered its maximum throughput.  This, and a severe
per-connection memory leak, limited the throughput.  Basically the SMTP
engine handled about 6 connections/second, and crashed shortly
thereafter.

At this point I and a couple of other James developers have now resolved
these two problems, resulting in a markedly improved throughput - about
50 connections/second - but we now encounter another problem.

When we run this code for any lengthy period of time, we encounter the
following exception within the first 15 minutes or so:

thread-manager: Pool interrupted while waiting for lock.
java.lang.InterruptedException
at org.apache.avalon.excalibur.pool.Mutex.acquire(Mutex.java:40)
at
org.apache.avalon.excalibur.pool.DefaultPool.put(DefaultPool.java:174)
at
org.apache.avalon.excalibur.thread.impl.WorkerThread.run(WorkerThread.ja
va:113) 

As we read the code base, this exception occurs because one of the
threads in our org.apache.avalon.excalibur.impl.DefaultThreadPool is
being returned to the pool in an interrupted state.  My first question
is, are we required to ensure that all of our threads are not
interrupted before they get returned to the pool?  I don't see any
comments that indicate this, but it is a reasonable contract for the
thread pool.  I want to know if this is indeed part of that contract.

Ok, so this doesn't look too bad at this point.  These exceptions are
relatively infrequent, and if they didn't have any terrible internal
consequences we could live with them.  And indeed, the server appears to
recover just fine and return to its former throughput for quite a while.

But this exception actually indicates a much more serious problem than
it appears.  When we examined the DefaultPool and Mutex implementations
we find the following corruption of the Mutex - 

m_mutex.acquire() in DefaultPool generates an exception, and fails to
decrement the internal Mutex m_tokens value.  m_tokens remains at 1.
The finally clause in DefaultPool is executed, which results in a call
to m_mutex.release(), which increments m_tokens to a value of 2.  This
corrupts the Mutex, and basically allows multiple threads to alter the
pool simultaneously from here on in.  This can happen multiple times,
each time increasing the maximum value of m_tokens.  This allows more
and more simultaneous modifications of the pool.  We wind up seeing
exceptions like:

java.lang.IndexOutOfBoundsException: Index: -1, Size: 41
	at java.util.ArrayList.RangeCheck(ArrayList.java:491)
	at java.util.ArrayList.remove(ArrayList.java:375)
	at
org.apache.avalon.excalibur.pool.DefaultPool.put(DefaultPool.java:176)
	at
org.apache.avalon.excalibur.thread.impl.WorkerThread.run(WorkerThread.ja
va:113)

which are clearly a result of these simultaneous modifications.

Personally, I don't understand why acquire() checks the interrupted
status at all.  It seems to me that shutdown of the pool should happen
at a higher level than the mutex, so why is the mutex sensitive to the
interrupted exception?

For the thread pool, this problem gets even worse.  The exception
propagates all the way up, causing the WorkerThread to exit.  But a
reference to the WorkerThread remains in the pool, on the active list.
This prevents the thread from being cleaned up.  It leads to "dead"
entries in the thread pool and causes the unnecessary expansion of the
pool.

Eventually these problems lead to a total corruption of the thread pool.
The thread pool starts returning null when it gets a worker thread, and
this causes all the connections to fail.  So the problem is
catastrophic.

There are obviously a bunch of not necessarily exclusive ways to address
this problem:

i) Change the mutex so it isn't sensitive to the interrupt status of the
calling thread

ii) Move the m_mutex.acquire() call out of the try/catch/finally block
that contains the m_mutex.release() call, have it record any error
state, reacquire the mutex in case of error, and remove the object from
the pool entirely in the error case.

iii) Change the thread pool to reset the interrupt status of the thread
being returned before it attempts to put it in the pool.

iv) Eliminate the mutex entirely and replace it with a synchronized
block - this may have a performance impact.

This is a big problem for James, and we'd really appreciate some help
from the Avalon developers.  Right now we're putting some finally
clauses in our run() methods to ensure that the threads are returned to
the pool in a non-interrupted state.  But we don't know if the problem
is a thread we're managing, or something inside Phoenix.  We're in the
midst of testing.  Some thoughts on the above questions and the ways in
which we might resolve these problems would be great.

--Peter



--
To unsubscribe, e-mail:   <ma...@jakarta.apache.org>
For additional commands, e-mail: <ma...@jakarta.apache.org>


Re: Nasty Synchronization Bug in DefaultPool

Posted by Peter Donald <pe...@apache.org>.
On Fri, 18 Oct 2002 07:45, Peter M. Goldstein wrote:
> There are obviously a bunch of not necessarily exclusive ways to address
> this problem:

Thanks for finding this problem and giving such a good explanation.

> i) Change the mutex so it isn't sensitive to the interrupt status of the
> calling thread

We want to be able to interupt aquires so I didn't implement this but ...

> ii) Move the m_mutex.acquire() call out of the try/catch/finally block
> that contains the m_mutex.release() call, have it record any error
> state, reacquire the mutex in case of error, and remove the object from
> the pool entirely in the error case.

I applied Gregs version of this.

> iii) Change the thread pool to reset the interrupt status of the thread
> being returned before it attempts to put it in the pool.

Did this.

> iv) Eliminate the mutex entirely and replace it with a synchronized
> block - this may have a performance impact.

Thats how the pool in event package works but I didn't do this because I 
didn't want to tempt fate ;)

> This is a big problem for James, and we'd really appreciate some help
> from the Avalon developers.  Right now we're putting some finally
> clauses in our run() methods to ensure that the threads are returned to
> the pool in a non-interrupted state.  But we don't know if the problem
> is a thread we're managing, or something inside Phoenix.  We're in the
> midst of testing.  Some thoughts on the above questions and the ways in
> which we might resolve these problems would be great.

It should be fixed. Download latest excalibur source, rebuild pool and thread 
packages and it should work for you.

-- 
Cheers,

Peter Donald
---------------------------------------------------
"Therefore it can be said that victorious warriors 
win first, and then go to battle, while defeated 
warriors go to battle first, and then seek to win." 
              - Sun Tzu, the Art Of War
--------------------------------------------------- 


--
To unsubscribe, e-mail:   <ma...@jakarta.apache.org>
For additional commands, e-mail: <ma...@jakarta.apache.org>


RE: Nasty Synchronization Bug in DefaultPool

Posted by "Peter M. Goldstein" <pe...@yahoo.com>.
Peter and Greg,

> > Anyways, could you try the patch I am attaching and see if code
starts
> > functioning correctly? It will still be throwing those harmless
> > InterruptedException in case the thread is returned in interrupted
> > state, but the corruption should cease (unless there are other
places
> > where the pattern is not used, last time I checked there were some)
> 
> APplied the patch. Could you make sure I didn't butcher anything as I
also
> did
> a bit of hand hacking.

Thanks much.  I'll test over the course of the next 24 hours and let you
know how it turns out.

--Peter



--
To unsubscribe, e-mail:   <ma...@jakarta.apache.org>
For additional commands, e-mail: <ma...@jakarta.apache.org>


Re: Nasty Synchronization Bug in DefaultPool

Posted by Peter Donald <pe...@apache.org>.
On Fri, 18 Oct 2002 08:18, Greg Steuck wrote:
> Anyways, could you try the patch I am attaching and see if code starts
> functioning correctly? It will still be throwing those harmless
> InterruptedException in case the thread is returned in interrupted
> state, but the corruption should cease (unless there are other places
> where the pattern is not used, last time I checked there were some)

APplied the patch. Could you make sure I didn't butcher anything as I also did 
a bit of hand hacking.

-- 
Cheers,

Peter Donald
*-----------------------------------------------------*
* "Faced with the choice between changing one's mind, *
* and proving that there is no need to do so - almost *
* everyone gets busy on the proof."                   *
*              - John Kenneth Galbraith               *
*-----------------------------------------------------* 


--
To unsubscribe, e-mail:   <ma...@jakarta.apache.org>
For additional commands, e-mail: <ma...@jakarta.apache.org>


RE: Nasty Synchronization Bug in DefaultPool

Posted by "Peter M. Goldstein" <pe...@yahoo.com>.
Greg,

Thanks for the quick response.

>     Peter> i) Change the mutex so it isn't sensitive to the interrupt
>     Peter> status of the calling thread
> 
> You do want acquire() to be interruptible.

Why?  This is what I still don't quite understand.  Why should acquire
check the interrupt state of the calling thread?  I can easily imagine
the following situation.

Object obj = pool.get();

try {
  ... some long, possibly interruptible operation
} finally {
  pool.put(obj);
}

Why should the pool.put call fail if the thread has been interrupted?

>     Peter> ii) Move the m_mutex.acquire() call out of the
>     Peter> try/catch/finally block that contains the m_mutex.release()
>     Peter> call, have it record any error state, reacquire the mutex
in
>     Peter> case of error, and remove the object from the pool entirely
>     Peter> in the error case.
> 
> I say a variation of the above is a correct fix.

Ok.  Willing to buy that.

 
>     Peter> iii) Change the thread pool to reset the interrupt status
of
>     Peter> the thread being returned before it attempts to put it in
the
>     Peter> pool.
> 
> Looks reasonable, the pool can't rely on its users to return
everything
> in clean and dandy state.

I agree.  But DefaultThreadPool is a widely used class and we clearly
don't want to change its contract.  I'd want to clarify what the
contract is first.
 
> Anyways, could you try the patch I am attaching and see if code starts
> functioning correctly? It will still be throwing those harmless
> InterruptedException in case the thread is returned in interrupted
> state, but the corruption should cease (unless there are other places
> where the pattern is not used, last time I checked there were some)

The corruption ceasing is definitely good.

The patch you attach does resolve some of the problems - the mutex is
never corrupted and the IndexOutOfBoundsExceptions cease.  However,
size() continues to give the wrong answer since the interrupted thread
is never removed from the m_active list.  So we still get an eventual
failure if we have enough interrupted threads returned to the pool.

--Peter



--
To unsubscribe, e-mail:   <ma...@jakarta.apache.org>
For additional commands, e-mail: <ma...@jakarta.apache.org>


Re: Nasty Synchronization Bug in DefaultPool

Posted by Greg Steuck <gr...@nest.cx>.
>>>>> "Peter" == Peter M Goldstein <pe...@yahoo.com> writes:

    Peter> But this exception actually indicates a much more serious
    Peter> problem than it appears.  When we examined the DefaultPool
    Peter> and Mutex implementations we find the following corruption of
    Peter> the Mutex -

You are absolutely right about this being a more serious problem than it
appears to be. It is a case of mutex use-pattern violation. It's gotta
be:
    mutex.acquire();
    try {
      // do protected stuff
    } finally {
      mutex.release();
    }

Anything else is a bug waiting to happen. As you found the hard way.

    Peter> i) Change the mutex so it isn't sensitive to the interrupt
    Peter> status of the calling thread

You do want acquire() to be interruptible.

    Peter> ii) Move the m_mutex.acquire() call out of the
    Peter> try/catch/finally block that contains the m_mutex.release()
    Peter> call, have it record any error state, reacquire the mutex in
    Peter> case of error, and remove the object from the pool entirely
    Peter> in the error case.

I say a variation of the above is a correct fix.

    Peter> iii) Change the thread pool to reset the interrupt status of
    Peter> the thread being returned before it attempts to put it in the
    Peter> pool.

Looks reasonable, the pool can't rely on its users to return everything
in clean and dandy state.

Anyways, could you try the patch I am attaching and see if code starts
functioning correctly? It will still be throwing those harmless
InterruptedException in case the thread is returned in interrupted
state, but the corruption should cease (unless there are other places
where the pattern is not used, last time I checked there were some)

Thanks
Greg