You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@felix.apache.org by Felix Meschberger <fm...@adobe.com> on 2012/02/13 12:06:36 UTC

"Deadlock" on Felix.acquireBundleLock

Hi,

I recently got a number of stacktraces, where a variable number of threads is blocked on "m_bundleLock.wait()" inside Felix.acquireBundleLock. From pure code inspection, it should not possible for the framework to stall there.

It seems that Guillaume Nodet also hit this problem and added code to throw an IllegalStateException if the wait() call is interrupted (FELIX-2784 [1]).

Other times I hit a similar issue in the Felix.acquireGlobalLock method for which I reported FELIX-3067 [2].

It looks like both lock acquisition methods are prone to some kind of deadlock. The acquireGlobalLock method has the ability to fail by reporting such failure with a return code. The acquireBundleLock has no ability to fail (except throwing an exception).

I think similar to my FELIX-3067 proposal the acquireBundleLock method should only wait a limited time and then retry acquiring the lock. Only if this fails for a number of times, acquisition should be aborted and the method fail with an IllegalStateException.

WDYT ?

Regards
Felix

PS: We are still running 3.0.8 but the acquireBundleLock and acquireGlobalLock are the same as in trunk (except for Guillaume's FELIX-2784 fix).

[1] https://issues.apache.org/jira/browse/FELIX-2784
[2] https://issues.apache.org/jira/browse/FELIX-3067


Re: "Deadlock" on Felix.acquireBundleLock

Posted by "Richard S. Hall" <he...@ungoverned.org>.
On 2/14/12 03:51 , Felix Meschberger wrote:
> Hi,
>
> Am 13.02.2012 um 15:43 schrieb Richard S. Hall:
>
>> On 2/13/12 06:06 , Felix Meschberger wrote:
>>> Hi,
>>>
>>> I recently got a number of stacktraces, where a variable number of threads is blocked on "m_bundleLock.wait()" inside Felix.acquireBundleLock. From pure code inspection, it should not possible for the framework to stall there.
>>>
>>> It seems that Guillaume Nodet also hit this problem and added code to throw an IllegalStateException if the wait() call is interrupted (FELIX-2784 [1]).
>>>
>>> Other times I hit a similar issue in the Felix.acquireGlobalLock method for which I reported FELIX-3067 [2].
>>>
>>> It looks like both lock acquisition methods are prone to some kind of deadlock. The acquireGlobalLock method has the ability to fail by reporting such failure with a return code. The acquireBundleLock has no ability to fail (except throwing an exception).
>>>
>>> I think similar to my FELIX-3067 proposal the acquireBundleLock method should only wait a limited time and then retry acquiring the lock. Only if this fails for a number of times, acquisition should be aborted and the method fail with an IllegalStateException.
>>>
>>> WDYT ?
>> The biggest issue I have is that these sorts of fixes simply address the
>> symptoms and may make it more difficult to figure out the root cause.
>> Understandably, these issues are notoriously difficult to reproduce...
> Couldn't agree more.
>
>> Conceptually, the code should not allow deadlocks between bundle lock
>> acquirers and global lock acquirers, much like readers and writers. The
>> global lock (i.e., write lock) cannot be acquired if there are any
>> bundle lock holders (i.e., read lock).
> Yes, but I have had a situation where three threads (FelixStartLevel starting bundles, FelixPackageAdmin refreshing bundles and an installer thread installing bundles (and configuration)). All three were contending for the same bundle lock without one of these obviously holding the global lock.
>
> The result was a complete stand-still.
>
> I will try to reproduce this issue and find out and come back with my findings.

That would be great, because technically, you shouldn't be acquiring 
more than one bundle lock at a time unless you are trying to promote to 
a global lock. Bundle locks are "one at a time", although there may be 
synchronous event situations that unknowingly violate this, so we should 
look to fix those if they exist.

-> richard

>
> Regards
> Felix
>
>> Bundle lock operations should only impact one bundle at time, of course,
>> this isn't always the case since synchronous event delivery introduces
>> some difficult here. All multiple bundle operations (e.g., resolves,
>> refreshes) must acquire the global. The only difficulty is that
>> sometimes you start off with a bundle lock and need to acquire the
>> global lock later because you didn't know in advance you would need the
>> global lock (e.g., starting a bundle and then determining you need to
>> resolve it or for resolving dynamic imports), in which case the bundle
>> lock needs to be promoted to a global lock, which leads to some
>> confusing code about trying to determine if there are cycles among
>> bundles trying to promote bundle locks.
>>
>> Yeah, it's confusing and complicated, but still it would be better to
>> get a somewhat reproducible case and try to figure out if there is a
>> better solution than just introducing more error cases...
>>
>> But if that's not possible, then we can do what we have to do...
>>
>> ->  richard
>>
>>> Regards
>>> Felix
>>>
>>> PS: We are still running 3.0.8 but the acquireBundleLock and acquireGlobalLock are the same as in trunk (except for Guillaume's FELIX-2784 fix).
>>>
>>> [1] https://issues.apache.org/jira/browse/FELIX-2784
>>> [2] https://issues.apache.org/jira/browse/FELIX-3067
>>>

Re: "Deadlock" on Felix.acquireBundleLock

Posted by Felix Meschberger <fm...@adobe.com>.
Hi,

Am 13.02.2012 um 15:43 schrieb Richard S. Hall:

> On 2/13/12 06:06 , Felix Meschberger wrote:
>> Hi,
>> 
>> I recently got a number of stacktraces, where a variable number of threads is blocked on "m_bundleLock.wait()" inside Felix.acquireBundleLock. From pure code inspection, it should not possible for the framework to stall there.
>> 
>> It seems that Guillaume Nodet also hit this problem and added code to throw an IllegalStateException if the wait() call is interrupted (FELIX-2784 [1]).
>> 
>> Other times I hit a similar issue in the Felix.acquireGlobalLock method for which I reported FELIX-3067 [2].
>> 
>> It looks like both lock acquisition methods are prone to some kind of deadlock. The acquireGlobalLock method has the ability to fail by reporting such failure with a return code. The acquireBundleLock has no ability to fail (except throwing an exception).
>> 
>> I think similar to my FELIX-3067 proposal the acquireBundleLock method should only wait a limited time and then retry acquiring the lock. Only if this fails for a number of times, acquisition should be aborted and the method fail with an IllegalStateException.
>> 
>> WDYT ?
> 
> The biggest issue I have is that these sorts of fixes simply address the 
> symptoms and may make it more difficult to figure out the root cause. 
> Understandably, these issues are notoriously difficult to reproduce...

Couldn't agree more.

> 
> Conceptually, the code should not allow deadlocks between bundle lock 
> acquirers and global lock acquirers, much like readers and writers. The 
> global lock (i.e., write lock) cannot be acquired if there are any 
> bundle lock holders (i.e., read lock).

Yes, but I have had a situation where three threads (FelixStartLevel starting bundles, FelixPackageAdmin refreshing bundles and an installer thread installing bundles (and configuration)). All three were contending for the same bundle lock without one of these obviously holding the global lock.

The result was a complete stand-still.

I will try to reproduce this issue and find out and come back with my findings.

Regards
Felix

> 
> Bundle lock operations should only impact one bundle at time, of course, 
> this isn't always the case since synchronous event delivery introduces 
> some difficult here. All multiple bundle operations (e.g., resolves, 
> refreshes) must acquire the global. The only difficulty is that 
> sometimes you start off with a bundle lock and need to acquire the 
> global lock later because you didn't know in advance you would need the 
> global lock (e.g., starting a bundle and then determining you need to 
> resolve it or for resolving dynamic imports), in which case the bundle 
> lock needs to be promoted to a global lock, which leads to some 
> confusing code about trying to determine if there are cycles among 
> bundles trying to promote bundle locks.
> 
> Yeah, it's confusing and complicated, but still it would be better to 
> get a somewhat reproducible case and try to figure out if there is a 
> better solution than just introducing more error cases...
> 
> But if that's not possible, then we can do what we have to do...
> 
> -> richard
> 
>> 
>> Regards
>> Felix
>> 
>> PS: We are still running 3.0.8 but the acquireBundleLock and acquireGlobalLock are the same as in trunk (except for Guillaume's FELIX-2784 fix).
>> 
>> [1] https://issues.apache.org/jira/browse/FELIX-2784
>> [2] https://issues.apache.org/jira/browse/FELIX-3067
>> 


Re: "Deadlock" on Felix.acquireBundleLock

Posted by "Richard S. Hall" <he...@ungoverned.org>.
On 2/13/12 06:06 , Felix Meschberger wrote:
> Hi,
>
> I recently got a number of stacktraces, where a variable number of threads is blocked on "m_bundleLock.wait()" inside Felix.acquireBundleLock. From pure code inspection, it should not possible for the framework to stall there.
>
> It seems that Guillaume Nodet also hit this problem and added code to throw an IllegalStateException if the wait() call is interrupted (FELIX-2784 [1]).
>
> Other times I hit a similar issue in the Felix.acquireGlobalLock method for which I reported FELIX-3067 [2].
>
> It looks like both lock acquisition methods are prone to some kind of deadlock. The acquireGlobalLock method has the ability to fail by reporting such failure with a return code. The acquireBundleLock has no ability to fail (except throwing an exception).
>
> I think similar to my FELIX-3067 proposal the acquireBundleLock method should only wait a limited time and then retry acquiring the lock. Only if this fails for a number of times, acquisition should be aborted and the method fail with an IllegalStateException.
>
> WDYT ?

The biggest issue I have is that these sorts of fixes simply address the 
symptoms and may make it more difficult to figure out the root cause. 
Understandably, these issues are notoriously difficult to reproduce...

Conceptually, the code should not allow deadlocks between bundle lock 
acquirers and global lock acquirers, much like readers and writers. The 
global lock (i.e., write lock) cannot be acquired if there are any 
bundle lock holders (i.e., read lock).

Bundle lock operations should only impact one bundle at time, of course, 
this isn't always the case since synchronous event delivery introduces 
some difficult here. All multiple bundle operations (e.g., resolves, 
refreshes) must acquire the global. The only difficulty is that 
sometimes you start off with a bundle lock and need to acquire the 
global lock later because you didn't know in advance you would need the 
global lock (e.g., starting a bundle and then determining you need to 
resolve it or for resolving dynamic imports), in which case the bundle 
lock needs to be promoted to a global lock, which leads to some 
confusing code about trying to determine if there are cycles among 
bundles trying to promote bundle locks.

Yeah, it's confusing and complicated, but still it would be better to 
get a somewhat reproducible case and try to figure out if there is a 
better solution than just introducing more error cases...

But if that's not possible, then we can do what we have to do...

-> richard

>
> Regards
> Felix
>
> PS: We are still running 3.0.8 but the acquireBundleLock and acquireGlobalLock are the same as in trunk (except for Guillaume's FELIX-2784 fix).
>
> [1] https://issues.apache.org/jira/browse/FELIX-2784
> [2] https://issues.apache.org/jira/browse/FELIX-3067
>