You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@felix.apache.org by David Jencks <da...@yahoo.com> on 2012/02/15 07:55:20 UTC

Concurrency problems in SCR

see FELIX-3345

We've been seeing intermittent exceptions from SCR which generally seem to look like trying to unget a service on a bundle context, BundleComponentActivator, ComponentManager, or DependencyManager that are shut down or being shut down.  I think there are 2 threads shutting bundles down at once.  I'm not making much progress investigation exactly how this happens so I'd really appreciate it if one of the experts could take a look at the stack traces in the issue and attempt to guess whether there's a real concurrency bug or if the situations we're seeing are expected when more than one thread is shutting down bundles at once, and the "don't throw an exception" patch I provided would be appropriate.

To try to pique your interest here is one of the stack traces:

Stack Dump = org.osgi.framework.ServiceException: Exception in org.apache.felix.scr.impl.manager.DelayedComponentManager.ungetService() 
at org.eclipse.osgi.internal.serviceregistry.ServiceUse.releaseService(ServiceUse.java:287) 
at org.eclipse.osgi.internal.serviceregistry.ServiceRegistrationImpl.releaseService(ServiceRegistrationImpl.java:562) 
at org.eclipse.osgi.internal.serviceregistry.ServiceRegistry.releaseServicesInUse(ServiceRegistry.java:665) 
at org.eclipse.osgi.framework.internal.core.BundleContextImpl.close(BundleContextImpl.java:91) 
at org.eclipse.osgi.framework.internal.core.BundleHost.stopWorker(BundleHost.java:514) 
at org.eclipse.osgi.framework.internal.core.AbstractBundle.suspend(AbstractBundle.java:565) 
at org.eclipse.osgi.framework.internal.core.Framework.suspendBundle(Framework.java:1161) 
at org.eclipse.osgi.framework.internal.core.StartLevelManager.decFWSL(StartLevelManager.java:595) 
at org.eclipse.osgi.framework.internal.core.StartLevelManager.doSetStartLevel(StartLevelManager.java:257) 
at org.eclipse.osgi.framework.internal.core.StartLevelManager.shutdown(StartLevelManager.java:215) 
at org.eclipse.osgi.framework.internal.core.InternalSystemBundle.suspend(InternalSystemBundle.java:284) 
at org.eclipse.osgi.framework.internal.core.Framework.shutdown(Framework.java:691) 
at org.eclipse.osgi.framework.internal.core.Framework.close(Framework.java:598) 
at org.eclipse.osgi.framework.internal.core.InternalSystemBundle$1.run(InternalSystemBundle.java:261) 
at java.lang.Thread.run(Thread.java:680) 
Caused by: java.lang.NullPointerException 
at org.apache.felix.scr.impl.BundleComponentActivator.log(BundleComponentActivator.java:614) 
at org.apache.felix.scr.impl.BundleComponentActivator.log(BundleComponentActivator.java:589) 
at org.apache.felix.scr.impl.manager.AbstractComponentManager.log(AbstractComponentManager.java:633) 
at org.apache.felix.scr.impl.manager.AbstractComponentManager$State.log(AbstractComponentManager.java:1000) 
at org.apache.felix.scr.impl.manager.AbstractComponentManager$State.ungetService(AbstractComponentManager.java:964) 
at org.apache.felix.scr.impl.manager.DelayedComponentManager.ungetService(DelayedComponentManager.java:114) 
at org.eclipse.osgi.internal.serviceregistry.ServiceUse$3.run(ServiceUse.java:277) 
at java.security.AccessController.doPrivileged(Native Method) 
at org.eclipse.osgi.internal.serviceregistry.ServiceUse.releaseService(ServiceUse.java:275) 
... 14 more 

I think the DelayedComponentManager.State here is Disposed but we've also seen this trace with state Active but the bundle context stopped (so ungetting throws an exception).

many thanks
david jencks


Re: Concurrency problems in SCR

Posted by Felix Meschberger <fm...@adobe.com>.
Hi David,

Sorry for the delay ...

Am 21.02.2012 um 23:56 schrieb David Jencks:

> Hi Felix,
> 
> Sorry for the delay, I've been struggling to come up with an independent way to reproduce this and finally have something similar in an integration test which I've attached to the jira with a stack trace.
> 
> By messing with the timing I've gotten into a state where we're trying to log on a BundleComponentActivator that has been shut down.  The activity being logged is different, but the NPE is the same.  Admittedly I added the log statement in question....
> 
> The basic scenario is two threads shutting stuff down concurrently, and an unbind method that takes a long time to return.
> 
> bundle A has a service s1
> bundle B has a service s2 with a reference to s1, and an unbind method that takes a long time.
> 
> Stop A in thread 1; the thread gets stuck in unbind of s2.
> 
> Stop B in thread 2; this shuts down B so it's BundleComponentActivator no longer can log (the NPE).
> 
> If the unbind method in thread 1 now returns, it will try to log in B's BCA which has no log service tracker.
> 
> At the moment I'm inclined to think that there is no way to prevent these kinds of races between bundles shutting down and the best thing to do is null checks in many places and catching exceptions from attempts to use stopped bundle contexts; this is what my first two patches do.

Yes, I agree -- I've also done the null-check thing a couple of places.

> 
> BTW to get my test to work at all I started upgrading scr to more recent pax bundles.  I think this would slightly simplify the code.  Also I think the more modern way to run integration tests is with the maven-failsafe-plugin rather than a separate surefire plugin execution.  Would you be interested in a patch for this?

Yes. I once tried upgrading but failed because I don't know too much of it and I used to use features, which are not directly supported any longer ...

Regards
Felix

> 
> thanks
> david jencks
> 
> 
> On Feb 16, 2012, at 12:41 AM, Felix Meschberger wrote:
> 
>> Hi David,
>> 
>> Thanks for reporting. I have seen the reported issue. I will look into it ASAP.
>> 
>> My guts feeling tells me that some issues are just a question of doing null-checks properly while others might be more involved ...
>> 
>> I assume you have a scenario where you can reproduce reliably ?
>> 
>> Regards
>> Felix
>> 
>> Am 15.02.2012 um 07:55 schrieb David Jencks:
>> 
>>> see FELIX-3345
>>> 
>>> We've been seeing intermittent exceptions from SCR which generally seem to look like trying to unget a service on a bundle context, BundleComponentActivator, ComponentManager, or DependencyManager that are shut down or being shut down.  I think there are 2 threads shutting bundles down at once.  I'm not making much progress investigation exactly how this happens so I'd really appreciate it if one of the experts could take a look at the stack traces in the issue and attempt to guess whether there's a real concurrency bug or if the situations we're seeing are expected when more than one thread is shutting down bundles at once, and the "don't throw an exception" patch I provided would be appropriate.
>>> 
>>> To try to pique your interest here is one of the stack traces:
>>> 
>>> Stack Dump = org.osgi.framework.ServiceException: Exception in org.apache.felix.scr.impl.manager.DelayedComponentManager.ungetService() 
>>> at org.eclipse.osgi.internal.serviceregistry.ServiceUse.releaseService(ServiceUse.java:287) 
>>> at org.eclipse.osgi.internal.serviceregistry.ServiceRegistrationImpl.releaseService(ServiceRegistrationImpl.java:562) 
>>> at org.eclipse.osgi.internal.serviceregistry.ServiceRegistry.releaseServicesInUse(ServiceRegistry.java:665) 
>>> at org.eclipse.osgi.framework.internal.core.BundleContextImpl.close(BundleContextImpl.java:91) 
>>> at org.eclipse.osgi.framework.internal.core.BundleHost.stopWorker(BundleHost.java:514) 
>>> at org.eclipse.osgi.framework.internal.core.AbstractBundle.suspend(AbstractBundle.java:565) 
>>> at org.eclipse.osgi.framework.internal.core.Framework.suspendBundle(Framework.java:1161) 
>>> at org.eclipse.osgi.framework.internal.core.StartLevelManager.decFWSL(StartLevelManager.java:595) 
>>> at org.eclipse.osgi.framework.internal.core.StartLevelManager.doSetStartLevel(StartLevelManager.java:257) 
>>> at org.eclipse.osgi.framework.internal.core.StartLevelManager.shutdown(StartLevelManager.java:215) 
>>> at org.eclipse.osgi.framework.internal.core.InternalSystemBundle.suspend(InternalSystemBundle.java:284) 
>>> at org.eclipse.osgi.framework.internal.core.Framework.shutdown(Framework.java:691) 
>>> at org.eclipse.osgi.framework.internal.core.Framework.close(Framework.java:598) 
>>> at org.eclipse.osgi.framework.internal.core.InternalSystemBundle$1.run(InternalSystemBundle.java:261) 
>>> at java.lang.Thread.run(Thread.java:680) 
>>> Caused by: java.lang.NullPointerException 
>>> at org.apache.felix.scr.impl.BundleComponentActivator.log(BundleComponentActivator.java:614) 
>>> at org.apache.felix.scr.impl.BundleComponentActivator.log(BundleComponentActivator.java:589) 
>>> at org.apache.felix.scr.impl.manager.AbstractComponentManager.log(AbstractComponentManager.java:633) 
>>> at org.apache.felix.scr.impl.manager.AbstractComponentManager$State.log(AbstractComponentManager.java:1000) 
>>> at org.apache.felix.scr.impl.manager.AbstractComponentManager$State.ungetService(AbstractComponentManager.java:964) 
>>> at org.apache.felix.scr.impl.manager.DelayedComponentManager.ungetService(DelayedComponentManager.java:114) 
>>> at org.eclipse.osgi.internal.serviceregistry.ServiceUse$3.run(ServiceUse.java:277) 
>>> at java.security.AccessController.doPrivileged(Native Method) 
>>> at org.eclipse.osgi.internal.serviceregistry.ServiceUse.releaseService(ServiceUse.java:275) 
>>> ... 14 more 
>>> 
>>> I think the DelayedComponentManager.State here is Disposed but we've also seen this trace with state Active but the bundle context stopped (so ungetting throws an exception).
>>> 
>>> many thanks
>>> david jencks
>>> 
>> 
> 


Re: Concurrency problems in SCR

Posted by Felix Meschberger <fm...@adobe.com>.
Hi David,

Am 02.03.2012 um 15:07 schrieb David Jencks:

> Hi,
> 
> We've now run into the problem I found in an integration test in a "real life" server.
> 
> It would be really great if you could review and either apply or comment on these patches:
> 
> FELIX-3345-2.diff fixes the initial set of concurrency problems we found
> FELIX-3345-5.diff fixes the problem found in the integration test

Great. Thanks. I have applied them in Revs. 1298266 and 1298267. Ignored the other patches assumed them to be not needed, correct ?

> 
> In addition I've updated the build from the truly antique pax components currently used to the latest ones and proposed running the integration tests with the failsafe plugin.  These patches are in
> FELIX-3371-3.diff 
> or
> FELIX-3345-6.diff
> FELIX-3371-1.diff
> FELIX-3371-2.diff

Applied FELIX-3371-3.diff in Rev. 1298268

Ignored the FELIX-3371-1.diff and FELIX-3371-2.diff as well as the missing FELIX-3345-6.diff.

> 
> I really don't know how to investigate these problems further without more advice.   At the moment I think that the best way to fix them is with null and invalidity checks, as done in my patches.  If anyone has other ideas I can try to check them out in the environment where we see these problems intermittently.

I think this is really best we can do. We cannot prevent these concurrency issues at all without risking deadlocks, which IMHO is much worse.

Thanks alot for the diligent work you put into this. Very much appreciated.

Regards
Felix

> 
> thanks!
> david jencks
> 
> 
> 
> On Feb 21, 2012, at 11:56 PM, David Jencks wrote:
> 
>> Hi Felix,
>> 
>> Sorry for the delay, I've been struggling to come up with an independent way to reproduce this and finally have something similar in an integration test which I've attached to the jira with a stack trace.
>> 
>> By messing with the timing I've gotten into a state where we're trying to log on a BundleComponentActivator that has been shut down.  The activity being logged is different, but the NPE is the same.  Admittedly I added the log statement in question....
>> 
>> The basic scenario is two threads shutting stuff down concurrently, and an unbind method that takes a long time to return.
>> 
>> bundle A has a service s1
>> bundle B has a service s2 with a reference to s1, and an unbind method that takes a long time.
>> 
>> Stop A in thread 1; the thread gets stuck in unbind of s2.
>> 
>> Stop B in thread 2; this shuts down B so it's BundleComponentActivator no longer can log (the NPE).
>> 
>> If the unbind method in thread 1 now returns, it will try to log in B's BCA which has no log service tracker.
>> 
>> At the moment I'm inclined to think that there is no way to prevent these kinds of races between bundles shutting down and the best thing to do is null checks in many places and catching exceptions from attempts to use stopped bundle contexts; this is what my first two patches do.
>> 
>> BTW to get my test to work at all I started upgrading scr to more recent pax bundles.  I think this would slightly simplify the code.  Also I think the more modern way to run integration tests is with the maven-failsafe-plugin rather than a separate surefire plugin execution.  Would you be interested in a patch for this?
>> 
>> thanks
>> david jencks
>> 
>> 
>> On Feb 16, 2012, at 12:41 AM, Felix Meschberger wrote:
>> 
>>> Hi David,
>>> 
>>> Thanks for reporting. I have seen the reported issue. I will look into it ASAP.
>>> 
>>> My guts feeling tells me that some issues are just a question of doing null-checks properly while others might be more involved ...
>>> 
>>> I assume you have a scenario where you can reproduce reliably ?
>>> 
>>> Regards
>>> Felix
>>> 
>>> Am 15.02.2012 um 07:55 schrieb David Jencks:
>>> 
>>>> see FELIX-3345
>>>> 
>>>> We've been seeing intermittent exceptions from SCR which generally seem to look like trying to unget a service on a bundle context, BundleComponentActivator, ComponentManager, or DependencyManager that are shut down or being shut down.  I think there are 2 threads shutting bundles down at once.  I'm not making much progress investigation exactly how this happens so I'd really appreciate it if one of the experts could take a look at the stack traces in the issue and attempt to guess whether there's a real concurrency bug or if the situations we're seeing are expected when more than one thread is shutting down bundles at once, and the "don't throw an exception" patch I provided would be appropriate.
>>>> 
>>>> To try to pique your interest here is one of the stack traces:
>>>> 
>>>> Stack Dump = org.osgi.framework.ServiceException: Exception in org.apache.felix.scr.impl.manager.DelayedComponentManager.ungetService() 
>>>> at org.eclipse.osgi.internal.serviceregistry.ServiceUse.releaseService(ServiceUse.java:287) 
>>>> at org.eclipse.osgi.internal.serviceregistry.ServiceRegistrationImpl.releaseService(ServiceRegistrationImpl.java:562) 
>>>> at org.eclipse.osgi.internal.serviceregistry.ServiceRegistry.releaseServicesInUse(ServiceRegistry.java:665) 
>>>> at org.eclipse.osgi.framework.internal.core.BundleContextImpl.close(BundleContextImpl.java:91) 
>>>> at org.eclipse.osgi.framework.internal.core.BundleHost.stopWorker(BundleHost.java:514) 
>>>> at org.eclipse.osgi.framework.internal.core.AbstractBundle.suspend(AbstractBundle.java:565) 
>>>> at org.eclipse.osgi.framework.internal.core.Framework.suspendBundle(Framework.java:1161) 
>>>> at org.eclipse.osgi.framework.internal.core.StartLevelManager.decFWSL(StartLevelManager.java:595) 
>>>> at org.eclipse.osgi.framework.internal.core.StartLevelManager.doSetStartLevel(StartLevelManager.java:257) 
>>>> at org.eclipse.osgi.framework.internal.core.StartLevelManager.shutdown(StartLevelManager.java:215) 
>>>> at org.eclipse.osgi.framework.internal.core.InternalSystemBundle.suspend(InternalSystemBundle.java:284) 
>>>> at org.eclipse.osgi.framework.internal.core.Framework.shutdown(Framework.java:691) 
>>>> at org.eclipse.osgi.framework.internal.core.Framework.close(Framework.java:598) 
>>>> at org.eclipse.osgi.framework.internal.core.InternalSystemBundle$1.run(InternalSystemBundle.java:261) 
>>>> at java.lang.Thread.run(Thread.java:680) 
>>>> Caused by: java.lang.NullPointerException 
>>>> at org.apache.felix.scr.impl.BundleComponentActivator.log(BundleComponentActivator.java:614) 
>>>> at org.apache.felix.scr.impl.BundleComponentActivator.log(BundleComponentActivator.java:589) 
>>>> at org.apache.felix.scr.impl.manager.AbstractComponentManager.log(AbstractComponentManager.java:633) 
>>>> at org.apache.felix.scr.impl.manager.AbstractComponentManager$State.log(AbstractComponentManager.java:1000) 
>>>> at org.apache.felix.scr.impl.manager.AbstractComponentManager$State.ungetService(AbstractComponentManager.java:964) 
>>>> at org.apache.felix.scr.impl.manager.DelayedComponentManager.ungetService(DelayedComponentManager.java:114) 
>>>> at org.eclipse.osgi.internal.serviceregistry.ServiceUse$3.run(ServiceUse.java:277) 
>>>> at java.security.AccessController.doPrivileged(Native Method) 
>>>> at org.eclipse.osgi.internal.serviceregistry.ServiceUse.releaseService(ServiceUse.java:275) 
>>>> ... 14 more 
>>>> 
>>>> I think the DelayedComponentManager.State here is Disposed but we've also seen this trace with state Active but the bundle context stopped (so ungetting throws an exception).
>>>> 
>>>> many thanks
>>>> david jencks
>>>> 
>>> 
>> 
> 


Re: Concurrency problems in SCR

Posted by David Jencks <da...@yahoo.com>.
Hi,

We've now run into the problem I found in an integration test in a "real life" server.

It would be really great if you could review and either apply or comment on these patches:

FELIX-3345-2.diff fixes the initial set of concurrency problems we found
FELIX-3345-5.diff fixes the problem found in the integration test

In addition I've updated the build from the truly antique pax components currently used to the latest ones and proposed running the integration tests with the failsafe plugin.  These patches are in
FELIX-3371-3.diff 
or
FELIX-3345-6.diff
FELIX-3371-1.diff
FELIX-3371-2.diff

I really don't know how to investigate these problems further without more advice.   At the moment I think that the best way to fix them is with null and invalidity checks, as done in my patches.  If anyone has other ideas I can try to check them out in the environment where we see these problems intermittently.

thanks!
david jencks



On Feb 21, 2012, at 11:56 PM, David Jencks wrote:

> Hi Felix,
> 
> Sorry for the delay, I've been struggling to come up with an independent way to reproduce this and finally have something similar in an integration test which I've attached to the jira with a stack trace.
> 
> By messing with the timing I've gotten into a state where we're trying to log on a BundleComponentActivator that has been shut down.  The activity being logged is different, but the NPE is the same.  Admittedly I added the log statement in question....
> 
> The basic scenario is two threads shutting stuff down concurrently, and an unbind method that takes a long time to return.
> 
> bundle A has a service s1
> bundle B has a service s2 with a reference to s1, and an unbind method that takes a long time.
> 
> Stop A in thread 1; the thread gets stuck in unbind of s2.
> 
> Stop B in thread 2; this shuts down B so it's BundleComponentActivator no longer can log (the NPE).
> 
> If the unbind method in thread 1 now returns, it will try to log in B's BCA which has no log service tracker.
> 
> At the moment I'm inclined to think that there is no way to prevent these kinds of races between bundles shutting down and the best thing to do is null checks in many places and catching exceptions from attempts to use stopped bundle contexts; this is what my first two patches do.
> 
> BTW to get my test to work at all I started upgrading scr to more recent pax bundles.  I think this would slightly simplify the code.  Also I think the more modern way to run integration tests is with the maven-failsafe-plugin rather than a separate surefire plugin execution.  Would you be interested in a patch for this?
> 
> thanks
> david jencks
> 
> 
> On Feb 16, 2012, at 12:41 AM, Felix Meschberger wrote:
> 
>> Hi David,
>> 
>> Thanks for reporting. I have seen the reported issue. I will look into it ASAP.
>> 
>> My guts feeling tells me that some issues are just a question of doing null-checks properly while others might be more involved ...
>> 
>> I assume you have a scenario where you can reproduce reliably ?
>> 
>> Regards
>> Felix
>> 
>> Am 15.02.2012 um 07:55 schrieb David Jencks:
>> 
>>> see FELIX-3345
>>> 
>>> We've been seeing intermittent exceptions from SCR which generally seem to look like trying to unget a service on a bundle context, BundleComponentActivator, ComponentManager, or DependencyManager that are shut down or being shut down.  I think there are 2 threads shutting bundles down at once.  I'm not making much progress investigation exactly how this happens so I'd really appreciate it if one of the experts could take a look at the stack traces in the issue and attempt to guess whether there's a real concurrency bug or if the situations we're seeing are expected when more than one thread is shutting down bundles at once, and the "don't throw an exception" patch I provided would be appropriate.
>>> 
>>> To try to pique your interest here is one of the stack traces:
>>> 
>>> Stack Dump = org.osgi.framework.ServiceException: Exception in org.apache.felix.scr.impl.manager.DelayedComponentManager.ungetService() 
>>> at org.eclipse.osgi.internal.serviceregistry.ServiceUse.releaseService(ServiceUse.java:287) 
>>> at org.eclipse.osgi.internal.serviceregistry.ServiceRegistrationImpl.releaseService(ServiceRegistrationImpl.java:562) 
>>> at org.eclipse.osgi.internal.serviceregistry.ServiceRegistry.releaseServicesInUse(ServiceRegistry.java:665) 
>>> at org.eclipse.osgi.framework.internal.core.BundleContextImpl.close(BundleContextImpl.java:91) 
>>> at org.eclipse.osgi.framework.internal.core.BundleHost.stopWorker(BundleHost.java:514) 
>>> at org.eclipse.osgi.framework.internal.core.AbstractBundle.suspend(AbstractBundle.java:565) 
>>> at org.eclipse.osgi.framework.internal.core.Framework.suspendBundle(Framework.java:1161) 
>>> at org.eclipse.osgi.framework.internal.core.StartLevelManager.decFWSL(StartLevelManager.java:595) 
>>> at org.eclipse.osgi.framework.internal.core.StartLevelManager.doSetStartLevel(StartLevelManager.java:257) 
>>> at org.eclipse.osgi.framework.internal.core.StartLevelManager.shutdown(StartLevelManager.java:215) 
>>> at org.eclipse.osgi.framework.internal.core.InternalSystemBundle.suspend(InternalSystemBundle.java:284) 
>>> at org.eclipse.osgi.framework.internal.core.Framework.shutdown(Framework.java:691) 
>>> at org.eclipse.osgi.framework.internal.core.Framework.close(Framework.java:598) 
>>> at org.eclipse.osgi.framework.internal.core.InternalSystemBundle$1.run(InternalSystemBundle.java:261) 
>>> at java.lang.Thread.run(Thread.java:680) 
>>> Caused by: java.lang.NullPointerException 
>>> at org.apache.felix.scr.impl.BundleComponentActivator.log(BundleComponentActivator.java:614) 
>>> at org.apache.felix.scr.impl.BundleComponentActivator.log(BundleComponentActivator.java:589) 
>>> at org.apache.felix.scr.impl.manager.AbstractComponentManager.log(AbstractComponentManager.java:633) 
>>> at org.apache.felix.scr.impl.manager.AbstractComponentManager$State.log(AbstractComponentManager.java:1000) 
>>> at org.apache.felix.scr.impl.manager.AbstractComponentManager$State.ungetService(AbstractComponentManager.java:964) 
>>> at org.apache.felix.scr.impl.manager.DelayedComponentManager.ungetService(DelayedComponentManager.java:114) 
>>> at org.eclipse.osgi.internal.serviceregistry.ServiceUse$3.run(ServiceUse.java:277) 
>>> at java.security.AccessController.doPrivileged(Native Method) 
>>> at org.eclipse.osgi.internal.serviceregistry.ServiceUse.releaseService(ServiceUse.java:275) 
>>> ... 14 more 
>>> 
>>> I think the DelayedComponentManager.State here is Disposed but we've also seen this trace with state Active but the bundle context stopped (so ungetting throws an exception).
>>> 
>>> many thanks
>>> david jencks
>>> 
>> 
> 


Re: Concurrency problems in SCR

Posted by David Jencks <da...@yahoo.com>.
Hi Felix,

Sorry for the delay, I've been struggling to come up with an independent way to reproduce this and finally have something similar in an integration test which I've attached to the jira with a stack trace.

By messing with the timing I've gotten into a state where we're trying to log on a BundleComponentActivator that has been shut down.  The activity being logged is different, but the NPE is the same.  Admittedly I added the log statement in question....

The basic scenario is two threads shutting stuff down concurrently, and an unbind method that takes a long time to return.

bundle A has a service s1
bundle B has a service s2 with a reference to s1, and an unbind method that takes a long time.

Stop A in thread 1; the thread gets stuck in unbind of s2.

Stop B in thread 2; this shuts down B so it's BundleComponentActivator no longer can log (the NPE).

If the unbind method in thread 1 now returns, it will try to log in B's BCA which has no log service tracker.

At the moment I'm inclined to think that there is no way to prevent these kinds of races between bundles shutting down and the best thing to do is null checks in many places and catching exceptions from attempts to use stopped bundle contexts; this is what my first two patches do.

BTW to get my test to work at all I started upgrading scr to more recent pax bundles.  I think this would slightly simplify the code.  Also I think the more modern way to run integration tests is with the maven-failsafe-plugin rather than a separate surefire plugin execution.  Would you be interested in a patch for this?

thanks
david jencks


On Feb 16, 2012, at 12:41 AM, Felix Meschberger wrote:

> Hi David,
> 
> Thanks for reporting. I have seen the reported issue. I will look into it ASAP.
> 
> My guts feeling tells me that some issues are just a question of doing null-checks properly while others might be more involved ...
> 
> I assume you have a scenario where you can reproduce reliably ?
> 
> Regards
> Felix
> 
> Am 15.02.2012 um 07:55 schrieb David Jencks:
> 
>> see FELIX-3345
>> 
>> We've been seeing intermittent exceptions from SCR which generally seem to look like trying to unget a service on a bundle context, BundleComponentActivator, ComponentManager, or DependencyManager that are shut down or being shut down.  I think there are 2 threads shutting bundles down at once.  I'm not making much progress investigation exactly how this happens so I'd really appreciate it if one of the experts could take a look at the stack traces in the issue and attempt to guess whether there's a real concurrency bug or if the situations we're seeing are expected when more than one thread is shutting down bundles at once, and the "don't throw an exception" patch I provided would be appropriate.
>> 
>> To try to pique your interest here is one of the stack traces:
>> 
>> Stack Dump = org.osgi.framework.ServiceException: Exception in org.apache.felix.scr.impl.manager.DelayedComponentManager.ungetService() 
>> at org.eclipse.osgi.internal.serviceregistry.ServiceUse.releaseService(ServiceUse.java:287) 
>> at org.eclipse.osgi.internal.serviceregistry.ServiceRegistrationImpl.releaseService(ServiceRegistrationImpl.java:562) 
>> at org.eclipse.osgi.internal.serviceregistry.ServiceRegistry.releaseServicesInUse(ServiceRegistry.java:665) 
>> at org.eclipse.osgi.framework.internal.core.BundleContextImpl.close(BundleContextImpl.java:91) 
>> at org.eclipse.osgi.framework.internal.core.BundleHost.stopWorker(BundleHost.java:514) 
>> at org.eclipse.osgi.framework.internal.core.AbstractBundle.suspend(AbstractBundle.java:565) 
>> at org.eclipse.osgi.framework.internal.core.Framework.suspendBundle(Framework.java:1161) 
>> at org.eclipse.osgi.framework.internal.core.StartLevelManager.decFWSL(StartLevelManager.java:595) 
>> at org.eclipse.osgi.framework.internal.core.StartLevelManager.doSetStartLevel(StartLevelManager.java:257) 
>> at org.eclipse.osgi.framework.internal.core.StartLevelManager.shutdown(StartLevelManager.java:215) 
>> at org.eclipse.osgi.framework.internal.core.InternalSystemBundle.suspend(InternalSystemBundle.java:284) 
>> at org.eclipse.osgi.framework.internal.core.Framework.shutdown(Framework.java:691) 
>> at org.eclipse.osgi.framework.internal.core.Framework.close(Framework.java:598) 
>> at org.eclipse.osgi.framework.internal.core.InternalSystemBundle$1.run(InternalSystemBundle.java:261) 
>> at java.lang.Thread.run(Thread.java:680) 
>> Caused by: java.lang.NullPointerException 
>> at org.apache.felix.scr.impl.BundleComponentActivator.log(BundleComponentActivator.java:614) 
>> at org.apache.felix.scr.impl.BundleComponentActivator.log(BundleComponentActivator.java:589) 
>> at org.apache.felix.scr.impl.manager.AbstractComponentManager.log(AbstractComponentManager.java:633) 
>> at org.apache.felix.scr.impl.manager.AbstractComponentManager$State.log(AbstractComponentManager.java:1000) 
>> at org.apache.felix.scr.impl.manager.AbstractComponentManager$State.ungetService(AbstractComponentManager.java:964) 
>> at org.apache.felix.scr.impl.manager.DelayedComponentManager.ungetService(DelayedComponentManager.java:114) 
>> at org.eclipse.osgi.internal.serviceregistry.ServiceUse$3.run(ServiceUse.java:277) 
>> at java.security.AccessController.doPrivileged(Native Method) 
>> at org.eclipse.osgi.internal.serviceregistry.ServiceUse.releaseService(ServiceUse.java:275) 
>> ... 14 more 
>> 
>> I think the DelayedComponentManager.State here is Disposed but we've also seen this trace with state Active but the bundle context stopped (so ungetting throws an exception).
>> 
>> many thanks
>> david jencks
>> 
> 


Re: Concurrency problems in SCR

Posted by Felix Meschberger <fm...@adobe.com>.
Hi David,

Thanks for reporting. I have seen the reported issue. I will look into it ASAP.

My guts feeling tells me that some issues are just a question of doing null-checks properly while others might be more involved ...

I assume you have a scenario where you can reproduce reliably ?

Regards
Felix

Am 15.02.2012 um 07:55 schrieb David Jencks:

> see FELIX-3345
> 
> We've been seeing intermittent exceptions from SCR which generally seem to look like trying to unget a service on a bundle context, BundleComponentActivator, ComponentManager, or DependencyManager that are shut down or being shut down.  I think there are 2 threads shutting bundles down at once.  I'm not making much progress investigation exactly how this happens so I'd really appreciate it if one of the experts could take a look at the stack traces in the issue and attempt to guess whether there's a real concurrency bug or if the situations we're seeing are expected when more than one thread is shutting down bundles at once, and the "don't throw an exception" patch I provided would be appropriate.
> 
> To try to pique your interest here is one of the stack traces:
> 
> Stack Dump = org.osgi.framework.ServiceException: Exception in org.apache.felix.scr.impl.manager.DelayedComponentManager.ungetService() 
> at org.eclipse.osgi.internal.serviceregistry.ServiceUse.releaseService(ServiceUse.java:287) 
> at org.eclipse.osgi.internal.serviceregistry.ServiceRegistrationImpl.releaseService(ServiceRegistrationImpl.java:562) 
> at org.eclipse.osgi.internal.serviceregistry.ServiceRegistry.releaseServicesInUse(ServiceRegistry.java:665) 
> at org.eclipse.osgi.framework.internal.core.BundleContextImpl.close(BundleContextImpl.java:91) 
> at org.eclipse.osgi.framework.internal.core.BundleHost.stopWorker(BundleHost.java:514) 
> at org.eclipse.osgi.framework.internal.core.AbstractBundle.suspend(AbstractBundle.java:565) 
> at org.eclipse.osgi.framework.internal.core.Framework.suspendBundle(Framework.java:1161) 
> at org.eclipse.osgi.framework.internal.core.StartLevelManager.decFWSL(StartLevelManager.java:595) 
> at org.eclipse.osgi.framework.internal.core.StartLevelManager.doSetStartLevel(StartLevelManager.java:257) 
> at org.eclipse.osgi.framework.internal.core.StartLevelManager.shutdown(StartLevelManager.java:215) 
> at org.eclipse.osgi.framework.internal.core.InternalSystemBundle.suspend(InternalSystemBundle.java:284) 
> at org.eclipse.osgi.framework.internal.core.Framework.shutdown(Framework.java:691) 
> at org.eclipse.osgi.framework.internal.core.Framework.close(Framework.java:598) 
> at org.eclipse.osgi.framework.internal.core.InternalSystemBundle$1.run(InternalSystemBundle.java:261) 
> at java.lang.Thread.run(Thread.java:680) 
> Caused by: java.lang.NullPointerException 
> at org.apache.felix.scr.impl.BundleComponentActivator.log(BundleComponentActivator.java:614) 
> at org.apache.felix.scr.impl.BundleComponentActivator.log(BundleComponentActivator.java:589) 
> at org.apache.felix.scr.impl.manager.AbstractComponentManager.log(AbstractComponentManager.java:633) 
> at org.apache.felix.scr.impl.manager.AbstractComponentManager$State.log(AbstractComponentManager.java:1000) 
> at org.apache.felix.scr.impl.manager.AbstractComponentManager$State.ungetService(AbstractComponentManager.java:964) 
> at org.apache.felix.scr.impl.manager.DelayedComponentManager.ungetService(DelayedComponentManager.java:114) 
> at org.eclipse.osgi.internal.serviceregistry.ServiceUse$3.run(ServiceUse.java:277) 
> at java.security.AccessController.doPrivileged(Native Method) 
> at org.eclipse.osgi.internal.serviceregistry.ServiceUse.releaseService(ServiceUse.java:275) 
> ... 14 more 
> 
> I think the DelayedComponentManager.State here is Disposed but we've also seen this trace with state Active but the bundle context stopped (so ungetting throws an exception).
> 
> many thanks
> david jencks
>