Posted to dev@felix.apache.org by "Felix Meschberger (JIRA)" <ji...@apache.org> on 2011/08/04 13:17:27 UTC

[jira] [Created] (FELIX-3067) Prevent Deadlock Situation in Felix.acquireGlobalLock

Prevent Deadlock Situation in Felix.acquireGlobalLock
-----------------------------------------------------

                 Key: FELIX-3067
                 URL: https://issues.apache.org/jira/browse/FELIX-3067
             Project: Felix
          Issue Type: Improvement
          Components: Framework
    Affects Versions: fileinstall-3.1.10, framework-3.2.1, framework-3.2.0, framework-3.0.9, framework-3.0.8, framework-3.0.7
            Reporter: Felix Meschberger


Every now and then we encounter deadlock situations which involve the Felix.acquireGlobalLock method. In our use case we have the following aspects which contribute to this:

(a) The Apache Felix Declarative Services implementation stops components (and thus causes service unregistration) while the bundle lock is being held, because this happens in a SynchronousBundleListener while handling the STOPPING bundle event. We have to do this at that point, while the bundle is not really stopped yet, so that the bundle's components can be stopped properly.

(b) A special class loader which dynamically resolves packages, which in turn uses the global lock

(c) Eclipse Gemini Blueprint implementation which operates asynchronously

(d) synchronization in application classes

I would assume that we can often self-heal such complex deadlock situations if we let acquireGlobalLock time out. Looking at the callers of acquireGlobalLock, there already seems to be provision to handle this case, since acquireGlobalLock returns true only if the global lock has actually been acquired.
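
The caller-side handling mentioned above can be sketched roughly as follows. This is only an illustration under stated assumptions, not the Felix code: it substitutes java.util.concurrent.locks.ReentrantLock.tryLock for the framework's own lock, and the names GuardedOperation and runWithGlobalLock are hypothetical.

```java
import java.util.concurrent.TimeUnit;
import java.util.concurrent.locks.ReentrantLock;

/** Hypothetical caller that gives up cleanly when the lock cannot be acquired in time. */
final class GuardedOperation {
    // Stand-in for the framework's global lock (assumption for this sketch).
    private final ReentrantLock globalLock = new ReentrantLock();

    /**
     * Runs the action while holding the lock.
     * Returns true if the action ran, false if the lock was not acquired
     * within timeoutMillis or the waiting thread was interrupted.
     */
    boolean runWithGlobalLock(Runnable action, long timeoutMillis) {
        boolean acquired;
        try {
            acquired = globalLock.tryLock(timeoutMillis, TimeUnit.MILLISECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt(); // preserve the interrupt status
            return false;
        }
        if (!acquired) {
            // Timed out: skip the operation instead of blocking forever.
            return false;
        }
        try {
            action.run();
            return true;
        } finally {
            globalLock.unlock();
        }
    }
}
```

The point of the sketch is only that a caller which already checks the boolean result needs no further changes once acquisition can fail by timing out.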

This issue is kind of a companion to FELIX-3000 where deadlocks involve sending service registration events while holding the bundle lock.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

[jira] [Comment Edited] (FELIX-3067) Prevent Deadlock Situation in Felix.acquireGlobalLock

Posted by "Bertrand Delacretaz (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/FELIX-3067?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13504527#comment-13504527 ] 

Bertrand Delacretaz edited comment on FELIX-3067 at 11/27/12 11:04 AM:
-----------------------------------------------------------------------

I can now reliably reproduce such deadlocks using my https://github.com/bdelacretaz/osgi-stresser stress test tool - requires a few manual steps but generates deadlocks after just a few seconds in my tests.

I'm using the Sling Launchpad for this, as that contains a number of bundles that can be uninstalled/started/stopped (like crazy) to expose the problem. It looks like lots of package refreshes help expose deadlocks much more quickly.

Here's my failure scenario:
# Build Sling from http://svn.apache.org/repos/asf/sling/trunk, making sure it's using the Felix trunk's framework and scr modules (patch follows)
# Start Sling:
## cd launchpad/builder
## rm -rf sling (if needed to remove all previous state)
## java -jar target/org.apache.sling.launchpad-7-SNAPSHOT-standalone.jar
## Optionally add -Dsling.launchpad.log.level=4 to set OSGi log level to DEBUG, use with my FELIX-3785 patch to log locking operations
# Build the https://github.com/bdelacretaz/osgi-stresser bundle and install at start level 1 (so that it doesn't stop itself) from /system/console
# Connect to the tool's command line using telnet 1234

At this point the tool's stress test tasks can be started using the commands described at https://github.com/bdelacretaz/osgi-stresser - or simply use * r to start all tasks, at which point the tool should display something like

OSGI stresser> * r
sl task running - cycle time -1000 msec - levels=[3, 45, 8, 19, 30]
rp task running - cycle time 5000 msec - max wait for packages refresh=10000
ss task running - cycle time 0 msec - bundle to stop and restart=org.apache.sling.junit.core
bu task running - cycle time -1000 msec - ignored symbolic names (patterns)=[commons, org.apache.felix, slf4j, ch.x42, log, org.osgi]
up task running - cycle time 0 msec - bundle to update=org.apache.sling.junit.core
OSGI stresser> 

The tasks then do crazy things to the OSGi framework, but (IMO) all according to spec, so they should not cause any deadlocks.

The sling/logs/error.log shows what the tasks are doing, and a good way to detect the global/bundle locks deadlock is to try to refresh /system/console; it will block if the locks cannot be acquired.
                

[jira] [Commented] (FELIX-3067) Prevent Deadlock Situation in Felix.acquireGlobalLock

Posted by "Ancoron Luciferis (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/FELIX-3067?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13267290#comment-13267290 ] 

Ancoron Luciferis commented on FELIX-3067:
------------------------------------------

We have also been hit by this issue since the beginning, as we have a lot of bundles and use them inside GlassFish 3.1.x together with Apache Aries Blueprint and FileInstall.

So we would also like to see the Felix.acquireGlobalLock deadlock problem fixed.

However, I also foresee framework config parameters here for the number of "retries" and the timeout per try.
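
A rough sketch of what such configuration could look like is shown below. The property names and the GlobalLockSettings class are invented for illustration and are not existing Felix framework properties; the defaults simply mirror the 10 loops of 1 second described for FELIX-3067.patch in this thread.

```java
import java.util.Map;

/** Hypothetical holder for the two tunables: number of retries and timeout per try. */
final class GlobalLockSettings {
    // Assumed property names, made up for this sketch; not real Felix properties.
    static final String RETRIES_KEY = "example.globallock.retries";
    static final String TIMEOUT_KEY = "example.globallock.timeout.millis";

    static final int DEFAULT_RETRIES = 10;
    static final long DEFAULT_TIMEOUT_MILLIS = 1000L;

    final int retries;
    final long timeoutMillis;

    private GlobalLockSettings(int retries, long timeoutMillis) {
        this.retries = retries;
        this.timeoutMillis = timeoutMillis;
    }

    /** Reads both values from a configuration map, falling back to the defaults. */
    static GlobalLockSettings from(Map<String, String> config) {
        long retries = parsePositive(config.get(RETRIES_KEY), DEFAULT_RETRIES);
        long timeout = parsePositive(config.get(TIMEOUT_KEY), DEFAULT_TIMEOUT_MILLIS);
        return new GlobalLockSettings((int) Math.min(retries, Integer.MAX_VALUE), timeout);
    }

    /** Returns the parsed value if it is a positive number, otherwise the fallback. */
    private static long parsePositive(String value, long fallback) {
        if (value == null) {
            return fallback;
        }
        try {
            long parsed = Long.parseLong(value.trim());
            return parsed > 0 ? parsed : fallback;
        } catch (NumberFormatException e) {
            return fallback;
        }
    }
}
```

Missing, non-numeric, or non-positive values fall back to the defaults here, so a bad setting cannot turn the timed wait into an unbounded one.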
                

        

[jira] [Updated] (FELIX-3067) Prevent Deadlock Situation in Felix.acquireGlobalLock

Posted by "Felix Meschberger (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/FELIX-3067?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Felix Meschberger updated FELIX-3067:
-------------------------------------

    Attachment: FELIX-3067.patch

Attaching a patch implementing the timeout. This patch adds the following:

  * A timeout value is added to the wait call (1 second)
  * The loop is run at most 10 times

Thus after roughly ten seconds acquireGlobalLock will time out and write an error message to the Logger.
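
The general shape of such a bounded, timed wait loop can be sketched as below. This is not the attached patch and not the Felix source; TimedGlobalLock and its members are hypothetical names, and the error is written to System.err here in place of the framework Logger.

```java
/** Hypothetical owner-based lock whose acquire() gives up after a bounded number of timed waits. */
final class TimedGlobalLock {
    private final Object monitor = new Object();
    private final long waitMillis;   // timeout passed to each wait call
    private final int maxAttempts;   // maximum number of loop iterations
    private Thread owner;            // thread holding the lock, or null
    private int holdCount;           // reentrant hold count of the owner

    TimedGlobalLock(long waitMillis, int maxAttempts) {
        if (waitMillis <= 0 || maxAttempts <= 0) {
            // wait(0) would block without a timeout, so reject it up front.
            throw new IllegalArgumentException("waitMillis and maxAttempts must be positive");
        }
        this.waitMillis = waitMillis;
        this.maxAttempts = maxAttempts;
    }

    /** Returns true only if the lock has actually been acquired. */
    boolean acquire() {
        synchronized (monitor) {
            int attempts = 0;
            while (owner != null && owner != Thread.currentThread()) {
                if (attempts++ >= maxAttempts) {
                    System.err.println("Unable to acquire global lock after "
                            + maxAttempts + " attempts");
                    return false;
                }
                try {
                    // Early notifications also count as an attempt, so the
                    // total wait is at most waitMillis * maxAttempts.
                    monitor.wait(waitMillis);
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                    return false;
                }
            }
            owner = Thread.currentThread();
            holdCount++;
            return true;
        }
    }

    /** Releases one hold; wakes up waiters when the lock becomes free. */
    void release() {
        synchronized (monitor) {
            if (owner != Thread.currentThread()) {
                throw new IllegalStateException("Lock not held by current thread");
            }
            if (--holdCount == 0) {
                owner = null;
                monitor.notifyAll();
            }
        }
    }
}
```

With new TimedGlobalLock(1000, 10), a thread that cannot get the lock returns false after roughly ten seconds instead of waiting indefinitely, which matches the behaviour described above.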


        

[jira] [Comment Edited] (FELIX-3067) Prevent Deadlock Situation in Felix.acquireGlobalLock

Posted by "Bertrand Delacretaz (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/FELIX-3067?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13504527#comment-13504527 ] 

Bertrand Delacretaz edited comment on FELIX-3067 at 11/27/12 11:06 AM:
-----------------------------------------------------------------------

I can now reliably reproduce such deadlocks using my https://github.com/bdelacretaz/osgi-stresser stress test tool - requires a few manual steps but generates deadlocks after just a few seconds in my tests.

I'm using the Sling Launchpad for this, as that contains a number of bundles that can be uninstalled/started/stopped (like crazy) to expose the problem. It looks like lots of package refreshes help expose deadlocks much more quickly.

Here's my failure scenario (using a 1.6.0_37 JVM on macosx):
# Build Sling from http://svn.apache.org/repos/asf/sling/trunk, making sure it's using the Felix trunk's framework and scr modules (patch follows)
# Start Sling:
## cd launchpad/builder
## rm -rf sling (if needed to remove all previous state)
## java -jar target/org.apache.sling.launchpad-7-SNAPSHOT-standalone.jar
## Optionally add -Dsling.launchpad.log.level=4 to set OSGi log level to DEBUG, use with my FELIX-3785 patch to log locking operations
# Build the https://github.com/bdelacretaz/osgi-stresser bundle and install at start level 1 (so that it doesn't stop itself) from /system/console
# Connect to the tool's command line using telnet 1234

At this point the tool's stress test tasks can be started using the commands described at https://github.com/bdelacretaz/osgi-stresser - or simply use * r to start all tasks, at which point the tool should display something like

OSGI stresser> * r
sl task running - cycle time -1000 msec - levels=[3, 45, 8, 19, 30]
rp task running - cycle time 5000 msec - max wait for packages refresh=10000
ss task running - cycle time 0 msec - bundle to stop and restart=org.apache.sling.junit.core
bu task running - cycle time -1000 msec - ignored symbolic names (patterns)=[commons, org.apache.felix, slf4j, ch.x42, log, org.osgi]
up task running - cycle time 0 msec - bundle to update=org.apache.sling.junit.core
OSGI stresser> 

The tasks then do crazy things to the OSGi framework, but (IMO) all according to spec, so they should not cause any deadlocks - but they do.

The sling/logs/error.log shows what the tasks are doing, and a good way to detect the global/bundle locks deadlock is to try to refresh /system/console; it will block if the locks cannot be acquired.
                

[jira] [Updated] (FELIX-3067) Prevent Deadlock Situation in Felix.acquireGlobalLock

Posted by "Bertrand Delacretaz (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/FELIX-3067?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Bertrand Delacretaz updated FELIX-3067:
---------------------------------------

    Attachment: FELIX-3067-sling.patch

Patch to use the Felix framework + scr snapshots in Sling launchpad
                

[jira] [Commented] (FELIX-3067) Prevent Deadlock Situation in Felix.acquireGlobalLock

Posted by "Bertrand Delacretaz (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/FELIX-3067?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13504527#comment-13504527 ] 

Bertrand Delacretaz commented on FELIX-3067:
--------------------------------------------

5.6 deadlock, stress test tool?
jenkins tests
Sling log markers

from53 test, not enough memory for 5.6, uses plain java instead of /usr/java/jdk1.6.0_35/bin/java ?

I can now reliably reproduce such deadlocks using my https://github.com/bdelacretaz/osgi-stresser stress test tool - requires a few manual steps but generates deadlocks after just a few seconds in my tests.

I'm using the Sling Launchpad for this, as that contains a number of bundles that can be uninstalled/started/stopped (like crazy) to expose the problem. It looks like lots of package refreshes help expose deadlocks much more quickly.

Here's my failure scenario:
# Build Sling from http://svn.apache.org/repos/asf/sling/trunk, making sure it's using the Felix trunk's framework and scr modules (patch follows)
# Start Sling:
## cd launchpad/builder
## rm -rf sling (if needed to remove all previous state)
## java -jar target/org.apache.sling.launchpad-7-SNAPSHOT-standalone.jar
## Optionally add -Dsling.launchpad.log.level=4 to set OSGi log level to DEBUG, use with my FELIX-3785 patch to log locking operations
# Build the https://github.com/bdelacretaz/osgi-stresser bundle and install at start level 1 (so that it doesn't stop itself) from /system/console
# Connect to the tool's command line using telnet 1234

At this point the tool's stress test tasks can be started using the commands described at https://github.com/bdelacretaz/osgi-stresser - or simply use 

{code}
* r
{code}

to start all tasks, at which point the tool should display something like

{code}
OSGI stresser> * r
sl task running - cycle time -1000 msec - levels=[3, 45, 8, 19, 30]
rp task running - cycle time 5000 msec - max wait for packages refresh=10000
ss task running - cycle time 0 msec - bundle to stop and restart=org.apache.sling.junit.core
bu task running - cycle time -1000 msec - ignored symbolic names (patterns)=[commons, org.apache.felix, slf4j, ch.x42, log, org.osgi]
up task running - cycle time 0 msec - bundle to update=org.apache.sling.junit.core
OSGI stresser> 
{code}

The tasks then do crazy things to the OSGi framework, but (IMO) all according to spec, so they should not cause any deadlocks.

The sling/logs/error.log shows what the tasks are doing, and a good way to detect the global/bundle locks deadlock is to try to refresh /system/console; it will block if the locks cannot be acquired.
                

[jira] [Comment Edited] (FELIX-3067) Prevent Deadlock Situation in Felix.acquireGlobalLock

Posted by "Bertrand Delacretaz (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/FELIX-3067?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13504527#comment-13504527 ] 

Bertrand Delacretaz edited comment on FELIX-3067 at 11/27/12 10:54 AM:
-----------------------------------------------------------------------

I can now reliably reproduce such deadlocks using my https://github.com/bdelacretaz/osgi-stresser stress test tool - requires a few manual steps but generates deadlocks after just a few seconds in my tests.

I'm using the Sling Launchpad for this, as that contains a number of bundles that can be uninstalled/started/stopped (like crazy) to expose the problem. It looks like lots of package refreshes help expose deadlocks much more quickly.

Here's my failure scenario:
# Build Sling from http://svn.apache.org/repos/asf/sling/trunk, making sure it's using the Felix trunk's framework and scr modules (patch follows)
# Start Sling:
## cd launchpad/builder
## rm -rf sling (if needed to remove all previous state)
## java -jar target/org.apache.sling.launchpad-7-SNAPSHOT-standalone.jar
## Optionally add -Dsling.launchpad.log.level=4 to set OSGi log level to DEBUG, use with my FELIX-3785 patch to log locking operations
# Build the https://github.com/bdelacretaz/osgi-stresser bundle and install at start level 1 (so that it doesn't stop itself) from /system/console
# Connect to the tool's command line using telnet 1234

At this point the tool's stress test tasks can be started using the commands described at https://github.com/bdelacretaz/osgi-stresser - or simply use 

{code}
* r
{code}

to start all tasks, at which point the tool should display something like

{code}
OSGI stresser> * r
sl task running - cycle time -1000 msec - levels=[3, 45, 8, 19, 30]
rp task running - cycle time 5000 msec - max wait for packages refresh=10000
ss task running - cycle time 0 msec - bundle to stop and restart=org.apache.sling.junit.core
bu task running - cycle time -1000 msec - ignored symbolic names (patterns)=[commons, org.apache.felix, slf4j, ch.x42, log, org.osgi]
up task running - cycle time 0 msec - bundle to update=org.apache.sling.junit.core
OSGI stresser> 
{code}

The tasks then do crazy things to the OSGi framework, but (IMO) all according to spec, so they should not cause any deadlocks.

The sling/logs/error.log shows what the tasks are doing, and a good way to detect the global/bundle locks deadlock is to try to refresh /system/console; it will block if the locks cannot be acquired.
                