You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@felix.apache.org by Niclas Hedhman <ni...@hedhman.org> on 2008/01/29 05:02:44 UTC

Deadlock in Felix

Hi,

I have been away for a while and don't really know if this has been looked at.

The story can be found at [1], but goes roughly like this;

The guy is starting a bunch of bundles on Felix, incl Pax Logging and Felix 
Configuration Admin.
The Felix internals will find the LogService implementation and use it for its 
own logging needs.
At the same time, the Configuration Admin gives Pax Logging a new 
Configuration, which triggers loading of classes.

Now, the StartLevel thread locks a o.a.f.module.ModuleFactoryImpl and blocks 
on the o.a.log4j.spi.RootLogger. The Configuration thread will do the 
opposite...

The immediate question; Any good ideas on how to get around this?
The only one I can think of is to start a new thread and delay the 
Configuration update() X milliseconds, but that sounds fairly britle.

The larger question is more like; Isn't it necessary for the framework to be 
lock-free when calling out to code outside its control??
(that was something I concluded with my own similar framework back in 1999)


[1] http://issues.ops4j.org/jira/browse/PAXLOGGING-20


Cheers
-- 
Niclas Hedhman, Software Developer

I  live here; http://tinyurl.com/2qq9er
I  work here; http://tinyurl.com/2ymelc
I relax here; http://tinyurl.com/2cgsug

Re: Deadlock in Felix

Posted by Karl Pauls <ka...@gmail.com>.
> On Monday 11 February 2008 16:17, Karl Pauls wrote:
> > However, do you (and others) think
> > it is acceptable to deliver log events asynchronously or at least
> > later then the actual point they occur?
>
> If you are only worried about Pax Logging case, then you need to do nothing.
> The locking in Log4J is severe enough for us to consider it "our problem" and
> I will probably later in the week do the "buffering" for you, which will be
> in proper order.
>
> I am not sure if there are any other Log Services out there, which will have
> similar problems.
>
> The broader concern probably still remains; Should the framework have any
> locks in place when calling outside itself, whether being log message, events
> or callbacks? The spec is pretty silent about it, and I have no clue how
> Felix measure up against the others on this point. But I think it is a valid
> question to ask both here and to the spec guys.

Well, we only hold locks in the case of log messages. For the other calls we
should be lock free .

regards,

Karl

> Cheers
> --
> Niclas Hedhman, Software Developer
>
> I  live here; http://tinyurl.com/2qq9er
> I  work here; http://tinyurl.com/2ymelc
> I relax here; http://tinyurl.com/2cgsug
>



-- 
Karl Pauls
karlpauls@gmail.com

Re: Deadlock in Felix

Posted by "Richard S. Hall" <he...@ungoverned.org>.
Niclas Hedhman wrote:
> The broader concern probably still remains; Should the framework have any 
> locks in place when calling outside itself, whether being log message, events 
> or callbacks? The spec is pretty silent about it, and I have no clue how 
> Felix measure up against the others on this point. But I think it is a valid 
> question to ask both here and to the spec guys.

No, it shouldn't. And I do think the spec mentions this somewhere. 
However, it might not always be possible to achieve. Still, we should 
review each callback point and verify its correctness.

-> richard

Re: Deadlock in Felix

Posted by Niclas Hedhman <ni...@hedhman.org>.
On Monday 11 February 2008 16:17, Karl Pauls wrote:
> However, do you (and others) think
> it is acceptable to deliver log events asynchronously or at least
> later then the actual point they occur?

If you are only worried about Pax Logging case, then you need to do nothing. 
The locking in Log4J is severe enough for us to consider it "our problem" and 
I will probably later in the week do the "buffering" for you, which will be 
in proper order.

I am not sure if there are any other Log Services out there, which will have 
similar problems.

The broader concern probably still remains; Should the framework have any 
locks in place when calling outside itself, whether being log message, events 
or callbacks? The spec is pretty silent about it, and I have no clue how 
Felix measure up against the others on this point. But I think it is a valid 
question to ask both here and to the spec guys.


Cheers
-- 
Niclas Hedhman, Software Developer

I  live here; http://tinyurl.com/2qq9er
I  work here; http://tinyurl.com/2ymelc
I relax here; http://tinyurl.com/2cgsug

Re: Deadlock in Felix

Posted by Karl Pauls <ka...@gmail.com>.
On Feb 11, 2008 9:29 AM, Felix Meschberger <fm...@gmail.com> wrote:
> Hi,
>
> Am Montag, den 11.02.2008, 09:17 +0100 schrieb Karl Pauls:
> > > Later on, I think I'll change Pax Logging to move all incoming requests (log
> > > and config) to its own thread to prevent the whole issue all together.
> >
> > Good - so we don't have to hurry. However, do you (and others) think
> > it is acceptable to deliver log events asynchronously or at least
> > later then the actual point they occur? We have the eventdispatching
> > thread around anyways and I think it should be doable to use it to
> > deliver events. But I will only spend time looking into this if the
> > general opinion is that it is worthwhile ...
>
> I am not sure, whether this would be a good solution, because you would
> then have log messages from the application and log messages from the
> framework being out of think with regard to actuall order of processing.
> This could cause some headaches when trying to interpret the log files.
>
> In addition, I tend to think, that the way Log4J locks itself and by
> loading classes inside that lock potentially the complete system is kind
> of problematic. Not sure how NLog4J and Logback act in this respect.
>
> On the Felix side, we might come up with a solution where we may
> actively delay logging when inside some synchronized/locked code.
> Probably the only really problematic location for logging is inside the
> ClassLoader support. Could we gather messages there to be logged after
> releasing any locks ?

Sure, we could do that. Let's see whether there are other opinions
else I'll give it a shot.

regards,

Karl

> Regards
> Felix
>
>



-- 
Karl Pauls
karlpauls@gmail.com

Re: Deadlock in Felix

Posted by Felix Meschberger <fm...@gmail.com>.
Hi,

Am Montag, den 11.02.2008, 09:17 +0100 schrieb Karl Pauls:
> > Later on, I think I'll change Pax Logging to move all incoming requests (log
> > and config) to its own thread to prevent the whole issue all together.
> 
> Good - so we don't have to hurry. However, do you (and others) think
> it is acceptable to deliver log events asynchronously or at least
> later then the actual point they occur? We have the eventdispatching
> thread around anyways and I think it should be doable to use it to
> deliver events. But I will only spend time looking into this if the
> general opinion is that it is worthwhile ...

I am not sure, whether this would be a good solution, because you would
then have log messages from the application and log messages from the
framework being out of think with regard to actuall order of processing.
This could cause some headaches when trying to interpret the log files.

In addition, I tend to think, that the way Log4J locks itself and by
loading classes inside that lock potentially the complete system is kind
of problematic. Not sure how NLog4J and Logback act in this respect.

On the Felix side, we might come up with a solution where we may
actively delay logging when inside some synchronized/locked code.
Probably the only really problematic location for logging is inside the
ClassLoader support. Could we gather messages there to be logged after
releasing any locks ?

Regards
Felix


Re: Deadlock in Felix

Posted by Karl Pauls <ka...@gmail.com>.
> On Saturday 02 February 2008 08:20, Karl Pauls wrote:
>
> > So would it be acceptable to deliver log calls asynchronously? If so I
> > can probably make that change quickly ...
>
> For now, the immediate problem solved by changing framework loglevel to INFO.

Ok.

> Later on, I think I'll change Pax Logging to move all incoming requests (log
> and config) to its own thread to prevent the whole issue all together.

Good - so we don't have to hurry. However, do you (and others) think
it is acceptable to deliver log events asynchronously or at least
later then the actual point they occur? We have the eventdispatching
thread around anyways and I think it should be doable to use it to
deliver events. But I will only spend time looking into this if the
general opinion is that it is worthwhile ...

regards,

Karl

> Cheers
> --
> Niclas Hedhman, Software Developer
>
> I  live here; http://tinyurl.com/2qq9er
> I  work here; http://tinyurl.com/2ymelc
> I relax here; http://tinyurl.com/2cgsug
>



-- 
Karl Pauls
karlpauls@gmail.com

Re: Deadlock in Felix

Posted by Niclas Hedhman <ni...@hedhman.org>.
On Saturday 02 February 2008 08:20, Karl Pauls wrote:

> So would it be acceptable to deliver log calls asynchronously? If so I
> can probably make that change quickly ...

For now, the immediate problem solved by changing framework loglevel to INFO.

Later on, I think I'll change Pax Logging to move all incoming requests (log 
and config) to its own thread to prevent the whole issue all together.


Cheers
-- 
Niclas Hedhman, Software Developer

I  live here; http://tinyurl.com/2qq9er
I  work here; http://tinyurl.com/2ymelc
I relax here; http://tinyurl.com/2cgsug

Re: Deadlock in Felix

Posted by Karl Pauls <ka...@gmail.com>.
On Jan 30, 2008 11:31 AM, Stuart McCulloch <st...@jayway.net> wrote:
> On 30/01/2008, Felix Meschberger <fm...@gmail.com> wrote:
> >
> > Hi Niclas,
> >
> > The problem is (a) the generous synchronisation of Log4J and (b) locking
> > used by stuff used for class loading. In our projects we regularly face
> > issues between Log4J and our ClassLoader implementations synchronizing
> > on ClassLoader.loadClass().
> >
> > The deadlock occurrs because both parties - framework and Log4J - lock
> > "big" parts of their code and call to code outside of their scope while
> > being locked: the framework calls the LogService outside of the
> > framework and Log4J calls into class loading outside of Log4J.
> >
> > On solution, I could imagine, is not using Log4J, which may or may not
> > be an option. Maybe SLF4J or Logback could be an option here ? [ In
> > Sling we actually use Logback as a logging backend for our LogService
> > implementation ]
> >
> > Another solution would be to enhance the framework Logger to diable the
> > use of a LogService. E.g. by defining a framework property, which when
> > set, causes the Logger to never use the LogService.
> >
> > Both solutions don't sound right ...
>
>
> other possible solutions:
>
> a)  have a separate thread make the LogService call (fed from a queue)
>      although you'd have to be careful not to introduce other deadlocks

So would it be acceptable to deliver log calls asynchronously? If so I
can probably make that change quickly ...

regards,

Karl

> b)  delay sending log messages from critical sections of the framework
>      ie. log to a buffer, then send the messages when it's safe to do so
>
>
> Regards
> > Felix
> >
> > Am Mittwoch, den 30.01.2008, 13:51 +0800 schrieb Niclas Hedhman:
> > > On Tuesday 29 January 2008 16:55, Karl Pauls wrote:
> > > > Could you have them retry using Felix 1.0.3? This might be related to
> > > > some of the concurrency things we fixed.
> > > >
> > > > In case they can not be bothered retrying on Felix 1.0.3 then maybe
> > > > they can provide a minimal config file that only uses publicly
> > > > available bundles and has this issue (then I can look into it).
> > >
> > > I am looking at the code in the trunk, and it appears that the locking
> > that
> > > triggers this is in place.
> > >
> > > As always, it is a bit tricky to setup threading problems. SO, I would
> > like to
> > > run a "head exercise" first.
> > >
> > > 1. The Starter thread "FelixStartLevel" locks the ModuleFactoryImpl
> > instance
> > > in R4SearchPolicyCore.resolve().
> > >
> > > 2. The Configuration Admin thread "Configuration Updater" calls the
> > LogService
> > > with the Configuration instance, which leads to Log4J locks on its own
> > > RootLogger and  calls ClassLoader.loadClass() on something found in the
> > > configuration. This leads to trying to acquire the ModuleFactoryImpl
> > lock
> > > either in R4SearchPolicyCore.resolve() or in the provided stack trace
> > > R4SearchPolicyCore.getInUseCandidates() due to a ClassNotFound in the
> > > previous step.
> > >
> > > 3. The "FelixStartLevel" thread reaches
> > > m_logger.log(Logger.LOG_DEBUG, "WIRE: " + wires[wireIdx]);
> > > in R4SearchPolicyCore.createWires() and he log() method will try to
> > acquire
> > > the RootLogger lock.
> > >
> > > 4. DEADLOCK.
> > >
> > >
> > > I agree this is very special to the LogService, since Felix binds to it
> > and
> > > uses it for its internal use, and the responsibility is across two
> > different
> > > systems. Suggestions are welcome.
> > >
> > >
> > > Cheers
> >
> >
>
>
> --
> Cheers, Stuart
>



-- 
Karl Pauls
karlpauls@gmail.com

Re: Deadlock in Felix

Posted by Stuart McCulloch <st...@jayway.net>.
On 30/01/2008, Felix Meschberger <fm...@gmail.com> wrote:
>
> Hi Niclas,
>
> The problem is (a) the generous synchronisation of Log4J and (b) locking
> used by stuff used for class loading. In our projects we regularly face
> issues between Log4J and our ClassLoader implementations synchronizing
> on ClassLoader.loadClass().
>
> The deadlock occurrs because both parties - framework and Log4J - lock
> "big" parts of their code and call to code outside of their scope while
> being locked: the framework calls the LogService outside of the
> framework and Log4J calls into class loading outside of Log4J.
>
> On solution, I could imagine, is not using Log4J, which may or may not
> be an option. Maybe SLF4J or Logback could be an option here ? [ In
> Sling we actually use Logback as a logging backend for our LogService
> implementation ]
>
> Another solution would be to enhance the framework Logger to diable the
> use of a LogService. E.g. by defining a framework property, which when
> set, causes the Logger to never use the LogService.
>
> Both solutions don't sound right ...


other possible solutions:

a)  have a separate thread make the LogService call (fed from a queue)
     although you'd have to be careful not to introduce other deadlocks

b)  delay sending log messages from critical sections of the framework
     ie. log to a buffer, then send the messages when it's safe to do so

Regards
> Felix
>
> Am Mittwoch, den 30.01.2008, 13:51 +0800 schrieb Niclas Hedhman:
> > On Tuesday 29 January 2008 16:55, Karl Pauls wrote:
> > > Could you have them retry using Felix 1.0.3? This might be related to
> > > some of the concurrency things we fixed.
> > >
> > > In case they can not be bothered retrying on Felix 1.0.3 then maybe
> > > they can provide a minimal config file that only uses publicly
> > > available bundles and has this issue (then I can look into it).
> >
> > I am looking at the code in the trunk, and it appears that the locking
> that
> > triggers this is in place.
> >
> > As always, it is a bit tricky to setup threading problems. SO, I would
> like to
> > run a "head exercise" first.
> >
> > 1. The Starter thread "FelixStartLevel" locks the ModuleFactoryImpl
> instance
> > in R4SearchPolicyCore.resolve().
> >
> > 2. The Configuration Admin thread "Configuration Updater" calls the
> LogService
> > with the Configuration instance, which leads to Log4J locks on its own
> > RootLogger and  calls ClassLoader.loadClass() on something found in the
> > configuration. This leads to trying to acquire the ModuleFactoryImpl
> lock
> > either in R4SearchPolicyCore.resolve() or in the provided stack trace
> > R4SearchPolicyCore.getInUseCandidates() due to a ClassNotFound in the
> > previous step.
> >
> > 3. The "FelixStartLevel" thread reaches
> > m_logger.log(Logger.LOG_DEBUG, "WIRE: " + wires[wireIdx]);
> > in R4SearchPolicyCore.createWires() and he log() method will try to
> acquire
> > the RootLogger lock.
> >
> > 4. DEADLOCK.
> >
> >
> > I agree this is very special to the LogService, since Felix binds to it
> and
> > uses it for its internal use, and the responsibility is across two
> different
> > systems. Suggestions are welcome.
> >
> >
> > Cheers
>
>


-- 
Cheers, Stuart

Re: Deadlock in Felix

Posted by Carsten Ziegeler <cz...@apache.org>.
Felix Meschberger wrote:
> Hi Niclas,
> 
> The problem is (a) the generous synchronisation of Log4J and (b) locking
> used by stuff used for class loading. In our projects we regularly face
> issues between Log4J and our ClassLoader implementations synchronizing
> on ClassLoader.loadClass().
> 
> The deadlock occurrs because both parties - framework and Log4J - lock
> "big" parts of their code and call to code outside of their scope while
> being locked: the framework calls the LogService outside of the
> framework and Log4J calls into class loading outside of Log4J.
> 
> On solution, I could imagine, is not using Log4J, which may or may not
> be an option. Maybe SLF4J or Logback could be an option here ? [ In
> Sling we actually use Logback as a logging backend for our LogService
> implementation ]
> 
After I read Niclas problems, I think that I had this problem once or twice
in Sling as well - a very rare deadlock during a logging statement. I 
never investigated further as a restart solved the problem :) and it 
occured only for one day two weeks ago.
So I fear switching to a different logging implementation makes things 
better but not optimal :)


Carsten
-- 
Carsten Ziegeler
cziegeler@apache.org

Re: Deadlock in Felix

Posted by Felix Meschberger <fm...@gmail.com>.
Hi Niclas,

The problem is (a) the generous synchronisation of Log4J and (b) locking
used by stuff used for class loading. In our projects we regularly face
issues between Log4J and our ClassLoader implementations synchronizing
on ClassLoader.loadClass().

The deadlock occurrs because both parties - framework and Log4J - lock
"big" parts of their code and call to code outside of their scope while
being locked: the framework calls the LogService outside of the
framework and Log4J calls into class loading outside of Log4J.

On solution, I could imagine, is not using Log4J, which may or may not
be an option. Maybe SLF4J or Logback could be an option here ? [ In
Sling we actually use Logback as a logging backend for our LogService
implementation ]

Another solution would be to enhance the framework Logger to diable the
use of a LogService. E.g. by defining a framework property, which when
set, causes the Logger to never use the LogService.

Both solutions don't sound right ...

Regards
Felix

Am Mittwoch, den 30.01.2008, 13:51 +0800 schrieb Niclas Hedhman:
> On Tuesday 29 January 2008 16:55, Karl Pauls wrote:
> > Could you have them retry using Felix 1.0.3? This might be related to
> > some of the concurrency things we fixed.
> >
> > In case they can not be bothered retrying on Felix 1.0.3 then maybe
> > they can provide a minimal config file that only uses publicly
> > available bundles and has this issue (then I can look into it).
> 
> I am looking at the code in the trunk, and it appears that the locking that 
> triggers this is in place.
> 
> As always, it is a bit tricky to setup threading problems. SO, I would like to 
> run a "head exercise" first.
> 
> 1. The Starter thread "FelixStartLevel" locks the ModuleFactoryImpl instance 
> in R4SearchPolicyCore.resolve().
> 
> 2. The Configuration Admin thread "Configuration Updater" calls the LogService 
> with the Configuration instance, which leads to Log4J locks on its own 
> RootLogger and  calls ClassLoader.loadClass() on something found in the 
> configuration. This leads to trying to acquire the ModuleFactoryImpl lock 
> either in R4SearchPolicyCore.resolve() or in the provided stack trace 
> R4SearchPolicyCore.getInUseCandidates() due to a ClassNotFound in the 
> previous step.
> 
> 3. The "FelixStartLevel" thread reaches 
> m_logger.log(Logger.LOG_DEBUG, "WIRE: " + wires[wireIdx]);
> in R4SearchPolicyCore.createWires() and he log() method will try to acquire 
> the RootLogger lock.
> 
> 4. DEADLOCK.
> 
> 
> I agree this is very special to the LogService, since Felix binds to it and 
> uses it for its internal use, and the responsibility is across two different 
> systems. Suggestions are welcome.
> 
> 
> Cheers


Re: Deadlock in Felix

Posted by Niclas Hedhman <ni...@hedhman.org>.
On Tuesday 29 January 2008 16:55, Karl Pauls wrote:
> Could you have them retry using Felix 1.0.3? This might be related to
> some of the concurrency things we fixed.
>
> In case they can not be bothered retrying on Felix 1.0.3 then maybe
> they can provide a minimal config file that only uses publicly
> available bundles and has this issue (then I can look into it).

I am looking at the code in the trunk, and it appears that the locking that 
triggers this is in place.

As always, it is a bit tricky to setup threading problems. SO, I would like to 
run a "head exercise" first.

1. The Starter thread "FelixStartLevel" locks the ModuleFactoryImpl instance 
in R4SearchPolicyCore.resolve().

2. The Configuration Admin thread "Configuration Updater" calls the LogService 
with the Configuration instance, which leads to Log4J locks on its own 
RootLogger and  calls ClassLoader.loadClass() on something found in the 
configuration. This leads to trying to acquire the ModuleFactoryImpl lock 
either in R4SearchPolicyCore.resolve() or in the provided stack trace 
R4SearchPolicyCore.getInUseCandidates() due to a ClassNotFound in the 
previous step.

3. The "FelixStartLevel" thread reaches 
m_logger.log(Logger.LOG_DEBUG, "WIRE: " + wires[wireIdx]);
in R4SearchPolicyCore.createWires() and he log() method will try to acquire 
the RootLogger lock.

4. DEADLOCK.


I agree this is very special to the LogService, since Felix binds to it and 
uses it for its internal use, and the responsibility is across two different 
systems. Suggestions are welcome.


Cheers
-- 
Niclas Hedhman, Software Developer

I  live here; http://tinyurl.com/2qq9er
I  work here; http://tinyurl.com/2ymelc
I relax here; http://tinyurl.com/2cgsug

Re: Deadlock in Felix

Posted by Karl Pauls <ka...@gmail.com>.
Could you have them retry using Felix 1.0.3? This might be related to
some of the concurrency things we fixed.

In case they can not be bothered retrying on Felix 1.0.3 then maybe
they can provide a minimal config file that only uses publicly
available bundles and has this issue (then I can look into it).

regards,

Karl

> Hi,
>
> I have been away for a while and don't really know if this has been looked at.
>
> The story can be found at [1], but goes roughly like this;
>
> The guy is starting a bunch of bundles on Felix, incl Pax Logging and Felix
> Configuration Admin.
> The Felix internals will find the LogService implementation and use it for its
> own logging needs.
> At the same time, the Configuration Admin gives Pax Logging a new
> Configuration, which triggers loading of classes.
>
> Now, the StartLevel thread locks a o.a.f.module.ModuleFactoryImpl and blocks
> on the o.a.log4j.spi.RootLogger. The Configuration thread will do the
> opposite...
>
> The immediate question; Any good ideas on how to get around this?
> The only one I can think of is to start a new thread and delay the
> Configuration update() X milliseconds, but that sounds fairly britle.
>
> The larger question is more like; Isn't it necessary for the framework to be
> lock-free when calling out to code outside its control??
> (that was something I concluded with my own similar framework back in 1999)
>
>
> [1] http://issues.ops4j.org/jira/browse/PAXLOGGING-20
>
>
> Cheers
> --
> Niclas Hedhman, Software Developer
>
> I  live here; http://tinyurl.com/2qq9er
> I  work here; http://tinyurl.com/2ymelc
> I relax here; http://tinyurl.com/2cgsug
>



-- 
Karl Pauls
karlpauls@gmail.com