You are viewing a plain text version of this content. The canonical link for it is here.
Posted to users@activemq.apache.org by Takeshi Fukushima <ta...@gmail.com> on 2022/01/20 17:08:17 UTC

Possible deadlock with MQTT

Hello everyone

Last month I've opened up a ticket regarding a deadlock that was occurring
on one of our servers.
https://issues.apache.org/jira/browse/ARTEMIS-3622

After much trial and error, I think I've narrowed it down the cause and
it's in the comments, but being a racing condition I cannot reliably
reproduce in a test case. I can reproduce manually (carefully advancing the
threads in the debugger).

I also did patch our server with the quickfix indicated in the comments and
I will let you know if that fixes the problem (so far so good, but its been
only about 18 hours). If everything goes well, I will create a PR but I
would like to know if a simple solution like that (marking a couple of
methods as synchronized) would be acceptable, even though I think it
doesn't fixes all the cases (that one would require much in depth changes I
think).

Thanks for the great software anyway,
takeshi

-- 
http://mapsdev.blogspot.com/
Marcelo Takeshi Fukushima

Re: Possible deadlock with MQTT

Posted by Takeshi Fukushima <ta...@gmail.com>.
Hi all

I've managed to write a test that somewhat reliably reproduces the error.
It also, sometimes, throws a NPE inside MQTTConnectionManager.connect (and
I think they have the same source, a racing condition on MQTTSession.state)
It cannot be merged into anything serious because it has some number of
Thread.sleep( ) calls to interleave calls in just the right times to cause
the deadlock. If there's any interest, I'd be happy to push it to somewhere
public so its easier to try to fix the error.
For what's worth, our patched version has been running with my 'fix' for
about a week without any issue (it was happening anywhere between 1 and 5
times a week).

Again, thanks for the great software
takeshi

On Thu, Jan 20, 2022 at 5:58 PM Takeshi Fukushima <ta...@gmail.com>
wrote:

> Oh I understand that you didn't specifically target the issue. I was just
> hoping that it was a 'collateral damage' of a large refactor =P.
>
> I also agree 100% with your comment about removing synchronization and I
> double that if we are talking about global variables.
>
> Thanks for hearing me out,
> takeshi
>
> On Thu, Jan 20, 2022 at 5:04 PM Justin Bertram <jb...@apache.org>
> wrote:
>
>> To be clear, I didn't expect my PR to resolve the issue. My point simply
>> was that it's a large change, and I don't want to have to resolve the
>> conflicts that may arise from fixing the deadlock before the PR is merged.
>>
>> In my opinion, all the synchronization around the session state is not
>> ideal. I'd like to find a way to refactor it completely to employ
>> something
>> more elegant.
>>
>>
>> Justin
>>
>> On Thu, Jan 20, 2022 at 1:32 PM Takeshi Fukushima <ta...@gmail.com>
>> wrote:
>>
>> > Hi Justin
>> >
>> > I realize that my fix is not the best, I just don't know the codebase
>> well
>> > enough to do a better job. That said, the methods that I've marked as
>> > synchronized (MQTTConnectionManager connect and disconnect) already
>> > synchronize themselves on the session state, but the disconnect method
>> > synchronizes on the sessionState, which might get swapped before it has
>> a
>> > chance to call (synchronized) MQTTSession.stop (which in turn, calls the
>> > synchronized MQTTSession.clear but on a different instance of the
>> session).
>> > On top of that, since a freshly created MQTTSession uses a global
>> > MQTTSessionState to begin with (and some methods synchronizes over the
>> > state), I think this causes more (unintended) contention and may be
>> another
>> > source of deadlocks.
>> >
>> > I've looked at your PR and, unfortunately, the issue still remains as
>> far
>> > as I can tell. I will be happy to help if you need anything and I will
>> do
>> > my best to create a testcase that can at least sometimes reproduce the
>> > problem (which unfortunately depends all too much on the thread
>> scheduling
>> > order).
>> >
>> > Thanks for the response,
>> > takeshi
>> >
>> > On Thu, Jan 20, 2022 at 4:09 PM Justin Bertram <jb...@apache.org>
>> > wrote:
>> >
>> > > Making a method synchronized typically isn't the preferred solution to
>> > > concurrency issues due to the performance penalty it typically
>> imposes.
>> > > Generally targeted locks are better. However, concurrency issues are
>> > often
>> > > quite difficult to diagnose and resolve so sometimes a less-than-ideal
>> > > solution is better than broken code.
>> > >
>> > > I currently have a large PR [1] for MQTT 5 support that is yet to be
>> > > merged. Once that is merged I'll take a closer look at the deadlock
>> issue
>> > > from ARTEMIS-3622 that you reported.
>> > >
>> > >
>> > > Justin
>> > >
>> > > [1] https://github.com/apache/activemq-artemis/pull/3907
>> > >
>> > > On Thu, Jan 20, 2022 at 11:08 AM Takeshi Fukushima <
>> takeshi10@gmail.com>
>> > > wrote:
>> > >
>> > > > Hello everyone
>> > > >
>> > > > Last month I've opened up a ticket regarding a deadlock that was
>> > > occurring
>> > > > on one of our servers.
>> > > > https://issues.apache.org/jira/browse/ARTEMIS-3622
>> > > >
>> > > > After much trial and error, I think I've narrowed it down the cause
>> and
>> > > > it's in the comments, but being a racing condition I cannot reliably
>> > > > reproduce in a test case. I can reproduce manually (carefully
>> advancing
>> > > the
>> > > > threads in the debugger).
>> > > >
>> > > > I also did patch our server with the quickfix indicated in the
>> comments
>> > > and
>> > > > I will let you know if that fixes the problem (so far so good, but
>> its
>> > > been
>> > > > only about 18 hours). If everything goes well, I will create a PR
>> but I
>> > > > would like to know if a simple solution like that (marking a couple
>> of
>> > > > methods as synchronized) would be acceptable, even though I think it
>> > > > doesn't fixes all the cases (that one would require much in depth
>> > > changes I
>> > > > think).
>> > > >
>> > > > Thanks for the great software anyway,
>> > > > takeshi
>> > > >
>> > > > --
>> > > > http://mapsdev.blogspot.com/
>> > > > Marcelo Takeshi Fukushima
>> > > >
>> > >
>> >
>> >
>> > --
>> > http://mapsdev.blogspot.com/
>> > Marcelo Takeshi Fukushima
>> >
>>
>
>
> --
> http://mapsdev.blogspot.com/
> Marcelo Takeshi Fukushima
>


-- 
http://mapsdev.blogspot.com/
Marcelo Takeshi Fukushima

Re: Possible deadlock with MQTT

Posted by Takeshi Fukushima <ta...@gmail.com>.
Oh I understand that you didn't specifically target the issue. I was just
hoping that it was a 'collateral damage' of a large refactor =P.

I also agree 100% with your comment about removing synchronization and I
double that if we are talking about global variables.

Thanks for hearing me out,
takeshi

On Thu, Jan 20, 2022 at 5:04 PM Justin Bertram <jb...@apache.org> wrote:

> To be clear, I didn't expect my PR to resolve the issue. My point simply
> was that it's a large change, and I don't want to have to resolve the
> conflicts that may arise from fixing the deadlock before the PR is merged.
>
> In my opinion, all the synchronization around the session state is not
> ideal. I'd like to find a way to refactor it completely to employ something
> more elegant.
>
>
> Justin
>
> On Thu, Jan 20, 2022 at 1:32 PM Takeshi Fukushima <ta...@gmail.com>
> wrote:
>
> > Hi Justin
> >
> > I realize that my fix is not the best, I just don't know the codebase
> well
> > enough to do a better job. That said, the methods that I've marked as
> > synchronized (MQTTConnectionManager connect and disconnect) already
> > synchronize themselves on the session state, but the disconnect method
> > synchronizes on the sessionState, which might get swapped before it has a
> > chance to call (synchronized) MQTTSession.stop (which in turn, calls the
> > synchronized MQTTSession.clear but on a different instance of the
> session).
> > On top of that, since a freshly created MQTTSession uses a global
> > MQTTSessionState to begin with (and some methods synchronizes over the
> > state), I think this causes more (unintended) contention and may be
> another
> > source of deadlocks.
> >
> > I've looked at your PR and, unfortunately, the issue still remains as far
> > as I can tell. I will be happy to help if you need anything and I will do
> > my best to create a testcase that can at least sometimes reproduce the
> > problem (which unfortunately depends all too much on the thread
> scheduling
> > order).
> >
> > Thanks for the response,
> > takeshi
> >
> > On Thu, Jan 20, 2022 at 4:09 PM Justin Bertram <jb...@apache.org>
> > wrote:
> >
> > > Making a method synchronized typically isn't the preferred solution to
> > > concurrency issues due to the performance penalty it typically imposes.
> > > Generally targeted locks are better. However, concurrency issues are
> > often
> > > quite difficult to diagnose and resolve so sometimes a less-than-ideal
> > > solution is better than broken code.
> > >
> > > I currently have a large PR [1] for MQTT 5 support that is yet to be
> > > merged. Once that is merged I'll take a closer look at the deadlock
> issue
> > > from ARTEMIS-3622 that you reported.
> > >
> > >
> > > Justin
> > >
> > > [1] https://github.com/apache/activemq-artemis/pull/3907
> > >
> > > On Thu, Jan 20, 2022 at 11:08 AM Takeshi Fukushima <
> takeshi10@gmail.com>
> > > wrote:
> > >
> > > > Hello everyone
> > > >
> > > > Last month I've opened up a ticket regarding a deadlock that was
> > > occurring
> > > > on one of our servers.
> > > > https://issues.apache.org/jira/browse/ARTEMIS-3622
> > > >
> > > > After much trial and error, I think I've narrowed it down the cause
> and
> > > > it's in the comments, but being a racing condition I cannot reliably
> > > > reproduce in a test case. I can reproduce manually (carefully
> advancing
> > > the
> > > > threads in the debugger).
> > > >
> > > > I also did patch our server with the quickfix indicated in the
> comments
> > > and
> > > > I will let you know if that fixes the problem (so far so good, but
> its
> > > been
> > > > only about 18 hours). If everything goes well, I will create a PR
> but I
> > > > would like to know if a simple solution like that (marking a couple
> of
> > > > methods as synchronized) would be acceptable, even though I think it
> > > > doesn't fixes all the cases (that one would require much in depth
> > > changes I
> > > > think).
> > > >
> > > > Thanks for the great software anyway,
> > > > takeshi
> > > >
> > > > --
> > > > http://mapsdev.blogspot.com/
> > > > Marcelo Takeshi Fukushima
> > > >
> > >
> >
> >
> > --
> > http://mapsdev.blogspot.com/
> > Marcelo Takeshi Fukushima
> >
>


-- 
http://mapsdev.blogspot.com/
Marcelo Takeshi Fukushima

Re: Possible deadlock with MQTT

Posted by Justin Bertram <jb...@apache.org>.
To be clear, I didn't expect my PR to resolve the issue. My point simply
was that it's a large change, and I don't want to have to resolve the
conflicts that may arise from fixing the deadlock before the PR is merged.

In my opinion, all the synchronization around the session state is not
ideal. I'd like to find a way to refactor it completely to employ something
more elegant.


Justin

On Thu, Jan 20, 2022 at 1:32 PM Takeshi Fukushima <ta...@gmail.com>
wrote:

> Hi Justin
>
> I realize that my fix is not the best, I just don't know the codebase well
> enough to do a better job. That said, the methods that I've marked as
> synchronized (MQTTConnectionManager connect and disconnect) already
> synchronize themselves on the session state, but the disconnect method
> synchronizes on the sessionState, which might get swapped before it has a
> chance to call (synchronized) MQTTSession.stop (which in turn, calls the
> synchronized MQTTSession.clear but on a different instance of the session).
> On top of that, since a freshly created MQTTSession uses a global
> MQTTSessionState to begin with (and some methods synchronizes over the
> state), I think this causes more (unintended) contention and may be another
> source of deadlocks.
>
> I've looked at your PR and, unfortunately, the issue still remains as far
> as I can tell. I will be happy to help if you need anything and I will do
> my best to create a testcase that can at least sometimes reproduce the
> problem (which unfortunately depends all too much on the thread scheduling
> order).
>
> Thanks for the response,
> takeshi
>
> On Thu, Jan 20, 2022 at 4:09 PM Justin Bertram <jb...@apache.org>
> wrote:
>
> > Making a method synchronized typically isn't the preferred solution to
> > concurrency issues due to the performance penalty it typically imposes.
> > Generally targeted locks are better. However, concurrency issues are
> often
> > quite difficult to diagnose and resolve so sometimes a less-than-ideal
> > solution is better than broken code.
> >
> > I currently have a large PR [1] for MQTT 5 support that is yet to be
> > merged. Once that is merged I'll take a closer look at the deadlock issue
> > from ARTEMIS-3622 that you reported.
> >
> >
> > Justin
> >
> > [1] https://github.com/apache/activemq-artemis/pull/3907
> >
> > On Thu, Jan 20, 2022 at 11:08 AM Takeshi Fukushima <ta...@gmail.com>
> > wrote:
> >
> > > Hello everyone
> > >
> > > Last month I've opened up a ticket regarding a deadlock that was
> > occurring
> > > on one of our servers.
> > > https://issues.apache.org/jira/browse/ARTEMIS-3622
> > >
> > > After much trial and error, I think I've narrowed it down the cause and
> > > it's in the comments, but being a racing condition I cannot reliably
> > > reproduce in a test case. I can reproduce manually (carefully advancing
> > the
> > > threads in the debugger).
> > >
> > > I also did patch our server with the quickfix indicated in the comments
> > and
> > > I will let you know if that fixes the problem (so far so good, but its
> > been
> > > only about 18 hours). If everything goes well, I will create a PR but I
> > > would like to know if a simple solution like that (marking a couple of
> > > methods as synchronized) would be acceptable, even though I think it
> > > doesn't fixes all the cases (that one would require much in depth
> > changes I
> > > think).
> > >
> > > Thanks for the great software anyway,
> > > takeshi
> > >
> > > --
> > > http://mapsdev.blogspot.com/
> > > Marcelo Takeshi Fukushima
> > >
> >
>
>
> --
> http://mapsdev.blogspot.com/
> Marcelo Takeshi Fukushima
>

Re: Possible deadlock with MQTT

Posted by Takeshi Fukushima <ta...@gmail.com>.
Hi Justin

I realize that my fix is not the best, I just don't know the codebase well
enough to do a better job. That said, the methods that I've marked as
synchronized (MQTTConnectionManager connect and disconnect) already
synchronize themselves on the session state, but the disconnect method
synchronizes on the sessionState, which might get swapped before it has a
chance to call (synchronized) MQTTSession.stop (which in turn, calls the
synchronized MQTTSession.clear but on a different instance of the session).
On top of that, since a freshly created MQTTSession uses a global
MQTTSessionState to begin with (and some methods synchronizes over the
state), I think this causes more (unintended) contention and may be another
source of deadlocks.

I've looked at your PR and, unfortunately, the issue still remains as far
as I can tell. I will be happy to help if you need anything and I will do
my best to create a testcase that can at least sometimes reproduce the
problem (which unfortunately depends all too much on the thread scheduling
order).

Thanks for the response,
takeshi

On Thu, Jan 20, 2022 at 4:09 PM Justin Bertram <jb...@apache.org> wrote:

> Making a method synchronized typically isn't the preferred solution to
> concurrency issues due to the performance penalty it typically imposes.
> Generally targeted locks are better. However, concurrency issues are often
> quite difficult to diagnose and resolve so sometimes a less-than-ideal
> solution is better than broken code.
>
> I currently have a large PR [1] for MQTT 5 support that is yet to be
> merged. Once that is merged I'll take a closer look at the deadlock issue
> from ARTEMIS-3622 that you reported.
>
>
> Justin
>
> [1] https://github.com/apache/activemq-artemis/pull/3907
>
> On Thu, Jan 20, 2022 at 11:08 AM Takeshi Fukushima <ta...@gmail.com>
> wrote:
>
> > Hello everyone
> >
> > Last month I've opened up a ticket regarding a deadlock that was
> occurring
> > on one of our servers.
> > https://issues.apache.org/jira/browse/ARTEMIS-3622
> >
> > After much trial and error, I think I've narrowed it down the cause and
> > it's in the comments, but being a racing condition I cannot reliably
> > reproduce in a test case. I can reproduce manually (carefully advancing
> the
> > threads in the debugger).
> >
> > I also did patch our server with the quickfix indicated in the comments
> and
> > I will let you know if that fixes the problem (so far so good, but its
> been
> > only about 18 hours). If everything goes well, I will create a PR but I
> > would like to know if a simple solution like that (marking a couple of
> > methods as synchronized) would be acceptable, even though I think it
> > doesn't fixes all the cases (that one would require much in depth
> changes I
> > think).
> >
> > Thanks for the great software anyway,
> > takeshi
> >
> > --
> > http://mapsdev.blogspot.com/
> > Marcelo Takeshi Fukushima
> >
>


-- 
http://mapsdev.blogspot.com/
Marcelo Takeshi Fukushima

Re: Possible deadlock with MQTT

Posted by Justin Bertram <jb...@apache.org>.
Making a method synchronized typically isn't the preferred solution to
concurrency issues due to the performance penalty it typically imposes.
Generally targeted locks are better. However, concurrency issues are often
quite difficult to diagnose and resolve so sometimes a less-than-ideal
solution is better than broken code.

I currently have a large PR [1] for MQTT 5 support that is yet to be
merged. Once that is merged I'll take a closer look at the deadlock issue
from ARTEMIS-3622 that you reported.


Justin

[1] https://github.com/apache/activemq-artemis/pull/3907

On Thu, Jan 20, 2022 at 11:08 AM Takeshi Fukushima <ta...@gmail.com>
wrote:

> Hello everyone
>
> Last month I've opened up a ticket regarding a deadlock that was occurring
> on one of our servers.
> https://issues.apache.org/jira/browse/ARTEMIS-3622
>
> After much trial and error, I think I've narrowed it down the cause and
> it's in the comments, but being a racing condition I cannot reliably
> reproduce in a test case. I can reproduce manually (carefully advancing the
> threads in the debugger).
>
> I also did patch our server with the quickfix indicated in the comments and
> I will let you know if that fixes the problem (so far so good, but its been
> only about 18 hours). If everything goes well, I will create a PR but I
> would like to know if a simple solution like that (marking a couple of
> methods as synchronized) would be acceptable, even though I think it
> doesn't fixes all the cases (that one would require much in depth changes I
> think).
>
> Thanks for the great software anyway,
> takeshi
>
> --
> http://mapsdev.blogspot.com/
> Marcelo Takeshi Fukushima
>