Posted to users@qpid.apache.org by Rob Springer <rs...@etinternational.com> on 2012/01/03 21:18:23 UTC

Broker death recovery

Hi all,
   In our application (we've tried both 0.5 and 0.12), we'd like for our 
client programs to quickly recover in the case where a broker dies. 
Currently, we're able to do this by dynamically allocating all our 
Qpid-using code, and simply re-allocating should the broker die, but 
that seems inelegant and feels...wrong.
   If we attempt to reconnect and don't create a new Session (i.e., use 
the old one), bad things happen (since Session doesn't yet support 
resume(), I assume that's expected behavior).
   When we then try to create a new Session, a new SubscriptionManager, 
and a new Subscription, we get an assertion failure (backtrace at the 
end of this message).
   After reading the backtrace, I believe the following is happening:
1) In recovery, we attempt to assign a new Subscription to the previous 
Subscription variable (i.e., "sub = subMgr->subscribe()")
2) That causes the refcount for the old Subscription to fall to 0,
causing it to be cleaned up.
3) As part of that cleanup, the associated SubscriptionImpl object
goes to destroy its (std::auto_ptr<ScopedDivert>) demuxRule member.
4) That demuxRule member maintains a reference to a Demux object,
demuxer, which exists inside the Session object.

Thus, we have a fatal circle - we need to create a new Session object to 
be able to proceed, but when we do so, we render ourselves unable to
re-use Subscription variables.
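
To make the ordering in steps 1-4 concrete, here is a minimal sketch that models it without Qpid. Everything in it (ModelDemux, ModelSession, ModelSubscription, naiveRecoveryTouchesDeadDemux) is a hypothetical stand-in, not a Qpid class. The model keeps the demux in a shared_ptr purely so that it can check an "alive" flag safely; as described above, the real subscription only holds a reference, which is why the real code ends up locking a destroyed mutex. It assumes a C++14 compiler.

```cpp
#include <cassert>
#include <memory>

// Hypothetical stand-ins for illustration; these are NOT the Qpid classes.
struct ModelDemux {
    bool alive = true;  // flipped to false when the owning session goes away
    int rules = 0;      // number of registered diversion rules
};

// Plays the role of Session: it owns the demux.
class ModelSession {
public:
    ModelSession() : demux_(std::make_shared<ModelDemux>()) {}
    ~ModelSession() { demux_->alive = false; }
    ModelSession(const ModelSession&) = delete;
    ModelSession& operator=(const ModelSession&) = delete;
    std::shared_ptr<ModelDemux> demux() const { return demux_; }
private:
    std::shared_ptr<ModelDemux> demux_;
};

// Plays the role of SubscriptionImpl plus its ScopedDivert member: it
// registers a rule on construction and removes it on destruction.
class ModelSubscription {
public:
    ModelSubscription(const ModelSession& s, bool& touchedDeadDemux)
        : demux_(s.demux()), touchedDeadDemux_(touchedDeadDemux) {
        ++demux_->rules;
    }
    ~ModelSubscription() {
        // The real code would try to lock a destroyed mutex at this point.
        if (!demux_->alive) touchedDeadDemux_ = true;
        --demux_->rules;
    }
private:
    std::shared_ptr<ModelDemux> demux_;  // shared only so the model can inspect it
    bool& touchedDeadDemux_;
};

// Recovery in the order described above: replace the session first, then
// overwrite the old subscription variable.
bool naiveRecoveryTouchesDeadDemux() {
    bool touched = false;
    auto session = std::make_unique<ModelSession>();
    auto sub = std::make_unique<ModelSubscription>(*session, touched);
    assert(session->demux()->rules == 1);

    session = std::make_unique<ModelSession>();  // old session and demux go away
    sub = std::make_unique<ModelSubscription>(*session, touched);  // old sub cleaned up now
    return touched;
}
```

In this model the function returns true: the old subscription is cleaned up only after the session that owned its demux has already gone.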

Unfortunately, I can't think of an easy/simple fix, besides perhaps 
adding reference counting to the Demux variable...although I haven't 
thought that through at all.

I was wondering if you were aware of this sort of issue, and if so, if 
there were plans to resolve it or ideas on how to resolve it.

Thanks a ton!
-rob

Backtrace (plus a little other GDB output):
Invalid argument
myServer: /home/rspringer/qpid_work/qpid-0.12/cpp/src/../include/qpid/sys/posix/Mutex.h:116: void qpid::sys::Mutex::lock(): Assertion `0' failed.

Program received signal SIGABRT, Aborted.
0x00007ffff665e3a5 in raise () from /lib/x86_64-linux-gnu/libc.so.6
(gdb) bt
#0  0x00007ffff665e3a5 in raise () from /lib/x86_64-linux-gnu/libc.so.6
#1  0x00007ffff6661b0b in abort () from /lib/x86_64-linux-gnu/libc.so.6
#2  0x00007ffff6656d4d in __assert_fail () from /lib/x86_64-linux-gnu/libc.so.6
#3  0x00007ffff7b37482 in qpid::sys::Mutex::lock (this=0x6697b8)
     at /home/rspringer/qpid_work/qpid-0.12/cpp/src/../include/qpid/sys/posix/Mutex.h:116
#4  0x00007ffff7b37ce3 in ScopedLock (this=0x7fffffffdbb0, l=...)
     at /home/rspringer/qpid_work/qpid-0.12/cpp/src/../include/qpid/sys/Mutex.h:33
#5  0x00007ffff7b57a80 in qpid::client::Demux::remove (this=0x6697b8, name="queue-serverRecv")
     at /home/rspringer/qpid_work/qpid-0.12/cpp/src/qpid/client/Demux.cpp:105
#6  0x00007ffff7b5728c in ~ScopedDivert (this=0x6263d0, __in_chrg=<value optimized out>)
     at /home/rspringer/qpid_work/qpid-0.12/cpp/src/qpid/client/Demux.cpp:45
#7  0x00007ffff7b75f67 in ~auto_ptr (this=0x66a160, __in_chrg=<value optimized out>)
     at /usr/include/c++/4.6/backward/auto_ptr.h:170
#8  0x00007ffff7b77c34 in ~SubscriptionImpl (this=0x66a070, __in_chrg=<value optimized out>)
     at /home/rspringer/qpid_work/qpid-0.12/cpp/src/qpid/client/SubscriptionImpl.h:43
#9  0x00007ffff7b77d82 in ~SubscriptionImpl (this=0x66a070, __in_chrg=<value optimized out>)
     at /home/rspringer/qpid_work/qpid-0.12/cpp/src/qpid/client/SubscriptionImpl.h:43
#10 0x00007ffff7b2c15c in qpid::RefCounted::released (this=0x66a070)
     at /home/rspringer/qpid_work/qpid-0.12/cpp/src/qpid/RefCounted.h:48
#11 0x00007ffff7b380b7 in qpid::RefCounted::release (this=0x66a070)
     at /home/rspringer/qpid_work/qpid-0.12/cpp/src/qpid/RefCounted.h:42
#12 0x00007ffff7b5e2fe in boost::intrusive_ptr_release<qpid::client::SubscriptionImpl> (p=0x66a070)
     at /home/rspringer/qpid_work/qpid-0.12/cpp/src/qpid/RefCounted.h:59
#13 0x00007ffff7b74a9c in qpid::client::PrivateImplRef<qpid::client::Subscription>::dtor (t=...)
     at /home/rspringer/qpid_work/qpid-0.12/cpp/src/qpid/client/PrivateImplRef.h:88
#14 0x00007ffff7b7463a in ~Subscription (this=0x7fffffffde08, __in_chrg=<value optimized out>)
     at /home/rspringer/qpid_work/qpid-0.12/cpp/src/qpid/client/Subscription.cpp:33
#15 0x000000000040c355 in ~CQpidConnection (this=0x7fffffffdd80, __in_chrg=<value optimized out>) at qpidConnection.h:71
#16 0x000000000040bec7 in example0 () at myServer.cpp:9
#17 0x000000000040c095 in main () at myServer.cpp:86
(gdb)

---------------------------------------------------------------------
Apache Qpid - AMQP Messaging Implementation
Project:      http://qpid.apache.org
Use/Interact: mailto:users-subscribe@qpid.apache.org


Re: Broker death recovery

Posted by rspringer <rs...@etinternational.com>.
Gordon,

Gordon Sim wrote
> 
> As a workaround, can you first assign a 'null' Subscription to the 
> subscription variable and only then recreate the Session and 
> SubscriptionManager, then finally reassign the variable with the real 
> Subscription?
> 
> For an actual fix, perhaps a destructor in SubscriptionManagerImpl that 
> calls cancelDiversion() on all its Subscription instances would
> suffice(?).
> 
Good call! I had to also assign a "null" to the associated LocalQueue
object, but after doing so, it worked swimmingly. Thanks!
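
For anyone finding this thread later, the order that worked can be sketched roughly as follows. This is a model only: ModelImpl, ModelHandle and recoverWithWorkaround are hypothetical names, and a shared_ptr stands in for the handle types, with a default-constructed handle playing the part of the "null" Subscription and LocalQueue. The sketch just records the order in which things are created and destroyed; it does not talk to a broker.

```cpp
#include <cassert>
#include <memory>
#include <string>
#include <vector>

// Hypothetical stand-ins for illustration; these are NOT the Qpid classes.
using Log = std::vector<std::string>;

// Records its own construction and destruction in a shared log.
struct ModelImpl {
    ModelImpl(Log& l, const std::string& n) : log(l), name(n) {
        log.push_back(name + " created");
    }
    ~ModelImpl() { log.push_back(name + " destroyed"); }
    Log& log;
    std::string name;
};

// A handle with value semantics; a default-constructed handle is "null".
using ModelHandle = std::shared_ptr<ModelImpl>;

Log recoverWithWorkaround() {
    Log log;
    {
        ModelHandle session = std::make_shared<ModelImpl>(log, "session");
        ModelHandle sub = std::make_shared<ModelImpl>(log, "subscription");
        ModelHandle queue = std::make_shared<ModelImpl>(log, "local queue");

        // Broker death is detected here.

        // 1. Null out everything that hangs off the old session while that
        //    session still exists.
        sub = ModelHandle();
        queue = ModelHandle();

        // 2. Only now let go of the old session and build a new one.
        session = ModelHandle();
        session = std::make_shared<ModelImpl>(log, "session");

        // 3. Reassign the variables with the real objects.
        sub = std::make_shared<ModelImpl>(log, "subscription");
        queue = std::make_shared<ModelImpl>(log, "local queue");
    }
    return log;
}
```

The point of the ordering is that the old subscription and local queue are released before the old session, so their cleanup never runs against a session that no longer exists.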


Gordon Sim wrote
> 
>> I was wondering if you were aware of this sort of issue, and if so, if
>> there were plans to resolve it or ideas on how to resolve it.
> 
> I wasn't aware of this specific issue. We've been encouraging people to 
> use the newer messaging API instead of this older client API. The 
> messaging API offers a cleaner, higher level abstraction that makes 
> migration to newer versions of the protocol simpler and also makes it 
> simpler to provide richer functionality behind the API (such as 
> auto-reconnect).
> 
That's a good point - we had experimented with the newer API and, indeed, we
didn't observe this or any similar problem. Unfortunately, we weren't able
(read as: allowed) to convert our existing codebase to the new API...

Thanks again - this has been a thorn in our collective side for a while!
-rob


--
View this message in context: http://qpid.2158936.n2.nabble.com/Broker-death-recovery-tp7148014p7151005.html
Sent from the Apache Qpid users mailing list archive at Nabble.com.



Re: Broker death recovery

Posted by Gordon Sim <gs...@redhat.com>.
On 01/03/2012 08:18 PM, Rob Springer wrote:
> Hi all,
> In our application (we've tried both 0.5 and 0.12), we'd like for our
> client programs to quickly recover in the case where a broker dies.
> Currently, we're able to do this by dynamically allocating all our
> Qpid-using code, and simply re-allocating should the broker die, but
> that seems inelegant and feels...wrong.
> If we attempt to reconnect and don't create a new Session (i.e., use the
> old one), bad things happen (since Session doesn't yet support resume(),
> I assume that's expected behavior).
> When we then try to create a new Session, a new SubscriptionManager, and
> a new Subscription, we get an assertion failure (backtrace at the end of
> this message).
> After reading the backtrace, I believe the following is happening:
> 1) In recovery, we attempt to assign a new Subscription to the previous
> Subscription variable (i.e., "sub = subMgr->subscribe()")
> 2) That causes the refcount for the old Subscription to fall to 0,
> causing it to be cleaned up.
> 3) As part of that cleanup, the associated SubscriptionImpl object
> goes to destroy its (std::auto_ptr<ScopedDivert>) demuxRule member.
> 4) That demuxRule member maintains a reference to a Demux object,
> demuxer, which exists inside the Session object.
>
> Thus, we have a fatal circle - we need to create a new Session object to
> be able to proceed, but when we do so, we render ourselves unable to
> re-use Subscription variables.
>
> Unfortunately, I can't think of an easy/simple fix, besides perhaps
> adding reference counting to the Demux variable...although I haven't
> thought that through at all.

As a workaround, can you first assign a 'null' Subscription to the 
subscription variable and only then recreate the Session and 
SubscriptionManager, then finally reassign the variable with the real 
Subscription?

For an actual fix, perhaps a destructor in SubscriptionManagerImpl that 
calls cancelDiversion() on all its Subscription instances would suffice(?).
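
A rough sketch of that idea, using stand-in classes rather than the real ones: ModelDemux, ModelSubscriptionImpl, ModelSubscriptionManager and handleOutlivesManagerSafely are all hypothetical, and the sketch assumes the manager keeps a reference to every subscription it creates. The cancelDiversion() here is a model of the intent (drop the rule and forget the demux), not a copy of the actual implementation.

```cpp
#include <cassert>
#include <memory>
#include <vector>

// Hypothetical stand-ins for illustration; these are NOT the Qpid classes.
struct ModelDemux {
    int rules = 0;  // number of registered diversion rules
};

class ModelSubscriptionImpl {
public:
    explicit ModelSubscriptionImpl(ModelDemux& d) : demux_(&d) { ++demux_->rules; }
    ~ModelSubscriptionImpl() { cancelDiversion(); }

    // Idempotent: removes the rule once and then forgets the demux, so a
    // later destructor call never touches it again.
    void cancelDiversion() {
        if (demux_) {
            --demux_->rules;
            demux_ = nullptr;
        }
    }
    bool diverted() const { return demux_ != nullptr; }
private:
    ModelDemux* demux_;
};

class ModelSubscriptionManager {
public:
    explicit ModelSubscriptionManager(ModelDemux& d) : demux_(d) {}

    // The proposed fix: detach every subscription before the demux can go away.
    ~ModelSubscriptionManager() {
        for (auto& s : subs_) s->cancelDiversion();
    }

    std::shared_ptr<ModelSubscriptionImpl> subscribe() {
        auto s = std::make_shared<ModelSubscriptionImpl>(demux_);
        subs_.push_back(s);
        return s;
    }
private:
    ModelDemux& demux_;
    std::vector<std::shared_ptr<ModelSubscriptionImpl>> subs_;
};

// A user-held handle outlives both the manager and the demux.
bool handleOutlivesManagerSafely() {
    std::shared_ptr<ModelSubscriptionImpl> sub;
    {
        ModelDemux demux;
        ModelSubscriptionManager mgr(demux);
        sub = mgr.subscribe();
        assert(demux.rules == 1);
    }  // mgr is destroyed first and cancels the diversion, then demux goes away
    return !sub->diverted();  // destroying sub later no longer touches the demux
}
```

With the diversion cancelled in the manager's destructor, a Subscription variable that is overwritten or destroyed afterwards has nothing left to remove from the old session.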

> I was wondering if you were aware of this sort of issue, and if so, if
> there were plans to resolve it or ideas on how to resolve it.

I wasn't aware of this specific issue. We've been encouraging people to 
use the newer messaging API instead of this older client API. The 
messaging API offers a cleaner, higher level abstraction that makes 
migration to newer versions of the protocol simpler and also makes it 
simpler to provide richer functionality behind the API (such as 
auto-reconnect).

