You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues@activemq.apache.org by "Thomas Pohl (Jira)" <ji...@apache.org> on 2021/01/13 15:47:00 UTC
[jira] [Commented] (AMQCPP-646) Segmentation fault in TransportFilter & FailoverTransportListener

    [ https://issues.apache.org/jira/browse/AMQCPP-646?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17264222#comment-17264222 ] 

Thomas Pohl commented on AMQCPP-646:
------------------------------------

I've seen this issue, too. It exists in two flavors. Running through the listener chain via onCommand or onException. Both boil down to the race condition destroying the ActiveMQConnection and finally the listeners. The listeners and objects they deal with are no longer fully intact.  And  The issue [https://jira.apache.org/jira/browse/AMQCPP-534] fixed some of the problems via a Disposable Default listener. But that's not enough. It just takes longer until you hit the segmentation fault.

AMQCPP-583 is also related and just another incarnation of this one. The attached patches didn't work for me fully. But I've managed to fix this with two patches:
 # set the default listener in close method of the IOTransport (see iotransport.patch) -> fixes most likely AMQCPP-583
 # synchronize this->listener in the TransportFiler. I just hit too many cases where the NULL check for this->listener was ok, but then the pointer got changed and a segfault occurs. (see transportFilter.patch) -> please note that I was trying to keep ABI compatibility in the patches and made only the most minimal changes.

To reproduce the race conditions I managed to write a small test program that just creates connections, send messages to an listener and destroy everything again in loops. When running this program in a loop for about 2 hours you most likely hit the segfault.

> Segmentation fault in TransportFilter & FailoverTransportListener
> -----------------------------------------------------------------
>
>                 Key: AMQCPP-646
>                 URL: https://issues.apache.org/jira/browse/AMQCPP-646
>             Project: ActiveMQ C++ Client
>          Issue Type: Bug
>          Components: Transports
>    Affects Versions: 3.9.3, 3.9.4
>         Environment: Red Hat Enterprise 7.6
>            Reporter: Hassan Raza
>            Assignee: Timothy A. Bish
>            Priority: Critical
>         Attachments: code.26.11.2020.patch, code.patch, iotransport.patch, transportFilter.patch
>
>
> 26/11/2020 UPDATE:
> We ran into segmentation faults even with our patched library, which looked like this:
> -|/.../libactivemq-cpp.so.19
>  -|[18] : decaf::util::concurrent::Lock::lock()+0x1f
>  -|[19] : decaf::util::concurrent::Lock::Lock(decaf::util::concurrent::Synchronizable*, bool)+0x55
>  -|[20] : activemq::core::ActiveMQConnection::onCommand(decaf::lang::Pointer<activemq::commands::Command, decaf::util::concurrent::atomic::AtomicRefCounter>)+0x104
>  -|[21] : activemq::transport::TransportFilter::onCommand(decaf::lang::Pointer<activemq::commands::Command, decaf::util::concurrent::atomic::AtomicRefCounter>)+0xb5
>  -|[22] : activemq::transport::correlator::ResponseCorrelator::onCommand(decaf::lang::Pointer<activemq::commands::Command, decaf::util::concurrent::atomic::AtomicRefCounter>)+0xac
>  -|[23] : activemq::transport::failover::FailoverTransportListener::onCommand(decaf::lang::Pointer<activemq::commands::Command, decaf::util::concurrent::atomic::AtomicRefCounter>)+0x305
>  -|[24] : activemq::transport::TransportFilter::onCommand(decaf::lang::Pointer<activemq::commands::Command, decaf::util::concurrent::atomic::AtomicRefCounter>)+0xb5
>  -|[25] : activemq::wireformat::openwire::OpenWireFormatNegotiator::onCommand(decaf::lang::Pointer<activemq::commands::Command, decaf::util::concurrent::atomic::AtomicRefCounter>)+0xe6
>  -|[26] : activemq::transport::TransportFilter::onCommand(decaf::lang::Pointer<activemq::commands::Command, decaf::util::concurrent::atomic::AtomicRefCounter>)+0xb5
>  -|[27] : activemq::transport::inactivity::InactivityMonitor::onCommand(decaf::lang::Pointer<activemq::commands::Command, decaf::util::concurrent::atomic::AtomicRefCounter>)+0x300
>  -|[28] : activemq::transport::TransportFilter::onCommand(decaf::lang::Pointer<activemq::commands::Command, decaf::util::concurrent::atomic::AtomicRefCounter>)+0xb5
>  -|[29] : activemq::transport::IOTransport::fire(decaf::lang::Pointer<activemq::commands::Command, decaf::util::concurrent::atomic::AtomicRefCounter>)+0x9a
>  -|[30] : activemq::transport::IOTransport::run()+0xd0
>  -|[31] : +0xb7fc12
>  -|[32] : +0xb8042a
>  
> After comparing with the java equivalent of FailoverTransport, we noticed that there was a discrepancy in how the disposed listener works. It seems that the disposed listener is not initialized at all, which leads to the transport listener not putting this fallback in place, leaving dangling pointers to itself in other elements in the chain. We now ensure that there is always a valid disposed listener which will serve as the sink for any leftover calls. This new patch is attached.
>  
> 26/11/2020 END OF UPDATE
> ___________________________________________
> We're getting the following segmentation faults on a productive system with 160 cores on a fairly regular basis:
> {quote}-|/export/..../lib/libactivemq-cpp.so.19
> {{-|[7] : activemq::transport::failover::FailoverTransportListener::onCommand(decaf::lang::Pointer<activemq::commands::Command, decaf::util::concurrent::atomic::AtomicRefCounter>)+0x175}}
>  {{ -|[8] : activemq::transport::TransportFilter::onCommand(decaf::lang::Pointer<activemq::commands::Command, decaf::util::concurrent::atomic::AtomicRefCounter>)+0xc2}}
>  {{ -|[9] : activemq::wireformat::openwire::OpenWireFormatNegotiator::onCommand(decaf::lang::Pointer<activemq::commands::Command, decaf::util::concurrent::atomic::AtomicRefCounter>)+0xe6}}
>  {{ -|[10] : activemq::transport::TransportFilter::onCommand(decaf::lang::Pointer<activemq::commands::Command, decaf::util::concurrent::atomic::AtomicRefCounter>)+0xc2}}
>  {{ -|[11] : activemq::transport::inactivity::InactivityMonitor::onCommand(decaf::lang::Pointer<activemq::commands::Command, decaf::util::concurrent::atomic::AtomicRefCounter>)+0x292}}
>  {{ -|[12] : activemq::transport::TransportFilter::onCommand(decaf::lang::Pointer<activemq::commands::Command, decaf::util::concurrent::atomic::AtomicRefCounter>)+0xc2}}
>  {{ -|[13] : activemq::transport::IOTransport::(fire(decaf::lang::Pointer<activemq::commands::Command, decaf::util::concurrent::atomic::AtomicRefCounter>)+0x9c}}
>  {{ -|[14] : activemq::transport::IOTransport::run()+0xb1}}
> {quote}
> We've narrowed the problem down to the call to Transport::setTransportListener(NULL) in TransportFilter::close() - it seems to be the same race condition reported with:
> [https://jira.apache.org/jira/browse/AMQCPP-534]
> and
> [https://jira.apache.org/jira/browse/AMQCPP-583]
>  I'm guessing the fix will also be similar - a basic patch suggestion is attached. You may want to put the static DefaultTransportListener in a common place.
> However, what is more worrying is that there appear to be several race conditions related to FailoverTransport::getTransportListener(). It provides mutex based access to the current listener, but all subsequent access to the raw pointer is unprotected. We had a concrete case in FailoverTransportListener::onCommand(). There are two calls to parent->getTransportListener() - one for the null ptr check, the second to invoke the command. In our seg fault, we observed that the first call returned a valid ptr, but the second call didn't (or returned null). Our explanation is that since getting and setting the TransportListener is protected by a mutex, another thread was able to get a hold of the mutex between the two calls and set it to FailoverTransport::setTransportListener(NULL).
> Perhaps a much better way of handling this situation would be to use shared_ptr instead of a raw pointer for:
> {{FailoverTransportImpl::TransportListener* transportListener;}}
> {{std::shared_ptr<TransportListener> FailoverTransport::getTransportListener()}}
> {{FailoverTransport::setTransportListener(std::shared_ptr<TransportListener> listener)}}
> This way, a thread could set a new listener, and any threads still referencing the old listener would gracefully fail. We looked into providing a patch for that, but realized that would be a much bigger change involving ActiveMQConnection deriving from std::enable_shared_from_this, which would also mean replacing decaf::lang::Pointer by std::shared_ptr... Without knowing why decaf::lang::Pointer exists in the first place, it would be difficult to proceed without additional clarifications.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)