Posted to issues@activemq.apache.org by "Gary Tully (Jira)" <ji...@apache.org> on 2020/03/31 12:10:00 UTC

[jira] [Commented] (AMQ-7455) Blocking XA Periodic Recovery and Container Operations when deploying ActiveMQ RA 5.15.x to JBoss EAP 6.4

    [ https://issues.apache.org/jira/browse/AMQ-7455?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17071719#comment-17071719 ] 

Gary Tully commented on AMQ-7455:
---------------------------------

Note the latest change for AMQ-5790, which retains a single connection for the RAR deployment; this avoids the connection churn.

> Blocking XA Periodic Recovery and Container Operations when deploying ActiveMQ RA 5.15.x to JBoss EAP 6.4
> ---------------------------------------------------------------------------------------------------------
>
>                 Key: AMQ-7455
>                 URL: https://issues.apache.org/jira/browse/AMQ-7455
>             Project: ActiveMQ
>          Issue Type: Bug
>          Components: RAR
>    Affects Versions: 5.15.11, 5.15.12
>            Reporter: Markus
>            Priority: Major
>              Labels: newbie, performance
>          Time Spent: 10m
>  Remaining Estimate: 0h
>
> *Scenario*: deploy the ActiveMQ RA to EAP 6.4 with XA-enabled DataSources for JDBC and ActiveMQ Artemis as the JMS broker in a failover config (active/passive), and run several applications that use JMS and JDBC datasources. (In our scenario, there are 4 XA JDBC DataSources and 42 deployed WAR files.) The ActiveMQ resource adapter is used because the Artemis one is not compatible with this EAP version.
> JBoss EAP and Artemis are running on the same host in the examples below, but this does not matter. Artemis is running with example HA-by-network-replication settings on localhost:61616 and localhost:61617.
> The resource adapter settings on JBoss EAP side and on client side (especially properties of the broker URIs for ActiveMQ's connection factory) are exactly as described in [Red Hat's reference documentation for HA config|https://access.redhat.com/documentation/en-us/red_hat_jboss_a-mq/6.1/html-single/integrating_with_jboss_enterprise_application_platform/index]:
>  * Add {{?jms.rmIdFromConnectionId=true}} to the connection URI
>  * {{<is-same-rm-override>false</is-same-rm-override>}} in standalone.xml
> *Observed Symptoms*: On a regular basis, _undeployment of WAR files takes minutes (around 10) instead of seconds_, blocking other container management operations during this timeframe. This behavior can be traced back to running the ActiveMQ RA with Artemis as the JMS broker. When running the same WAR files with the same config, but using JBoss EAP's internal HornetQ broker, these delays do not occur and undeployments take mere seconds.
> *Further Investigation*: This only affects Java EE modules that define at least one MDB (message driven bean). During shutdown of the beans, the container management thread is blocked while de-registering the bean's XA Resource; see the following stack trace, taken while a simple module is being undeployed, for reference:
> {noformat}
> "ServerService Thread Pool -- 277" #3709 prio=5 os_prio=0 tid=0x000000001d6e0000 nid=0x4df4 in Object.wait() [0x000000007c41e000]
>    java.lang.Thread.State: WAITING (on object monitor)
> 	at java.lang.Object.wait(Native Method)
> 	at java.lang.Object.wait(Object.java:502)
> 	at com.arjuna.ats.internal.jta.recovery.arjunacore.XARecoveryModule.waitForScanState(XARecoveryModule.java:935)
> 	at com.arjuna.ats.internal.jta.recovery.arjunacore.XARecoveryModule.removeXAResourceRecoveryHelper(XARecoveryModule.java:92)
> 	- locked <0x000000008162bd38> (a java.util.concurrent.atomic.AtomicInteger)
> 	at com.arjuna.ats.jbossatx.jta.RecoveryManagerService.removeXAResourceRecovery(RecoveryManagerService.java:117)
> 	at org.jboss.jca.core.tx.jbossts.XAResourceRecoveryRegistryImpl.removeXAResourceRecovery(XAResourceRecoveryRegistryImpl.java:63)
> 	at org.jboss.jca.core.rar.EndpointImpl.deactivate(EndpointImpl.java:226)
> 	at org.jboss.as.ejb3.component.messagedriven.MessageDrivenComponent.deactivate(MessageDrivenComponent.java:224)
> 	at org.jboss.as.ejb3.component.messagedriven.MessageDrivenComponent.stop(MessageDrivenComponent.java:195)
> // and a couple of more elements
> {noformat}
> During these 10 minutes of blocking while the module is supposed to be undeployed, EAP's log is spammed with thousands of lines like these:
> {noformat}
> 11:19:43,640 INFO  [org.apache.activemq.transport.failover.FailoverTransport] (ActiveMQ Task-1) Successfully connected to tcp://localhost:61616
> 11:19:43,643 INFO  [org.apache.activemq.transport.failover.FailoverTransport] (ActiveMQ Task-1) Successfully connected to tcp://localhost:61616
> 11:19:43,647 INFO  [org.apache.activemq.transport.failover.FailoverTransport] (ActiveMQ Task-1) Successfully connected to tcp://localhost:61616
> 11:19:43,657 INFO  [org.apache.activemq.transport.failover.FailoverTransport] (ActiveMQ Task-1) Successfully connected to tcp://localhost:61616
> 11:19:43,661 INFO  [org.apache.activemq.transport.failover.FailoverTransport] (ActiveMQ Task-1) Successfully connected to tcp://localhost:61616
> 11:19:43,664 INFO  [org.apache.activemq.transport.failover.FailoverTransport] (ActiveMQ Task-1) Successfully connected to tcp://localhost:61616
> {noformat}
> Only after these lines stop being logged does the undeployment process continue.
> *What happens during this time in ActiveMQ:* Obviously, thousands of connections are established to Artemis during this time. Tracing back the code path that leads to these calls points to EAP's Periodic XA Resource Recovery thread; see the following full stack trace (taken at the same point in time):
> {noformat}
> "Periodic Recovery" #96 prio=5 os_prio=0 tid=0x000000002484a000 nid=0x1b28 waiting for monitor entry [0x0000000028dbe000]
>    java.lang.Thread.State: BLOCKED (on object monitor)
> 	at java.lang.Object.wait(Native Method)
> 	at org.apache.activemq.transport.failover.FailoverTransport.oneway(FailoverTransport.java:623)
> 	- locked <0x00000000cc69eeb8> (a java.lang.Object)
> 	at org.apache.activemq.transport.MutexTransport.oneway(MutexTransport.java:68)
> 	at org.apache.activemq.transport.ResponseCorrelator.asyncRequest(ResponseCorrelator.java:81)
> 	at org.apache.activemq.transport.ResponseCorrelator.request(ResponseCorrelator.java:86)
> 	at org.apache.activemq.ActiveMQConnection.syncSendPacket(ActiveMQConnection.java:1392)
> 	at org.apache.activemq.ActiveMQConnection.ensureConnectionInfoSent(ActiveMQConnection.java:1486)
> 	- locked <0x00000000cc69ff10> (a java.lang.Object)
> 	at org.apache.activemq.ActiveMQConnection.start(ActiveMQConnection.java:527)
> 	at org.apache.activemq.ra.ActiveMQResourceAdapter$2.newConnection(ActiveMQResourceAdapter.java:361)
> 	at org.apache.activemq.ra.ActiveMQResourceAdapter$2.getResourceManagerId(ActiveMQResourceAdapter.java:278)
> 	at org.apache.activemq.TransactionContext.isSameRM(TransactionContext.java:666)
> 	at org.apache.activemq.ra.ActiveMQResourceAdapter$2.isSameRM(ActiveMQResourceAdapter.java:261)
> 	at org.jboss.jca.core.tx.jbossts.XAResourceWrapperImpl.isSameRM(XAResourceWrapperImpl.java:159)
> 	at com.arjuna.ats.internal.jta.recovery.arjunacore.RecoveryXids.isSameRM(RecoveryXids.java:143)
> 	at com.arjuna.ats.internal.jta.recovery.arjunacore.RecoveryXids.updateIfEquivalentRM(RecoveryXids.java:215)
> 	at com.arjuna.ats.internal.jta.recovery.arjunacore.XARecoveryModule.refreshXidScansForEquivalentXAResourceImpl(XARecoveryModule.java:796)
> 	at com.arjuna.ats.internal.jta.recovery.arjunacore.XARecoveryModule.xaRecoveryFirstPass(XARecoveryModule.java:570)
> 	at com.arjuna.ats.internal.jta.recovery.arjunacore.XARecoveryModule.periodicWorkFirstPass(XARecoveryModule.java:181)
> 	- locked <0x000000008162ba98> (a com.arjuna.ats.internal.jta.recovery.arjunacore.XARecoveryModule)
> 	at com.arjuna.ats.internal.arjuna.recovery.PeriodicRecovery.doWorkInternal(PeriodicRecovery.java:747)
> 	at com.arjuna.ats.internal.arjuna.recovery.PeriodicRecovery.run(PeriodicRecovery.java:375){noformat}
> From this stack trace I want to point out this line in particular: {{org.apache.activemq.TransactionContext.isSameRM()}}
> This method is part of the XA Resource contract. Simply put, the periodic recovery process iterates over XA Resources and looks for dormant XA transactions that need to be finished in one way or another. Here it checks different combinations of XA Resources to assess whether they came from the same Resource Manager ({{.isSameRM()}}).
>  During this recovery process (which in the stack trace above is still in the phase of gathering basic information and not actually working on XA transactions), the deregistration of the message-driven beans is blocked.
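> For reference, this is the relevant part of the standard {{javax.transaction.xa.XAResource}} contract that the recovery pass relies on (abridged excerpt for illustration only, not ActiveMQ code):
> {noformat}
> // javax.transaction.xa.XAResource (JTA), abridged -- the calls the recovery thread makes
> public interface XAResource {
>     // true only if both resources talk to the same resource manager;
>     // recovery uses this to group resources before recovering Xids
>     boolean isSameRM(XAResource xares) throws XAException;
>     // returns the list of prepared/in-doubt transaction branches
>     Xid[] recover(int flag) throws XAException;
>     // start, end, prepare, commit, rollback, forget, ... omitted
> }
> {noformat}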
> *Analysis for the super long duration of the recovery process*: The {{XAResource}} instances that periodic recovery works on are created by {{org.apache.activemq.ra.ActiveMQResourceAdapter.getXAResources(ActivationSpec[])}}. The anonymous class that extends {{org.apache.activemq.TransactionContext}} and that is instantiated here basically performs the following steps for each invocation of any of the XA Resource contract methods (a simplified sketch follows the list):
>  # Replace the current connection to Artemis with a freshly created one
>  # Call {{*super*}}
>  # Restore the current connection and close the new one
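> A minimal sketch of this wrapper pattern, just to make the shape explicit (illustrative only, not the actual ActiveMQ source; helper names like {{makeConnection()}} and {{setCurrentConnection()}} are placeholders, exception handling omitted):
> {noformat}
> // Illustrative sketch -- NOT the real code in ActiveMQResourceAdapter.getXAResources().
> // Every XAResource method of the anonymous TransactionContext subclass roughly does:
> public boolean isSameRM(XAResource other) throws XAException {
>     ActiveMQConnection previous = getCurrentConnection();  // placeholder helper
>     ActiveMQConnection fresh = makeConnection();            // 1. brand-new (failover) connection
>     try {
>         setCurrentConnection(fresh);
>         return super.isSameRM(other);                       // 2. delegate to TransactionContext
>     } finally {
>         fresh.close();                                       // 3. discard the fresh connection ...
>         setCurrentConnection(previous);                      //    ... and restore the previous one
>     }
> }
> {noformat}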
> Specifically for {{isSameRM}} this results in the following chain of events:
>  # Create a pristine connection to remote Artemis (in our case a FailoverTransport)
>  # Call {{*super*.isSameRM(xaresource)}}
>  # If {{xaresource}} is {{*null*}} or is not an instance of {{TransactionContext}}, return false ({{org/apache/activemq/TransactionContext.java:658}})
>  # Get the current resource manager id and {{xaresource}}'s resource manager id. This effectively calls {{*this*.getResourceManagerId()}}. As noted above, this leads to the following calls:
>  # Replace the current connection to Artemis (established in step 1) with a second pristine one
>  # Call {{*super*.getResourceManagerId()}}
>  # Call {{getResourceManagerId()}} on the connection of step 5
>  # In {{org/apache/activemq/ActiveMQConnection.java:1622}}, the resource manager id effectively is the Connection Id in our setup (according to the flag set in the connection URI).
>  # This id is _unique for each connection_ (see {{org/apache/activemq/ActiveMQConnection.java:240}})
>  # Compare the resulting value with the resource manager id of xaresource and return the result
>  # Close the connection to the JMS broker established in step 5 and set the one established in step 1 as "current" connection
>  # Close the connection established in step 1 and reset the current connection to whatever was there before (in my understanding that is always {{*null*}}, as the instances are always freshly created inside {{ActiveMQResourceAdapter.getXAResources}})
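> Put differently, a single recovery-time {{isSameRM()}} check between two ActiveMQ-backed resources ends up looking roughly like this (illustrative call trace only, reusing the placeholder names from the sketch above):
> {noformat}
> wrapper.isSameRM(other)
>   -> open fresh connection A to the broker             // step 1, logs "Successfully connected to ..."
>   -> TransactionContext.isSameRM(other)
>        -> this.getResourceManagerId()
>             -> open fresh connection B to the broker   // step 5, logs another "Successfully connected ..."
>             -> rm id == connection id of B             // unique per connection with jms.rmIdFromConnectionId=true
>             -> close connection B, restore A           // step 11
>   -> compare rm ids -> false                           // ids from different connections never match
>   -> close connection A, reset current connection      // step 12
> {noformat}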
> This is repeated for each pair of XA Resources, or rather Activation Specs, that the recovery manager encounters. The log spamming stems from steps 1 and 5, because each time an INFO message is logged that tells us a new connection has been established to the server.
> This also happens when the recovery process wants to know if a JDBC connection to our SQL server (which we have plenty of due to several modules with a connection pooling data source each) is from the same resource manager as a JMS connection to Artemis - which it obviously never is.
> *Problematic Aspects of this code*:
>  # It doesn't make sense to establish pristine connections to the remote broker for each invocation of one of {{XAResource}}'s methods. The connection setup is expensive with a plain non-failover URI as it is, but even more so with a failover transport; this is supported by tests, and the code shows it as well.
>  # If there is a way to answer {{isSameRM()}} without having to do any work, it should be taken as early as possible.
>  # If the connection factory is configured with {{rmIdFromConnectionId}} enabled, then {{isSameRM()}} in my understanding will never return {{*true*}}.
> *Possible solutions for each of these aspects*:
>  # In order to change the basic pattern of create-connection-then-do-something-then-close-connection that happens inside {{ActiveMQResourceAdapter.getXAResources()}}, a big portion of this code would probably have to be re-written.
>  # In the current code base, the instanceof check (see step 3 above) can be extracted and inlined into the anonymous inner class defined inside {{ActiveMQResourceAdapter.getXAResources()}}.
>  # I don't know about the semantic correctness of this, i.e. whether this behavior is desired; but if it is, and all the code I read strengthens this assumption, then this is another candidate for short-circuiting the expensive connection creation: if this config flag is set on the connection factory, directly return {{*false*}} (sketched below).
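> A minimal sketch of what such a short-circuit for solutions 2 and 3 could look like inside the anonymous class (illustrative only; {{isRmIdFromConnectionId()}} and {{connectAndDelegate()}} are hypothetical names, not existing ActiveMQ methods):
> {noformat}
> // Illustrative sketch only -- not the linked pull request. The point is to answer
> // isSameRM() before any connection to the broker is created:
> @Override
> public boolean isSameRM(XAResource xaResource) throws XAException {
>     // solution 2: the null/instanceof check from TransactionContext, pulled forward
>     if (!(xaResource instanceof TransactionContext)) {
>         return false;
>     }
>     // solution 3: with jms.rmIdFromConnectionId=true the rm id is unique per connection,
>     // so two resources can never report the same resource manager
>     if (isRmIdFromConnectionId()) {
>         return false;
>     }
>     // otherwise fall back to the existing (expensive) connect-and-compare path
>     return connectAndDelegate(xaResource);
> }
> {noformat}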
> So, basically, solutions 2 and 3 somewhat mitigate the antipattern(?) described in 1, and only this aspect is addressed by the pull request that is linked to this ticket.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)