You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues@activemq.apache.org by "Clebert Suconic (Jira)" <ji...@apache.org> on 2021/08/09 17:56:05 UTC

[jira] [Closed] (ARTEMIS-3337) CORE Client sessions are closed when there are errors during failover

     [ https://issues.apache.org/jira/browse/ARTEMIS-3337?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Clebert Suconic closed ARTEMIS-3337.
------------------------------------
    Fix Version/s: 2.18.0
       Resolution: Fixed

> CORE Client sessions are closed when there are errors during failover
> ---------------------------------------------------------------------
>
>                 Key: ARTEMIS-3337
>                 URL: https://issues.apache.org/jira/browse/ARTEMIS-3337
>             Project: ActiveMQ Artemis
>          Issue Type: Bug
>    Affects Versions: 2.17.0
>         Environment: {noformat}
>  {noformat}
>            Reporter: Markus Meierhofer
>            Priority: Major
>             Fix For: 2.18.0
>
>          Time Spent: 2.5h
>  Remaining Estimate: 0h
>
> h3. The issue
> When a CORE client does failover to a new connection, it usually transfers all sessions from the old to the new connection after it's able to establish the new connection. When the session transfer to the new connection fails, it assumes the new connection is also broken and does the same steps again (get new connection, transfer all sessions) until the reconnectAttempts have been reached. But after this failed session transfer, there seems to be a bug that sessions can get closed.
>  
> One occurrence of the bug:
> A client has lost connection due to bad Wifi and tried to reconnect. After reconnect, the sessions were transferred, but one of the sessions couldn't be transferred (Recognizable by the "failed to handle failover" error):
> {noformat}
> [ERROR 2021-05-27 00:24:08,175 l-threads)  org.apache.activemq.artemis.core.client]: AMQ214003: Failed to handle failover
> org.apache.activemq.artemis.api.core.ActiveMQConnectionTimedOutException: AMQ219014: Timed out after waiting 4,000 ms for response when sending packet 32
>         at org.apache.activemq.artemis.core.protocol.core.impl.ChannelImpl.sendBlocking(ChannelImpl.java:530)
>         at org.apache.activemq.artemis.core.protocol.core.impl.ChannelImpl.sendBlocking(ChannelImpl.java:434)
>         at org.apache.activemq.artemis.core.protocol.core.impl.ActiveMQSessionContext.reattachOnNewConnection(ActiveMQSessionContext.java:836)
>         at org.apache.activemq.artemis.core.client.impl.ClientSessionImpl.handleFailover(ClientSessionImpl.java:1393)
>         at org.apache.activemq.artemis.core.client.impl.ClientSessionFactoryImpl.reconnectSessions(ClientSessionFactoryImpl.java:806)
>         at org.apache.activemq.artemis.core.client.impl.ClientSessionFactoryImpl.failoverOrReconnect(ClientSessionFactoryImpl.java:638)
>         at org.apache.activemq.artemis.core.client.impl.ClientSessionFactoryImpl.handleConnectionFailure(ClientSessionFactoryImpl.java:525)
>         at org.apache.activemq.artemis.core.client.impl.ClientSessionFactoryImpl.access$600(ClientSessionFactoryImpl.java:74)
>         at org.apache.activemq.artemis.core.client.impl.ClientSessionFactoryImpl$DelegatingFailureListener.connectionFailed(ClientSessionFactoryImpl.java:1282)
>         at org.apache.activemq.artemis.spi.core.protocol.AbstractRemotingConnection.callFailureListeners(AbstractRemotingConnection.java:78)
>         at org.apache.activemq.artemis.core.protocol.core.impl.RemotingConnectionImpl.fail(RemotingConnectionImpl.java:220)
>         at org.apache.activemq.artemis.spi.core.protocol.AbstractRemotingConnection.fail(AbstractRemotingConnection.java:221)
>         at org.apache.activemq.artemis.spi.core.protocol.AbstractRemotingConnection.lambda$asyncFail$0(AbstractRemotingConnection.java:228)
>         at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
>         at org.apache.activemq.artemis.utils.actors.OrderedExecutor.doTask(OrderedExecutor.java:42)
>         at org.apache.activemq.artemis.utils.actors.OrderedExecutor.doTask(OrderedExecutor.java:31)
>         at org.apache.activemq.artemis.utils.actors.ProcessorBase.executePendingTasks(ProcessorBase.java:65)
>         at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
>         at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
>         at org.apache.activemq.artemis.utils.ActiveMQThreadFactory$1.run(ActiveMQThreadFactory.java:118){noformat}
> Afterwards, although the failover was still ongoing, a producer tried to send a message, but got the error "Connection is destroyed". This error is only printed if a session's underlying ChannelImpl is closed:
> {noformat}
> [ERROR 2021-05-27 00:24:15,526 l-threads) n.bridge.impl.service.BasicServiceServer]: JMS Exception sending service call reply! (Request queue: robot_2001.robot_command, Reply queue: ActiveMQTemporaryQueue[e375e12c-65c9-4646-9212-2edcc8b8587c], Request: RobotCommandRequest(info=SlotCommandInfo(super=CommandInfo(source=SYSTEM, id=1486607), gotoOptions=GotoOptions(fineposDirection=FORWARD ec=true LhdOptions( z=0.6) timeout=1800), transferOptions=TransferOptions(SECONDARY_SLOT loadUnits=[LoadUnit(code=, orderId=1486594, verify=false, empty=false, size=1)]), goalName=LMC_H1), type=unload_at, timestamp=2021-05-26T22:24:03.450405Z),Reply to send: RobotCommandResponse(id=1486607, result=REQUEST_TOO_OLD)):
> [ERROR 2021-05-27 00:24:15,526 l-threads) n.bridge.impl.service.BasicServiceServer]: AMQ219010: Connection is destroyed
> javax.jms.JMSException: AMQ219010: Connection is destroyed
>         at org.apache.activemq.artemis.core.protocol.core.impl.ChannelImpl.sendBlocking(ChannelImpl.java:460)
>         at org.apache.activemq.artemis.core.protocol.core.impl.ChannelImpl.sendBlocking(ChannelImpl.java:434)
>         at org.apache.activemq.artemis.core.protocol.core.impl.ActiveMQSessionContext.addressQuery(ActiveMQSessionContext.java:410)
>         at org.apache.activemq.artemis.core.client.impl.ClientSessionImpl.addressQuery(ClientSessionImpl.java:791)
>         at org.apache.activemq.artemis.jms.client.ActiveMQSession.checkDestination(ActiveMQSession.java:388)
>         at org.apache.activemq.artemis.jms.client.ActiveMQMessageProducer.doSendx(ActiveMQMessageProducer.java:406)
>         at org.apache.activemq.artemis.jms.client.ActiveMQMessageProducer.send(ActiveMQMessageProducer.java:220)
>         at org.apache.activemq.artemis.jms.client.ActiveMQMessageProducer.send(ActiveMQMessageProducer.java:207)
>         at com.incubedit.communication.bridge.impl.service.BasicServiceServer.sendResponse(BasicServiceServer.java:128)
>         at com.incubedit.communication.bridge.impl.service.ServiceServer.handleMessage(ServiceServer.java:33)
>         at com.incubedit.communication.bridge.impl.service.BasicServiceServer.onNewMessage(BasicServiceServer.java:112)
>         at org.apache.activemq.artemis.jms.client.JMSMessageListenerWrapper.onMessage(JMSMessageListenerWrapper.java:110)
>         at org.apache.activemq.artemis.core.client.impl.ClientConsumerImpl.callOnMessage(ClientConsumerImpl.java:1030)
>         at org.apache.activemq.artemis.core.client.impl.ClientConsumerImpl.access$400(ClientConsumerImpl.java:49)
>         at org.apache.activemq.artemis.core.client.impl.ClientConsumerImpl$Runner.run(ClientConsumerImpl.java:1153)
>         at org.apache.activemq.artemis.utils.actors.OrderedExecutor.doTask(OrderedExecutor.java:42)
>         at org.apache.activemq.artemis.utils.actors.OrderedExecutor.doTask(OrderedExecutor.java:31)
>         at org.apache.activemq.artemis.utils.actors.ProcessorBase.executePendingTasks(ProcessorBase.java:65)
>         at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
>         at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
>         at org.apache.activemq.artemis.utils.ActiveMQThreadFactory$1.run(ActiveMQThreadFactory.java:118)
> Caused by: org.apache.activemq.artemis.api.core.ActiveMQNotConnectedException: AMQ219010: Connection is destroyed
>         ... 21 common frames omitted{noformat}
> ~12 seconds later, the connection "successfully" failovered:
> {noformat}
> [ERROR 2021-05-27 00:24:27,948 ad-6433287 nt.control.RestartAgentExceptionListener]: JMS exception listener retrieved exception!
> javax.jms.JMSException: ActiveMQConnectionTimedOutException[errorType=CONNECTION_TIMEDOUT message=AMQ219014: Timed out after waiting 4,000 ms for response when sending packet 45]
>         at org.apache.activemq.artemis.jms.client.ActiveMQConnection$JMSFailureListener.connectionFailed(ActiveMQConnection.java:750)
>         at org.apache.activemq.artemis.jms.client.ActiveMQConnection$JMSFailureListener.connectionFailed(ActiveMQConnection.java:771)
>         at org.apache.activemq.artemis.core.client.impl.ClientSessionFactoryImpl.callSessionFailureListeners(ClientSessionFactoryImpl.java:736)
>         at org.apache.activemq.artemis.core.client.impl.ClientSessionFactoryImpl.callSessionFailureListeners(ClientSessionFactoryImpl.java:724)
>         at org.apache.activemq.artemis.core.client.impl.ClientSessionFactoryImpl.failoverOrReconnect(ClientSessionFactoryImpl.java:680)
>         at org.apache.activemq.artemis.core.client.impl.ClientSessionFactoryImpl.handleConnectionFailure(ClientSessionFactoryImpl.java:525)
>         at org.apache.activemq.artemis.core.client.impl.ClientSessionFactoryImpl.access$600(ClientSessionFactoryImpl.java:74)
>         at org.apache.activemq.artemis.core.client.impl.ClientSessionFactoryImpl$DelegatingFailureListener.connectionFailed(ClientSessionFactoryImpl.java:1282)
>         at org.apache.activemq.artemis.spi.core.protocol.AbstractRemotingConnection.callFailureListeners(AbstractRemotingConnection.java:78)
>         at org.apache.activemq.artemis.core.protocol.core.impl.RemotingConnectionImpl.fail(RemotingConnectionImpl.java:220)
>         at org.apache.activemq.artemis.spi.core.protocol.AbstractRemotingConnection.fail(AbstractRemotingConnection.java:221)
>         at org.apache.activemq.artemis.spi.core.protocol.AbstractRemotingConnection.lambda$asyncFail$0(AbstractRemotingConnection.java:228)
>         at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
>         at org.apache.activemq.artemis.utils.actors.OrderedExecutor.doTask(OrderedExecutor.java:42)
>         at org.apache.activemq.artemis.utils.actors.OrderedExecutor.doTask(OrderedExecutor.java:31)
>         at org.apache.activemq.artemis.utils.actors.ProcessorBase.executePendingTasks(ProcessorBase.java:65)
>         at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
>         at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
>         at org.apache.activemq.artemis.utils.ActiveMQThreadFactory$1.run(ActiveMQThreadFactory.java:118)
> Caused by: org.apache.activemq.artemis.api.core.ActiveMQConnectionTimedOutException: AMQ219014: Timed out after waiting 4,000 ms for response when sending packet 45
>         at org.apache.activemq.artemis.core.protocol.core.impl.ChannelImpl.sendBlocking(ChannelImpl.java:530)
>         at org.apache.activemq.artemis.core.protocol.core.impl.ChannelImpl.sendBlocking(ChannelImpl.java:434)
>         at org.apache.activemq.artemis.core.protocol.core.impl.ActiveMQSessionContext.queueQuery(ActiveMQSessionContext.java:359)
>         at org.apache.activemq.artemis.core.client.impl.ClientSessionImpl.queueQuery(ClientSessionImpl.java:780)
>         at org.apache.activemq.artemis.jms.client.ActiveMQSession.checkDestination(ActiveMQSession.java:409)
>         at org.apache.activemq.artemis.jms.client.ActiveMQMessageProducer.doSendx(ActiveMQMessageProducer.java:406)
>         at org.apache.activemq.artemis.jms.client.ActiveMQMessageProducer.send(ActiveMQMessageProducer.java:220)
>         at org.apache.activemq.artemis.jms.client.ActiveMQMessageProducer.send(ActiveMQMessageProducer.java:207)
>         at com.incubedit.communication.bridge.impl.service.BasicServiceServer.sendResponse(BasicServiceServer.java:128)
>         at com.incubedit.communication.bridge.impl.service.ServiceServer.handleMessage(ServiceServer.java:33)
>         at com.incubedit.communication.bridge.impl.service.BasicServiceServer.onNewMessage(BasicServiceServer.java:112)
>         at org.apache.activemq.artemis.jms.client.JMSMessageListenerWrapper.onMessage(JMSMessageListenerWrapper.java:110)
>         at org.apache.activemq.artemis.core.client.impl.ClientConsumerImpl.callOnMessage(ClientConsumerImpl.java:1030)
>         at org.apache.activemq.artemis.core.client.impl.ClientConsumerImpl.access$400(ClientConsumerImpl.java:49)
>         at org.apache.activemq.artemis.core.client.impl.ClientConsumerImpl$Runner.run(ClientConsumerImpl.java:1153)
>         ... 6 common frames omitted
> [ERROR 2021-05-27 00:24:27,948 ad-6433287 nt.control.RestartAgentExceptionListener]: JMS exception error code is: FAILOVER{noformat}
> But even after the failover, every time a message was tried to be sent on that session, it failed with "Connection is destroyed". Other sessions worked fine after the failover:
> {noformat}
> [ERROR 2021-05-27 00:24:28,511 l-threads) n.bridge.impl.service.BasicServiceServer]: JMS Exception sending service call reply! (Request queue: robot_2001.robot_command, Reply queue: ActiveMQTemporaryQueue[e375e12c-65c9-4646-9212-2edcc8b8587c], Request: RobotCommandRequest(info=GotoCommandInfo(super=CommandInfo(source=SYSTEM, id=1487919), gotoOptions=GotoOptions(fineposDirection=FORWARD ec=true LhdOptions( z=0.797) timeout=1800), goalName=WH_Conveyor_Box_Put), type=goto, timestamp=2021-05-26T22:24:28.508373Z),Reply to send: RobotCommandResponse(id=1487919, result=SUCCESS)):
> [ERROR 2021-05-27 00:24:28,511 l-threads) n.bridge.impl.service.BasicServiceServer]: AMQ219010: Connection is destroyed
> javax.jms.JMSException: AMQ219010: Connection is destroyed
>         at org.apache.activemq.artemis.core.protocol.core.impl.ChannelImpl.sendBlocking(ChannelImpl.java:460)
>         at org.apache.activemq.artemis.core.protocol.core.impl.ChannelImpl.sendBlocking(ChannelImpl.java:434)
>         at org.apache.activemq.artemis.core.protocol.core.impl.ActiveMQSessionContext.addressQuery(ActiveMQSessionContext.java:410)
>         at org.apache.activemq.artemis.core.client.impl.ClientSessionImpl.addressQuery(ClientSessionImpl.java:791)
>         at org.apache.activemq.artemis.jms.client.ActiveMQSession.checkDestination(ActiveMQSession.java:388)
>         at org.apache.activemq.artemis.jms.client.ActiveMQMessageProducer.doSendx(ActiveMQMessageProducer.java:406)
>         at org.apache.activemq.artemis.jms.client.ActiveMQMessageProducer.send(ActiveMQMessageProducer.java:220)
>         at org.apache.activemq.artemis.jms.client.ActiveMQMessageProducer.send(ActiveMQMessageProducer.java:207)
>         at com.incubedit.communication.bridge.impl.service.BasicServiceServer.sendResponse(BasicServiceServer.java:128)
>         at com.incubedit.communication.bridge.impl.service.ServiceServer.handleMessage(ServiceServer.java:33)
>         at com.incubedit.communication.bridge.impl.service.BasicServiceServer.onNewMessage(BasicServiceServer.java:112)
>         at org.apache.activemq.artemis.jms.client.JMSMessageListenerWrapper.onMessage(JMSMessageListenerWrapper.java:110)
>         at org.apache.activemq.artemis.core.client.impl.ClientConsumerImpl.callOnMessage(ClientConsumerImpl.java:1030)
>         at org.apache.activemq.artemis.core.client.impl.ClientConsumerImpl.access$400(ClientConsumerImpl.java:49)
>         at org.apache.activemq.artemis.core.client.impl.ClientConsumerImpl$Runner.run(ClientConsumerImpl.java:1153)
>         at org.apache.activemq.artemis.utils.actors.OrderedExecutor.doTask(OrderedExecutor.java:42)
>         at org.apache.activemq.artemis.utils.actors.OrderedExecutor.doTask(OrderedExecutor.java:31)
>         at org.apache.activemq.artemis.utils.actors.ProcessorBase.executePendingTasks(ProcessorBase.java:65)
>         at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
>         at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
>         at org.apache.activemq.artemis.utils.ActiveMQThreadFactory$1.run(ActiveMQThreadFactory.java:118)
> Caused by: org.apache.activemq.artemis.api.core.ActiveMQNotConnectedException: AMQ219010: Connection is destroyed
>         ... 21 common frames omitted
> {noformat}
> h3. Cause of the issue
> After reconnect, every session's "handleFailover" method is called to transfer the session to the new connection. But if handleFailover fails for one session, the reconnect immediately returns and tries again:
>  
> {noformat}
> for (ClientSessionInternal session : sessionsToFailover) {
>          if (!session.handleFailover(connection, cause)) {
>             return false;
>          }
> }{noformat}
>  
> The issue here is that "handleFailover" is not executed for every session:
>  * During handleFailover, "transferConnection" is called on the session's ChannelImpl, which removes the session's channel from the old connection and puts it onto the new one. If transferConnection is not executed for a session, this session's channel stays on the old connection.
>  * During connection destroy, every channel set on that connection is closed via Channel.close
>  * Now if one of the sessions wouldn't successfully execute handleFailover, the method would immediately return and not remove the remaining session's channels from the old connection. And as before reconnecting, the old connection (which still contains some session channels) is destroyed, those session's channels are also closed.
> Even if handleFailover is afterwards successfully executed on those sessions, the "closed" flag is already set on their channels, so they will reject future sends.
> h3. Possible solution
> It needs to be ensured that all session's channels are transferred (=removed from the old connection) before the old connection is destroyed, even if an earlier session failed to be transferred on the server-side.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)