You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues@activemq.apache.org by "ASF subversion and git services (Jira)" <ji...@apache.org> on 2021/06/14 20:37:00 UTC

[jira] [Commented] (ARTEMIS-3337) CORE Client sessions are closed when there are errors during failover

    [ https://issues.apache.org/jira/browse/ARTEMIS-3337?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17363197#comment-17363197 ] 

ASF subversion and git services commented on ARTEMIS-3337:
----------------------------------------------------------

Commit 3b1f6eee7d942706f40b4e35f1dc9dd0537b42a7 in activemq-artemis's branch refs/heads/main from Markus Meierhofer
[ https://gitbox.apache.org/repos/asf?p=activemq-artemis.git;h=3b1f6ee ]

ARTEMIS-3337: Correctly handle multiple connection failures

Previously, when during reconnect one session couldn't be transferred
to the new connection, we instantly returned and didn't execute failover
for the other sessions. This produced the issue that for sessions
where no failover was executed, their channels were still present on the
old connection. When the old connection was then destroyed, these channels
were closed although the reconnect was still ongoing, which lead to
"dead" sessions.

Now, if a session failover fails, for the remaining sessions the "client-side" part
of failover is executed, which removes the sessions from the old connection so that
they are not closed when the old connection is closed afterwards.


> CORE Client sessions are closed when there are errors during failover
> ---------------------------------------------------------------------
>
>                 Key: ARTEMIS-3337
>                 URL: https://issues.apache.org/jira/browse/ARTEMIS-3337
>             Project: ActiveMQ Artemis
>          Issue Type: Bug
>    Affects Versions: 2.17.0
>         Environment: {noformat}
>  {noformat}
>            Reporter: Markus Meierhofer
>            Priority: Major
>          Time Spent: 1h 50m
>  Remaining Estimate: 0h
>
> h3. The issue
> When a CORE client does failover to a new connection, it usually transfers all sessions from the old to the new connection after it's able to establish the new connection. When the session transfer to the new connection fails, it assumes the new connection is also broken and does the same steps again (get new connection, transfer all sessions) until the reconnectAttempts have been reached. But after this failed session transfer, there seems to be a bug that sessions can get closed.
>  
> One occurrence of the bug:
> A client has lost connection due to bad Wifi and tried to reconnect. After reconnect, the sessions were transferred, but one of the sessions couldn't be transferred (Recognizable by the "failed to handle failover" error):
> {noformat}
> [ERROR 2021-05-27 00:24:08,175 l-threads)  org.apache.activemq.artemis.core.client]: AMQ214003: Failed to handle failover
> org.apache.activemq.artemis.api.core.ActiveMQConnectionTimedOutException: AMQ219014: Timed out after waiting 4,000 ms for response when sending packet 32
>         at org.apache.activemq.artemis.core.protocol.core.impl.ChannelImpl.sendBlocking(ChannelImpl.java:530)
>         at org.apache.activemq.artemis.core.protocol.core.impl.ChannelImpl.sendBlocking(ChannelImpl.java:434)
>         at org.apache.activemq.artemis.core.protocol.core.impl.ActiveMQSessionContext.reattachOnNewConnection(ActiveMQSessionContext.java:836)
>         at org.apache.activemq.artemis.core.client.impl.ClientSessionImpl.handleFailover(ClientSessionImpl.java:1393)
>         at org.apache.activemq.artemis.core.client.impl.ClientSessionFactoryImpl.reconnectSessions(ClientSessionFactoryImpl.java:806)
>         at org.apache.activemq.artemis.core.client.impl.ClientSessionFactoryImpl.failoverOrReconnect(ClientSessionFactoryImpl.java:638)
>         at org.apache.activemq.artemis.core.client.impl.ClientSessionFactoryImpl.handleConnectionFailure(ClientSessionFactoryImpl.java:525)
>         at org.apache.activemq.artemis.core.client.impl.ClientSessionFactoryImpl.access$600(ClientSessionFactoryImpl.java:74)
>         at org.apache.activemq.artemis.core.client.impl.ClientSessionFactoryImpl$DelegatingFailureListener.connectionFailed(ClientSessionFactoryImpl.java:1282)
>         at org.apache.activemq.artemis.spi.core.protocol.AbstractRemotingConnection.callFailureListeners(AbstractRemotingConnection.java:78)
>         at org.apache.activemq.artemis.core.protocol.core.impl.RemotingConnectionImpl.fail(RemotingConnectionImpl.java:220)
>         at org.apache.activemq.artemis.spi.core.protocol.AbstractRemotingConnection.fail(AbstractRemotingConnection.java:221)
>         at org.apache.activemq.artemis.spi.core.protocol.AbstractRemotingConnection.lambda$asyncFail$0(AbstractRemotingConnection.java:228)
>         at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
>         at org.apache.activemq.artemis.utils.actors.OrderedExecutor.doTask(OrderedExecutor.java:42)
>         at org.apache.activemq.artemis.utils.actors.OrderedExecutor.doTask(OrderedExecutor.java:31)
>         at org.apache.activemq.artemis.utils.actors.ProcessorBase.executePendingTasks(ProcessorBase.java:65)
>         at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
>         at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
>         at org.apache.activemq.artemis.utils.ActiveMQThreadFactory$1.run(ActiveMQThreadFactory.java:118){noformat}
> Afterwards, although the failover was still ongoing, a producer tried to send a message, but got the error "Connection is destroyed". This error is only printed if a session's underlying ChannelImpl is closed:
> {noformat}
> [ERROR 2021-05-27 00:24:15,526 l-threads) n.bridge.impl.service.BasicServiceServer]: JMS Exception sending service call reply! (Request queue: robot_2001.robot_command, Reply queue: ActiveMQTemporaryQueue[e375e12c-65c9-4646-9212-2edcc8b8587c], Request: RobotCommandRequest(info=SlotCommandInfo(super=CommandInfo(source=SYSTEM, id=1486607), gotoOptions=GotoOptions(fineposDirection=FORWARD ec=true LhdOptions( z=0.6) timeout=1800), transferOptions=TransferOptions(SECONDARY_SLOT loadUnits=[LoadUnit(code=, orderId=1486594, verify=false, empty=false, size=1)]), goalName=LMC_H1), type=unload_at, timestamp=2021-05-26T22:24:03.450405Z),Reply to send: RobotCommandResponse(id=1486607, result=REQUEST_TOO_OLD)):
> [ERROR 2021-05-27 00:24:15,526 l-threads) n.bridge.impl.service.BasicServiceServer]: AMQ219010: Connection is destroyed
> javax.jms.JMSException: AMQ219010: Connection is destroyed
>         at org.apache.activemq.artemis.core.protocol.core.impl.ChannelImpl.sendBlocking(ChannelImpl.java:460)
>         at org.apache.activemq.artemis.core.protocol.core.impl.ChannelImpl.sendBlocking(ChannelImpl.java:434)
>         at org.apache.activemq.artemis.core.protocol.core.impl.ActiveMQSessionContext.addressQuery(ActiveMQSessionContext.java:410)
>         at org.apache.activemq.artemis.core.client.impl.ClientSessionImpl.addressQuery(ClientSessionImpl.java:791)
>         at org.apache.activemq.artemis.jms.client.ActiveMQSession.checkDestination(ActiveMQSession.java:388)
>         at org.apache.activemq.artemis.jms.client.ActiveMQMessageProducer.doSendx(ActiveMQMessageProducer.java:406)
>         at org.apache.activemq.artemis.jms.client.ActiveMQMessageProducer.send(ActiveMQMessageProducer.java:220)
>         at org.apache.activemq.artemis.jms.client.ActiveMQMessageProducer.send(ActiveMQMessageProducer.java:207)
>         at com.incubedit.communication.bridge.impl.service.BasicServiceServer.sendResponse(BasicServiceServer.java:128)
>         at com.incubedit.communication.bridge.impl.service.ServiceServer.handleMessage(ServiceServer.java:33)
>         at com.incubedit.communication.bridge.impl.service.BasicServiceServer.onNewMessage(BasicServiceServer.java:112)
>         at org.apache.activemq.artemis.jms.client.JMSMessageListenerWrapper.onMessage(JMSMessageListenerWrapper.java:110)
>         at org.apache.activemq.artemis.core.client.impl.ClientConsumerImpl.callOnMessage(ClientConsumerImpl.java:1030)
>         at org.apache.activemq.artemis.core.client.impl.ClientConsumerImpl.access$400(ClientConsumerImpl.java:49)
>         at org.apache.activemq.artemis.core.client.impl.ClientConsumerImpl$Runner.run(ClientConsumerImpl.java:1153)
>         at org.apache.activemq.artemis.utils.actors.OrderedExecutor.doTask(OrderedExecutor.java:42)
>         at org.apache.activemq.artemis.utils.actors.OrderedExecutor.doTask(OrderedExecutor.java:31)
>         at org.apache.activemq.artemis.utils.actors.ProcessorBase.executePendingTasks(ProcessorBase.java:65)
>         at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
>         at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
>         at org.apache.activemq.artemis.utils.ActiveMQThreadFactory$1.run(ActiveMQThreadFactory.java:118)
> Caused by: org.apache.activemq.artemis.api.core.ActiveMQNotConnectedException: AMQ219010: Connection is destroyed
>         ... 21 common frames omitted{noformat}
> ~12 seconds later, the connection "successfully" failovered:
> {noformat}
> [ERROR 2021-05-27 00:24:27,948 ad-6433287 nt.control.RestartAgentExceptionListener]: JMS exception listener retrieved exception!
> javax.jms.JMSException: ActiveMQConnectionTimedOutException[errorType=CONNECTION_TIMEDOUT message=AMQ219014: Timed out after waiting 4,000 ms for response when sending packet 45]
>         at org.apache.activemq.artemis.jms.client.ActiveMQConnection$JMSFailureListener.connectionFailed(ActiveMQConnection.java:750)
>         at org.apache.activemq.artemis.jms.client.ActiveMQConnection$JMSFailureListener.connectionFailed(ActiveMQConnection.java:771)
>         at org.apache.activemq.artemis.core.client.impl.ClientSessionFactoryImpl.callSessionFailureListeners(ClientSessionFactoryImpl.java:736)
>         at org.apache.activemq.artemis.core.client.impl.ClientSessionFactoryImpl.callSessionFailureListeners(ClientSessionFactoryImpl.java:724)
>         at org.apache.activemq.artemis.core.client.impl.ClientSessionFactoryImpl.failoverOrReconnect(ClientSessionFactoryImpl.java:680)
>         at org.apache.activemq.artemis.core.client.impl.ClientSessionFactoryImpl.handleConnectionFailure(ClientSessionFactoryImpl.java:525)
>         at org.apache.activemq.artemis.core.client.impl.ClientSessionFactoryImpl.access$600(ClientSessionFactoryImpl.java:74)
>         at org.apache.activemq.artemis.core.client.impl.ClientSessionFactoryImpl$DelegatingFailureListener.connectionFailed(ClientSessionFactoryImpl.java:1282)
>         at org.apache.activemq.artemis.spi.core.protocol.AbstractRemotingConnection.callFailureListeners(AbstractRemotingConnection.java:78)
>         at org.apache.activemq.artemis.core.protocol.core.impl.RemotingConnectionImpl.fail(RemotingConnectionImpl.java:220)
>         at org.apache.activemq.artemis.spi.core.protocol.AbstractRemotingConnection.fail(AbstractRemotingConnection.java:221)
>         at org.apache.activemq.artemis.spi.core.protocol.AbstractRemotingConnection.lambda$asyncFail$0(AbstractRemotingConnection.java:228)
>         at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
>         at org.apache.activemq.artemis.utils.actors.OrderedExecutor.doTask(OrderedExecutor.java:42)
>         at org.apache.activemq.artemis.utils.actors.OrderedExecutor.doTask(OrderedExecutor.java:31)
>         at org.apache.activemq.artemis.utils.actors.ProcessorBase.executePendingTasks(ProcessorBase.java:65)
>         at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
>         at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
>         at org.apache.activemq.artemis.utils.ActiveMQThreadFactory$1.run(ActiveMQThreadFactory.java:118)
> Caused by: org.apache.activemq.artemis.api.core.ActiveMQConnectionTimedOutException: AMQ219014: Timed out after waiting 4,000 ms for response when sending packet 45
>         at org.apache.activemq.artemis.core.protocol.core.impl.ChannelImpl.sendBlocking(ChannelImpl.java:530)
>         at org.apache.activemq.artemis.core.protocol.core.impl.ChannelImpl.sendBlocking(ChannelImpl.java:434)
>         at org.apache.activemq.artemis.core.protocol.core.impl.ActiveMQSessionContext.queueQuery(ActiveMQSessionContext.java:359)
>         at org.apache.activemq.artemis.core.client.impl.ClientSessionImpl.queueQuery(ClientSessionImpl.java:780)
>         at org.apache.activemq.artemis.jms.client.ActiveMQSession.checkDestination(ActiveMQSession.java:409)
>         at org.apache.activemq.artemis.jms.client.ActiveMQMessageProducer.doSendx(ActiveMQMessageProducer.java:406)
>         at org.apache.activemq.artemis.jms.client.ActiveMQMessageProducer.send(ActiveMQMessageProducer.java:220)
>         at org.apache.activemq.artemis.jms.client.ActiveMQMessageProducer.send(ActiveMQMessageProducer.java:207)
>         at com.incubedit.communication.bridge.impl.service.BasicServiceServer.sendResponse(BasicServiceServer.java:128)
>         at com.incubedit.communication.bridge.impl.service.ServiceServer.handleMessage(ServiceServer.java:33)
>         at com.incubedit.communication.bridge.impl.service.BasicServiceServer.onNewMessage(BasicServiceServer.java:112)
>         at org.apache.activemq.artemis.jms.client.JMSMessageListenerWrapper.onMessage(JMSMessageListenerWrapper.java:110)
>         at org.apache.activemq.artemis.core.client.impl.ClientConsumerImpl.callOnMessage(ClientConsumerImpl.java:1030)
>         at org.apache.activemq.artemis.core.client.impl.ClientConsumerImpl.access$400(ClientConsumerImpl.java:49)
>         at org.apache.activemq.artemis.core.client.impl.ClientConsumerImpl$Runner.run(ClientConsumerImpl.java:1153)
>         ... 6 common frames omitted
> [ERROR 2021-05-27 00:24:27,948 ad-6433287 nt.control.RestartAgentExceptionListener]: JMS exception error code is: FAILOVER{noformat}
> But even after the failover, every time a message was tried to be sent on that session, it failed with "Connection is destroyed". Other sessions worked fine after the failover:
> {noformat}
> [ERROR 2021-05-27 00:24:28,511 l-threads) n.bridge.impl.service.BasicServiceServer]: JMS Exception sending service call reply! (Request queue: robot_2001.robot_command, Reply queue: ActiveMQTemporaryQueue[e375e12c-65c9-4646-9212-2edcc8b8587c], Request: RobotCommandRequest(info=GotoCommandInfo(super=CommandInfo(source=SYSTEM, id=1487919), gotoOptions=GotoOptions(fineposDirection=FORWARD ec=true LhdOptions( z=0.797) timeout=1800), goalName=WH_Conveyor_Box_Put), type=goto, timestamp=2021-05-26T22:24:28.508373Z),Reply to send: RobotCommandResponse(id=1487919, result=SUCCESS)):
> [ERROR 2021-05-27 00:24:28,511 l-threads) n.bridge.impl.service.BasicServiceServer]: AMQ219010: Connection is destroyed
> javax.jms.JMSException: AMQ219010: Connection is destroyed
>         at org.apache.activemq.artemis.core.protocol.core.impl.ChannelImpl.sendBlocking(ChannelImpl.java:460)
>         at org.apache.activemq.artemis.core.protocol.core.impl.ChannelImpl.sendBlocking(ChannelImpl.java:434)
>         at org.apache.activemq.artemis.core.protocol.core.impl.ActiveMQSessionContext.addressQuery(ActiveMQSessionContext.java:410)
>         at org.apache.activemq.artemis.core.client.impl.ClientSessionImpl.addressQuery(ClientSessionImpl.java:791)
>         at org.apache.activemq.artemis.jms.client.ActiveMQSession.checkDestination(ActiveMQSession.java:388)
>         at org.apache.activemq.artemis.jms.client.ActiveMQMessageProducer.doSendx(ActiveMQMessageProducer.java:406)
>         at org.apache.activemq.artemis.jms.client.ActiveMQMessageProducer.send(ActiveMQMessageProducer.java:220)
>         at org.apache.activemq.artemis.jms.client.ActiveMQMessageProducer.send(ActiveMQMessageProducer.java:207)
>         at com.incubedit.communication.bridge.impl.service.BasicServiceServer.sendResponse(BasicServiceServer.java:128)
>         at com.incubedit.communication.bridge.impl.service.ServiceServer.handleMessage(ServiceServer.java:33)
>         at com.incubedit.communication.bridge.impl.service.BasicServiceServer.onNewMessage(BasicServiceServer.java:112)
>         at org.apache.activemq.artemis.jms.client.JMSMessageListenerWrapper.onMessage(JMSMessageListenerWrapper.java:110)
>         at org.apache.activemq.artemis.core.client.impl.ClientConsumerImpl.callOnMessage(ClientConsumerImpl.java:1030)
>         at org.apache.activemq.artemis.core.client.impl.ClientConsumerImpl.access$400(ClientConsumerImpl.java:49)
>         at org.apache.activemq.artemis.core.client.impl.ClientConsumerImpl$Runner.run(ClientConsumerImpl.java:1153)
>         at org.apache.activemq.artemis.utils.actors.OrderedExecutor.doTask(OrderedExecutor.java:42)
>         at org.apache.activemq.artemis.utils.actors.OrderedExecutor.doTask(OrderedExecutor.java:31)
>         at org.apache.activemq.artemis.utils.actors.ProcessorBase.executePendingTasks(ProcessorBase.java:65)
>         at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
>         at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
>         at org.apache.activemq.artemis.utils.ActiveMQThreadFactory$1.run(ActiveMQThreadFactory.java:118)
> Caused by: org.apache.activemq.artemis.api.core.ActiveMQNotConnectedException: AMQ219010: Connection is destroyed
>         ... 21 common frames omitted
> {noformat}
> h3. Cause of the issue
> After reconnect, every session's "handleFailover" method is called to transfer the session to the new connection. But if handleFailover fails for one session, the reconnect immediately returns and tries again:
>  
> {noformat}
> for (ClientSessionInternal session : sessionsToFailover) {
>          if (!session.handleFailover(connection, cause)) {
>             return false;
>          }
> }{noformat}
>  
> The issue here is that "handleFailover" is not executed for every session:
>  * During handleFailover, "transferConnection" is called on the session's ChannelImpl, which removes the session's channel from the old connection and puts it onto the new one. If transferConnection is not executed for a session, this session's channel stays on the old connection.
>  * During connection destroy, every channel set on that connection is closed via Channel.close
>  * Now if one of the sessions wouldn't successfully execute handleFailover, the method would immediately return and not remove the remaining session's channels from the old connection. And as before reconnecting, the old connection (which still contains some session channels) is destroyed, those session's channels are also closed.
> Even if handleFailover is afterwards successfully executed on those sessions, the "closed" flag is already set on their channels, so they will reject future sends.
> h3. Possible solution
> It needs to be ensured that all session's channels are transferred (=removed from the old connection) before the old connection is destroyed, even if an earlier session failed to be transferred on the server-side.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)