You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues@activemq.apache.org by "ASF GitHub Bot (Jira)" <ji...@apache.org> on 2021/06/07 14:58:00 UTC

[jira] [Work logged] (ARTEMIS-3337) CORE Client sessions are closed when there are errors during failover

     [ https://issues.apache.org/jira/browse/ARTEMIS-3337?focusedWorklogId=607927&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-607927 ]

ASF GitHub Bot logged work on ARTEMIS-3337:
-------------------------------------------

                Author: ASF GitHub Bot
            Created on: 07/Jun/21 14:57
            Start Date: 07/Jun/21 14:57
    Worklog Time Spent: 10m 
      Work Description: meierhofer08 opened a new pull request #3613:
URL: https://github.com/apache/activemq-artemis/pull/3613


   Previously, when during reconnect one session couldn't be transferred
   to the new connection, we instantly returned and didn't execute failover
   for the other sessions. This produced the issue that for sessions
   where no failover was executed, their channels were still present on the
   old connection. When the old connection was then destroyed, these channels
   were closed although the reconnect was still ongoing, which lead to
   "dead" sessions.
   
   Now, we always execute failover for every session so that the channels
   are guaranteed to be removed from the old connection before it is destroyed.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


Issue Time Tracking
-------------------

            Worklog Id:     (was: 607927)
    Remaining Estimate: 0h
            Time Spent: 10m

> CORE Client sessions are closed when there are errors during failover
> ---------------------------------------------------------------------
>
>                 Key: ARTEMIS-3337
>                 URL: https://issues.apache.org/jira/browse/ARTEMIS-3337
>             Project: ActiveMQ Artemis
>          Issue Type: Bug
>    Affects Versions: 2.17.0
>         Environment: {noformat}
>  {noformat}
>            Reporter: Markus Meierhofer
>            Priority: Major
>          Time Spent: 10m
>  Remaining Estimate: 0h
>
> h3. The issue
> When a CORE client does failover to a new connection, it usually transfers all sessions from the old to the new connection after it's able to establish the new connection. When the session transfer to the new connection fails, it assumes the new connection is also broken and does the same steps again (get new connection, transfer all sessions) until the reconnectAttempts have been reached. But after this failed session transfer, there seems to be a bug that sessions can get closed.
>  
> One occurrence of the bug:
> A client has lost connection due to bad Wifi and tried to reconnect. After reconnect, the sessions were transferred, but one of the sessions couldn't be transferred (Recognizable by the "failed to handle failover" error):
> {noformat}
> [ERROR 2021-05-27 00:24:08,175 l-threads)  org.apache.activemq.artemis.core.client]: AMQ214003: Failed to handle failover
> org.apache.activemq.artemis.api.core.ActiveMQConnectionTimedOutException: AMQ219014: Timed out after waiting 4,000 ms for response when sending packet 32
>         at org.apache.activemq.artemis.core.protocol.core.impl.ChannelImpl.sendBlocking(ChannelImpl.java:530)
>         at org.apache.activemq.artemis.core.protocol.core.impl.ChannelImpl.sendBlocking(ChannelImpl.java:434)
>         at org.apache.activemq.artemis.core.protocol.core.impl.ActiveMQSessionContext.reattachOnNewConnection(ActiveMQSessionContext.java:836)
>         at org.apache.activemq.artemis.core.client.impl.ClientSessionImpl.handleFailover(ClientSessionImpl.java:1393)
>         at org.apache.activemq.artemis.core.client.impl.ClientSessionFactoryImpl.reconnectSessions(ClientSessionFactoryImpl.java:806)
>         at org.apache.activemq.artemis.core.client.impl.ClientSessionFactoryImpl.failoverOrReconnect(ClientSessionFactoryImpl.java:638)
>         at org.apache.activemq.artemis.core.client.impl.ClientSessionFactoryImpl.handleConnectionFailure(ClientSessionFactoryImpl.java:525)
>         at org.apache.activemq.artemis.core.client.impl.ClientSessionFactoryImpl.access$600(ClientSessionFactoryImpl.java:74)
>         at org.apache.activemq.artemis.core.client.impl.ClientSessionFactoryImpl$DelegatingFailureListener.connectionFailed(ClientSessionFactoryImpl.java:1282)
>         at org.apache.activemq.artemis.spi.core.protocol.AbstractRemotingConnection.callFailureListeners(AbstractRemotingConnection.java:78)
>         at org.apache.activemq.artemis.core.protocol.core.impl.RemotingConnectionImpl.fail(RemotingConnectionImpl.java:220)
>         at org.apache.activemq.artemis.spi.core.protocol.AbstractRemotingConnection.fail(AbstractRemotingConnection.java:221)
>         at org.apache.activemq.artemis.spi.core.protocol.AbstractRemotingConnection.lambda$asyncFail$0(AbstractRemotingConnection.java:228)
>         at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
>         at org.apache.activemq.artemis.utils.actors.OrderedExecutor.doTask(OrderedExecutor.java:42)
>         at org.apache.activemq.artemis.utils.actors.OrderedExecutor.doTask(OrderedExecutor.java:31)
>         at org.apache.activemq.artemis.utils.actors.ProcessorBase.executePendingTasks(ProcessorBase.java:65)
>         at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
>         at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
>         at org.apache.activemq.artemis.utils.ActiveMQThreadFactory$1.run(ActiveMQThreadFactory.java:118){noformat}
> Afterwards, although the failover was still ongoing, a producer tried to send a message, but got the error "Connection is destroyed". This error is only printed if a session's underlying ChannelImpl is closed:
> {noformat}
> [ERROR 2021-05-27 00:24:15,526 l-threads) n.bridge.impl.service.BasicServiceServer]: JMS Exception sending service call reply! (Request queue: robot_2001.robot_command, Reply queue: ActiveMQTemporaryQueue[e375e12c-65c9-4646-9212-2edcc8b8587c], Request: RobotCommandRequest(info=SlotCommandInfo(super=CommandInfo(source=SYSTEM, id=1486607), gotoOptions=GotoOptions(fineposDirection=FORWARD ec=true LhdOptions( z=0.6) timeout=1800), transferOptions=TransferOptions(SECONDARY_SLOT loadUnits=[LoadUnit(code=, orderId=1486594, verify=false, empty=false, size=1)]), goalName=LMC_H1), type=unload_at, timestamp=2021-05-26T22:24:03.450405Z),Reply to send: RobotCommandResponse(id=1486607, result=REQUEST_TOO_OLD)):
> [ERROR 2021-05-27 00:24:15,526 l-threads) n.bridge.impl.service.BasicServiceServer]: AMQ219010: Connection is destroyed
> javax.jms.JMSException: AMQ219010: Connection is destroyed
>         at org.apache.activemq.artemis.core.protocol.core.impl.ChannelImpl.sendBlocking(ChannelImpl.java:460)
>         at org.apache.activemq.artemis.core.protocol.core.impl.ChannelImpl.sendBlocking(ChannelImpl.java:434)
>         at org.apache.activemq.artemis.core.protocol.core.impl.ActiveMQSessionContext.addressQuery(ActiveMQSessionContext.java:410)
>         at org.apache.activemq.artemis.core.client.impl.ClientSessionImpl.addressQuery(ClientSessionImpl.java:791)
>         at org.apache.activemq.artemis.jms.client.ActiveMQSession.checkDestination(ActiveMQSession.java:388)
>         at org.apache.activemq.artemis.jms.client.ActiveMQMessageProducer.doSendx(ActiveMQMessageProducer.java:406)
>         at org.apache.activemq.artemis.jms.client.ActiveMQMessageProducer.send(ActiveMQMessageProducer.java:220)
>         at org.apache.activemq.artemis.jms.client.ActiveMQMessageProducer.send(ActiveMQMessageProducer.java:207)
>         at com.incubedit.communication.bridge.impl.service.BasicServiceServer.sendResponse(BasicServiceServer.java:128)
>         at com.incubedit.communication.bridge.impl.service.ServiceServer.handleMessage(ServiceServer.java:33)
>         at com.incubedit.communication.bridge.impl.service.BasicServiceServer.onNewMessage(BasicServiceServer.java:112)
>         at org.apache.activemq.artemis.jms.client.JMSMessageListenerWrapper.onMessage(JMSMessageListenerWrapper.java:110)
>         at org.apache.activemq.artemis.core.client.impl.ClientConsumerImpl.callOnMessage(ClientConsumerImpl.java:1030)
>         at org.apache.activemq.artemis.core.client.impl.ClientConsumerImpl.access$400(ClientConsumerImpl.java:49)
>         at org.apache.activemq.artemis.core.client.impl.ClientConsumerImpl$Runner.run(ClientConsumerImpl.java:1153)
>         at org.apache.activemq.artemis.utils.actors.OrderedExecutor.doTask(OrderedExecutor.java:42)
>         at org.apache.activemq.artemis.utils.actors.OrderedExecutor.doTask(OrderedExecutor.java:31)
>         at org.apache.activemq.artemis.utils.actors.ProcessorBase.executePendingTasks(ProcessorBase.java:65)
>         at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
>         at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
>         at org.apache.activemq.artemis.utils.ActiveMQThreadFactory$1.run(ActiveMQThreadFactory.java:118)
> Caused by: org.apache.activemq.artemis.api.core.ActiveMQNotConnectedException: AMQ219010: Connection is destroyed
>         ... 21 common frames omitted{noformat}
> ~12 seconds later, the connection "successfully" failovered:
> {noformat}
> [ERROR 2021-05-27 00:24:27,948 ad-6433287 nt.control.RestartAgentExceptionListener]: JMS exception listener retrieved exception!
> javax.jms.JMSException: ActiveMQConnectionTimedOutException[errorType=CONNECTION_TIMEDOUT message=AMQ219014: Timed out after waiting 4,000 ms for response when sending packet 45]
>         at org.apache.activemq.artemis.jms.client.ActiveMQConnection$JMSFailureListener.connectionFailed(ActiveMQConnection.java:750)
>         at org.apache.activemq.artemis.jms.client.ActiveMQConnection$JMSFailureListener.connectionFailed(ActiveMQConnection.java:771)
>         at org.apache.activemq.artemis.core.client.impl.ClientSessionFactoryImpl.callSessionFailureListeners(ClientSessionFactoryImpl.java:736)
>         at org.apache.activemq.artemis.core.client.impl.ClientSessionFactoryImpl.callSessionFailureListeners(ClientSessionFactoryImpl.java:724)
>         at org.apache.activemq.artemis.core.client.impl.ClientSessionFactoryImpl.failoverOrReconnect(ClientSessionFactoryImpl.java:680)
>         at org.apache.activemq.artemis.core.client.impl.ClientSessionFactoryImpl.handleConnectionFailure(ClientSessionFactoryImpl.java:525)
>         at org.apache.activemq.artemis.core.client.impl.ClientSessionFactoryImpl.access$600(ClientSessionFactoryImpl.java:74)
>         at org.apache.activemq.artemis.core.client.impl.ClientSessionFactoryImpl$DelegatingFailureListener.connectionFailed(ClientSessionFactoryImpl.java:1282)
>         at org.apache.activemq.artemis.spi.core.protocol.AbstractRemotingConnection.callFailureListeners(AbstractRemotingConnection.java:78)
>         at org.apache.activemq.artemis.core.protocol.core.impl.RemotingConnectionImpl.fail(RemotingConnectionImpl.java:220)
>         at org.apache.activemq.artemis.spi.core.protocol.AbstractRemotingConnection.fail(AbstractRemotingConnection.java:221)
>         at org.apache.activemq.artemis.spi.core.protocol.AbstractRemotingConnection.lambda$asyncFail$0(AbstractRemotingConnection.java:228)
>         at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
>         at org.apache.activemq.artemis.utils.actors.OrderedExecutor.doTask(OrderedExecutor.java:42)
>         at org.apache.activemq.artemis.utils.actors.OrderedExecutor.doTask(OrderedExecutor.java:31)
>         at org.apache.activemq.artemis.utils.actors.ProcessorBase.executePendingTasks(ProcessorBase.java:65)
>         at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
>         at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
>         at org.apache.activemq.artemis.utils.ActiveMQThreadFactory$1.run(ActiveMQThreadFactory.java:118)
> Caused by: org.apache.activemq.artemis.api.core.ActiveMQConnectionTimedOutException: AMQ219014: Timed out after waiting 4,000 ms for response when sending packet 45
>         at org.apache.activemq.artemis.core.protocol.core.impl.ChannelImpl.sendBlocking(ChannelImpl.java:530)
>         at org.apache.activemq.artemis.core.protocol.core.impl.ChannelImpl.sendBlocking(ChannelImpl.java:434)
>         at org.apache.activemq.artemis.core.protocol.core.impl.ActiveMQSessionContext.queueQuery(ActiveMQSessionContext.java:359)
>         at org.apache.activemq.artemis.core.client.impl.ClientSessionImpl.queueQuery(ClientSessionImpl.java:780)
>         at org.apache.activemq.artemis.jms.client.ActiveMQSession.checkDestination(ActiveMQSession.java:409)
>         at org.apache.activemq.artemis.jms.client.ActiveMQMessageProducer.doSendx(ActiveMQMessageProducer.java:406)
>         at org.apache.activemq.artemis.jms.client.ActiveMQMessageProducer.send(ActiveMQMessageProducer.java:220)
>         at org.apache.activemq.artemis.jms.client.ActiveMQMessageProducer.send(ActiveMQMessageProducer.java:207)
>         at com.incubedit.communication.bridge.impl.service.BasicServiceServer.sendResponse(BasicServiceServer.java:128)
>         at com.incubedit.communication.bridge.impl.service.ServiceServer.handleMessage(ServiceServer.java:33)
>         at com.incubedit.communication.bridge.impl.service.BasicServiceServer.onNewMessage(BasicServiceServer.java:112)
>         at org.apache.activemq.artemis.jms.client.JMSMessageListenerWrapper.onMessage(JMSMessageListenerWrapper.java:110)
>         at org.apache.activemq.artemis.core.client.impl.ClientConsumerImpl.callOnMessage(ClientConsumerImpl.java:1030)
>         at org.apache.activemq.artemis.core.client.impl.ClientConsumerImpl.access$400(ClientConsumerImpl.java:49)
>         at org.apache.activemq.artemis.core.client.impl.ClientConsumerImpl$Runner.run(ClientConsumerImpl.java:1153)
>         ... 6 common frames omitted
> [ERROR 2021-05-27 00:24:27,948 ad-6433287 nt.control.RestartAgentExceptionListener]: JMS exception error code is: FAILOVER{noformat}
> But even after the failover, every time a message was tried to be sent on that session, it failed with "Connection is destroyed". Other sessions worked fine after the failover:
> {noformat}
> [ERROR 2021-05-27 00:24:28,511 l-threads) n.bridge.impl.service.BasicServiceServer]: JMS Exception sending service call reply! (Request queue: robot_2001.robot_command, Reply queue: ActiveMQTemporaryQueue[e375e12c-65c9-4646-9212-2edcc8b8587c], Request: RobotCommandRequest(info=GotoCommandInfo(super=CommandInfo(source=SYSTEM, id=1487919), gotoOptions=GotoOptions(fineposDirection=FORWARD ec=true LhdOptions( z=0.797) timeout=1800), goalName=WH_Conveyor_Box_Put), type=goto, timestamp=2021-05-26T22:24:28.508373Z),Reply to send: RobotCommandResponse(id=1487919, result=SUCCESS)):
> [ERROR 2021-05-27 00:24:28,511 l-threads) n.bridge.impl.service.BasicServiceServer]: AMQ219010: Connection is destroyed
> javax.jms.JMSException: AMQ219010: Connection is destroyed
>         at org.apache.activemq.artemis.core.protocol.core.impl.ChannelImpl.sendBlocking(ChannelImpl.java:460)
>         at org.apache.activemq.artemis.core.protocol.core.impl.ChannelImpl.sendBlocking(ChannelImpl.java:434)
>         at org.apache.activemq.artemis.core.protocol.core.impl.ActiveMQSessionContext.addressQuery(ActiveMQSessionContext.java:410)
>         at org.apache.activemq.artemis.core.client.impl.ClientSessionImpl.addressQuery(ClientSessionImpl.java:791)
>         at org.apache.activemq.artemis.jms.client.ActiveMQSession.checkDestination(ActiveMQSession.java:388)
>         at org.apache.activemq.artemis.jms.client.ActiveMQMessageProducer.doSendx(ActiveMQMessageProducer.java:406)
>         at org.apache.activemq.artemis.jms.client.ActiveMQMessageProducer.send(ActiveMQMessageProducer.java:220)
>         at org.apache.activemq.artemis.jms.client.ActiveMQMessageProducer.send(ActiveMQMessageProducer.java:207)
>         at com.incubedit.communication.bridge.impl.service.BasicServiceServer.sendResponse(BasicServiceServer.java:128)
>         at com.incubedit.communication.bridge.impl.service.ServiceServer.handleMessage(ServiceServer.java:33)
>         at com.incubedit.communication.bridge.impl.service.BasicServiceServer.onNewMessage(BasicServiceServer.java:112)
>         at org.apache.activemq.artemis.jms.client.JMSMessageListenerWrapper.onMessage(JMSMessageListenerWrapper.java:110)
>         at org.apache.activemq.artemis.core.client.impl.ClientConsumerImpl.callOnMessage(ClientConsumerImpl.java:1030)
>         at org.apache.activemq.artemis.core.client.impl.ClientConsumerImpl.access$400(ClientConsumerImpl.java:49)
>         at org.apache.activemq.artemis.core.client.impl.ClientConsumerImpl$Runner.run(ClientConsumerImpl.java:1153)
>         at org.apache.activemq.artemis.utils.actors.OrderedExecutor.doTask(OrderedExecutor.java:42)
>         at org.apache.activemq.artemis.utils.actors.OrderedExecutor.doTask(OrderedExecutor.java:31)
>         at org.apache.activemq.artemis.utils.actors.ProcessorBase.executePendingTasks(ProcessorBase.java:65)
>         at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
>         at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
>         at org.apache.activemq.artemis.utils.ActiveMQThreadFactory$1.run(ActiveMQThreadFactory.java:118)
> Caused by: org.apache.activemq.artemis.api.core.ActiveMQNotConnectedException: AMQ219010: Connection is destroyed
>         ... 21 common frames omitted
> {noformat}
> h3. Cause of the issue
> After reconnect, every session's "handleFailover" method is called to transfer the session to the new connection. But if handleFailover fails for one session, the reconnect immediately returns and tries again:
>  
> {noformat}
> for (ClientSessionInternal session : sessionsToFailover) {
>          if (!session.handleFailover(connection, cause)) {
>             return false;
>          }
> }{noformat}
>  
> The issue here is that "handleFailover" is not executed for every session:
>  * During handleFailover, "transferConnection" is called on the session's ChannelImpl, which removes the session's channel from the old connection and puts it onto the new one. If transferConnection is not executed for a session, this session's channel stays on the old connection.
>  * During connection destroy, every channel set on that connection is closed via Channel.close
>  * Now if one of the sessions wouldn't successfully execute handleFailover, the method would immediately return and not remove the remaining session's channels from the old connection. And as before reconnecting, the old connection (which still contains some session channels) is destroyed, those session's channels are also closed.
> Even if handleFailover is afterwards successfully executed on those sessions, the "closed" flag is already set on their channels, so they will reject future sends.
> h3. Possible solution
> It needs to be ensured that all session's channels are transferred (=removed from the old connection) before the old connection is destroyed, even if an earlier session failed to be transferred on the server-side.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)