You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues@activemq.apache.org by "Markus Meierhofer (Jira)" <ji...@apache.org> on 2020/11/05 15:31:00 UTC
[jira] [Comment Edited] (ARTEMIS-2870) CORE connection failure sometimes doesn't cleanup sessions

    [ https://issues.apache.org/jira/browse/ARTEMIS-2870?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17226771#comment-17226771 ] 

Markus Meierhofer edited comment on ARTEMIS-2870 at 11/5/20, 3:30 PM:
----------------------------------------------------------------------

Hello!

 

We unfortunately still couldn't reproduce the issue with Artemis version 2.15.0 with any test script, but we experienced the issue occurring on a production site (version 2.15.0) again.

We tried to document the scenario as detailed as possible, so that the issue may be easier to find.

 

How the issue was traced in the Artemis Management Console

When checking the "Queues" section, we sorted by "Consumer Count" and saw that there are multiple consumers connected on many of the queues.
 This is usually not possible in our system as we use the JMS abstraction where there's only one consumer per queue:

!multiple_consumers_per_queue.png!

We listed the "Consumers" of a specific queue (Note the Remote Address "10.0.0.2:46822", which we will refer to later in the artemis logs):

!consumer_list_for_one_queue.png!

We inspected the oldest session in the list, "16edc17c..." (note the connection ID 7efb3c90):

!session_with_connection_id.png!

When clicking on the connection ID (searching for a connection with ID 7efb3c90), no connection was found:

!connection_nonexistent.png!

Also, we checked the list of all connection, the connection with ID 7efb3c90 is not present there:

!all_connections_list.png!

 

*How the issue can be traced in the broker logs*

In the broker logs, we checked the dead connection via its Remote Address (is the same for every session in that connection) "10.0.0.2:46822".
 We were able to find a log line stating that the connection was closed, but there were no following lines mentioning that the sessions for this connection were cleaned up.
 We explicitly didn't find a "Cleared up resources for session" for session ID "16edc17c..." (The one we inspected in the management console, which links to the already closed connection):

 
{noformat}
.
.
.
[WARN 2020-11-04 09:22:16,452 Thread-21   org.apache.activemq.artemis.core.client]: AMQ212037: Connection failure to /10.0.0.2:46822 has been detected: AMQ229014: Did not receive data from /10.0.0.2:46822 within the 60,000ms connection TTL. The connection will now be closed. [code=CONNECTION_TIMEDOUT]
[WARN 2020-11-04 09:25:42,464 Thread-21   org.apache.activemq.artemis.core.client]: AMQ212037: Connection failure to /10.0.0.2:47944 has been detected: AMQ229014: Did not receive data from /10.0.0.2:47944 within the 60,000ms connection TTL. The connection will now be closed. [code=CONNECTION_TIMEDOUT]
[WARN 2020-11-04 09:25:58,465 Thread-3 (  org.apache.activemq.artemis.core.client]: AMQ212037: Connection failure to /10.0.0.2:47946 has been detected: AMQ229014: Did not receive data from /10.0.0.2:47946 within the 60,000ms connection TTL. The connection will now be closed. [code=CONNECTION_TIMEDOUT]
[WARN 2020-11-04 09:26:16,466 Thread-11   org.apache.activemq.artemis.core.client]: AMQ212037: Connection failure to /10.0.0.2:48046 has been detected: AMQ229014: Did not receive data from /10.0.0.2:48046 within the 60,000ms connection TTL. The connection will now be closed. [code=CONNECTION_TIMEDOUT]
[WARN 2020-11-04 09:26:32,467 Thread-6 (  org.apache.activemq.artemis.core.client]: AMQ212037: Connection failure to /10.0.0.2:48096 has been detected: AMQ229014: Did not receive data from /10.0.0.2:48096 within the 60,000ms connection TTL. The connection will now be closed. [code=CONNECTION_TIMEDOUT]
[WARN 2020-11-04 15:36:12,023 Thread-24   org.apache.activemq.artemis.core.client]: AMQ212037: Connection failure to /10.0.0.2:48106 has been detected: AMQ229014: Did not receive data from /10.0.0.2:48106 within the 60,000ms connection TTL. The connection will now be closed. [code=CONNECTION_TIMEDOUT] 
.
.
.{noformat}
We also attached the full broker log: [^broker.log]

 

 

*Further steps*

Please let us know whether you need more detailed infos/logs etc., so that we can provide them when the issue happens again.

While this bug still exists, we are planning to do provide a workaround by running an Artemis Client with a "Cleanup Thread" that constantly checks (via the JMX management interface) if there are sessions that link to already closed connections and, if there are, closes them. Is that possible with the methods your JMX management interface provides?

 

Thank you in advance for your help.

 

 


was (Author: mmeierhofer):
Hello!

 

We unfortunately still couldn't reproduce the issue with Artemis version 2.15.0 with any test script, but we experienced the issue occurring on a production site again.

We tried to document the scenario as detailed as possible, so that the issue may be easier to find.

 

How the issue was traced in the Artemis Management Console

When checking the "Queues" section, we sorted by "Consumer Count" and saw that there are multiple consumers connected on many of the queues.
This is usually not possible in our system as we use the JMS abstraction where there's only one consumer per queue:

!multiple_consumers_per_queue.png!

We listed the "Consumers" of a specific queue (Note the Remote Address "10.0.0.2:46822", which we will refer to later in the artemis logs):

!consumer_list_for_one_queue.png!

We inspected the oldest session in the list, "16edc17c..." (note the connection ID 7efb3c90):

!session_with_connection_id.png!

When clicking on the connection ID (searching for a connection with ID 7efb3c90), no connection was found:

!connection_nonexistent.png!

Also, we checked the list of all connection, the connection with ID 7efb3c90 is not present there:

!all_connections_list.png!

 

*How the issue can be traced in the broker logs*

In the broker logs, we checked the dead connection via its Remote Address (is the same for every session in that connection) "10.0.0.2:46822".
We were able to find a log line stating that the connection was closed, but there were no following lines mentioning that the sessions for this connection were cleaned up.
We explicitly didn't find a "Cleared up resources for session" for session ID "16edc17c..." (The one we inspected in the management console, which links to the already closed connection):

 
{noformat}
.
.
.
[WARN 2020-11-04 09:22:16,452 Thread-21   org.apache.activemq.artemis.core.client]: AMQ212037: Connection failure to /10.0.0.2:46822 has been detected: AMQ229014: Did not receive data from /10.0.0.2:46822 within the 60,000ms connection TTL. The connection will now be closed. [code=CONNECTION_TIMEDOUT]
[WARN 2020-11-04 09:25:42,464 Thread-21   org.apache.activemq.artemis.core.client]: AMQ212037: Connection failure to /10.0.0.2:47944 has been detected: AMQ229014: Did not receive data from /10.0.0.2:47944 within the 60,000ms connection TTL. The connection will now be closed. [code=CONNECTION_TIMEDOUT]
[WARN 2020-11-04 09:25:58,465 Thread-3 (  org.apache.activemq.artemis.core.client]: AMQ212037: Connection failure to /10.0.0.2:47946 has been detected: AMQ229014: Did not receive data from /10.0.0.2:47946 within the 60,000ms connection TTL. The connection will now be closed. [code=CONNECTION_TIMEDOUT]
[WARN 2020-11-04 09:26:16,466 Thread-11   org.apache.activemq.artemis.core.client]: AMQ212037: Connection failure to /10.0.0.2:48046 has been detected: AMQ229014: Did not receive data from /10.0.0.2:48046 within the 60,000ms connection TTL. The connection will now be closed. [code=CONNECTION_TIMEDOUT]
[WARN 2020-11-04 09:26:32,467 Thread-6 (  org.apache.activemq.artemis.core.client]: AMQ212037: Connection failure to /10.0.0.2:48096 has been detected: AMQ229014: Did not receive data from /10.0.0.2:48096 within the 60,000ms connection TTL. The connection will now be closed. [code=CONNECTION_TIMEDOUT]
[WARN 2020-11-04 15:36:12,023 Thread-24   org.apache.activemq.artemis.core.client]: AMQ212037: Connection failure to /10.0.0.2:48106 has been detected: AMQ229014: Did not receive data from /10.0.0.2:48106 within the 60,000ms connection TTL. The connection will now be closed. [code=CONNECTION_TIMEDOUT] 
.
.
.{noformat}
We also attached the full broker log: [^broker.log]

 

 

*Further steps*

Please let us know whether you need more detailed infos/logs etc., so that we can provide them when the issue happens again.

While this bug still exists, we are planning to do provide a workaround by running an Artemis Client with a "Cleanup Thread" that constantly checks (via the JMX management interface) if there are sessions that link to already closed connections and, if there are, closes them. Is that possible with the methods your JMX management interface provides?

 

Thank you in advance for your help.

 

 

> CORE connection failure sometimes doesn't cleanup sessions
> ----------------------------------------------------------
>
>                 Key: ARTEMIS-2870
>                 URL: https://issues.apache.org/jira/browse/ARTEMIS-2870
>             Project: ActiveMQ Artemis
>          Issue Type: Bug
>          Components: Broker
>    Affects Versions: 2.10.1, 2.14.0, 2.15.0
>            Reporter: Markus Meierhofer
>            Priority: Blocker
>         Attachments: all_connections_list.png, artemis.log, broker.log, broker.xml, connection_nonexistent.png, consumer_list_for_one_queue.png, duplicated consumers.png, multiple_consumers_per_queue.png, session_with_connection_id.png, three consumers per queue.png
>
>
> h3. Summary
> Since the upgrade of our deployed artemis instances from version 2.6.4 to 2.10.1 we have noticed the problem that sometimes, a connection failure doesn't include the cleanup of its connected sessions, leading to "zombie" consumers and producers on queues.
>  
> h3. The issue
> Our Artemis Clients are connected to the broker via the provided JMS abstraction, using the default connection TTL of 60 seconds. we are using both JMS Topics and JMS Queues.
> As most of our Clients are mobile and in a WiFi, connection losses may occur frequently, depending on the quality of the network. When the client is disconnected for 60 seconds, the broker usually closes the connection and cleans up all the sessions connected to it. The mobile Clients then create reconnect when they are online again. What we have noticed is that after many connection failures, messages may to be sent twice to the mobile clients. When analyzing the problem on the broker console, we found out that there were two consumers connected to each of the queues one mobile client usually consumes from. One of them belonged to the new connection of the mobile Client, which is fine.
> The other consumer belonged to a session whose connection already failed and was closed at that time. When analyzing the logs, we saw that for these connections, it contained a "Connection failure to ... has been detected" line, but no following "clearing up resources for session ..." log lines for these connections.
>  
> h3. Instance of the issue
>  
> The broken Session is the "7a9292cb-xxx" in the picture. In the logs you can see that the connection failure was detected, but the session was never cleared by the broker (mind the timestamp).
> !duplicated consumers.png!
> {code:java}
> [WARN 2020-07-27 14:33:29,794  Thread-13  org.apache.activemq.artemis.core.client]: AMQ212037: Connection failure to /10.255.0.2:54812 has been detected: syscall:read(..) failed: Connection reset by peer [code=GENERIC_EXCEPTION]
> [WARN 2020-07-29 09:31:30,828 Thread-20   org.apache.activemq.artemis.core.client]: AMQ212037: Connection failure to /10.255.0.2:55994 has been detected: AMQ229014: Did not receive data from /10.255.0.2:55994 within the 60,000ms connection TTL. The connection will now be closed. [code=CONNECTION_TIMEDOUT]
> {code}
>  
> Attached you can find the full [^artemis.log] and our [^broker.xml]



--
This message was sent by Atlassian Jira
(v8.3.4#803005)