Posted to dev@activemq.apache.org by "John Heitmann (JIRA)" <ji...@apache.org> on 2006/08/22 09:19:23 UTC

[jira] Created: (AMQ-889) Broker 'loses' messages and leaks resources when handling duplicate connections

Broker 'loses' messages and leaks resources when handling duplicate connections
-------------------------------------------------------------------------------

                 Key: AMQ-889
                 URL: https://issues.apache.org/activemq/browse/AMQ-889
             Project: ActiveMQ
          Issue Type: Bug
          Components: Broker
    Affects Versions: 4.0.2
            Reporter: John Heitmann
         Attachments: doubleConsumer.patch.gz

A client that uses a transport like the failover transport can damage the broker's ability to deliver messages to other clients and ultimately cause broker resource leaks and message loss. I've found 4 issues starting on the client and ending on the broker that could be improved to make the situation a lot better. In this issue I've provided a patch for #3.

1) A failover client stores session metadata commands like ConsumerInfo in a local state tracker. When failover occurs it replays these commands verbatim to the newly connected-to broker. If the failover transport fails back to the original broker, it replays the same commands, with the same ids, that it already sent that broker. If the failover happens before the broker notices the old connection has gone, this can result in bad mojo. Clients should probably regenerate session, consumer, and maybe connection ids.
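To illustrate the replay behavior described above, here is a minimal, hypothetical sketch (not the actual ActiveMQ transport code; all class and method names are illustrative): the tracker caches subscribe commands and replays them verbatim, ids included, so a fail-back re-sends ids the original broker may still consider live.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch of a failover state tracker. It remembers session
// commands so they can be replayed after a reconnect. Because the cached
// ConsumerInfo keeps its original id, failing back to the same broker
// re-sends an id that broker may still think belongs to a live consumer.
public class FailoverReplaySketch {
    static class ConsumerInfo {
        final String consumerId;
        final String destination;
        ConsumerInfo(String consumerId, String destination) {
            this.consumerId = consumerId;
            this.destination = destination;
        }
    }

    private final Map<String, ConsumerInfo> tracked = new LinkedHashMap<>();

    // Called when the client subscribes; the command is remembered for replay.
    void onAddConsumer(ConsumerInfo info) {
        tracked.put(info.consumerId, info);
    }

    // On reconnect, the exact same commands go back out. If this is a
    // fail-back, the broker receives a duplicate ConsumerId.
    List<ConsumerInfo> replay() {
        return new ArrayList<>(tracked.values());
    }
}
```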

2) When the broker detects a duplicate ClientId it throws an exception saying so, but this does not stop the broker from processing subsequent messages on that connection. The broker should tear down the connection immediately when it sees a client thrashing about.

3) When a broker receives a certain series of ConsumerInfo add and remove commands with the same ConsumerId it leaks resources. One of the resources leaked is the knowledge of lock owners on messages out in prefetch buffers. This means those messages are stuck forever on the broker and can never be retrieved and never be gc()ed. More below.

4) Messages locked and out in prefetch buffers have no broker-side timeout. If a consumer is up, saying hello to the inactivity monitor, but otherwise doing nothing then its messages are locked forever. The broker should have a janitor that redrives stale messages. This seems like the hardest of the 4 to fix, but is one of the most important.
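The janitor proposed in #4 could be sketched as follows. This is purely illustrative of the idea (ActiveMQ 4.x has no such broker-side timeout, and all names here are hypothetical): periodically scan in-flight dispatches and collect those older than a deadline for unlock and redelivery.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of a broker-side "janitor" for issue #4: find
// messages that have sat in a consumer's prefetch buffer past a timeout
// so their locks can be released and the messages redriven.
public class PrefetchJanitorSketch {
    static class Dispatched {
        final String messageId;
        final long dispatchedAt; // millis when sent to the consumer
        Dispatched(String messageId, long dispatchedAt) {
            this.messageId = messageId;
            this.dispatchedAt = dispatchedAt;
        }
    }

    // Returns the ids of messages that should be unlocked and redelivered.
    static List<String> findStale(List<Dispatched> inFlight, long now, long timeoutMs) {
        List<String> stale = new ArrayList<>();
        for (Dispatched d : inFlight) {
            if (now - d.dispatchedAt > timeoutMs) {
                stale.add(d.messageId);
            }
        }
        return stale;
    }
}
```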

More on #3: One bad sequence of events is:

1) Consumer 'c' connects to the broker over a failover transport. 
2) c subscribes to a queue and addConsumer() gets called. 
3) c fails away and then fails back.
4) c replays ConsumerInfo to the broker. addConsumer() gets called again and overwrites subscription tracking from the first.

After this the broker will eventually get a double remove and there will be noisy JMX complaints etc., but the serious problem has already occurred in step 4. My patch synchronizes the add step so that the broker is protected. The individual client will still be a bit confused, and there will still be noise when the second remove comes and JMX can't find the consumer to remove, but the resource and message leaks are taken care of.
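The overwrite in step 4 and the guarded fix can be sketched like this (illustrative only; this is not the actual broker code or patch, and the names are assumptions): an unguarded put() silently replaces the first subscription's tracking state, leaking its prefetched-message locks, whereas checking for an existing entry lets the broker detect the duplicate add.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical sketch of guarding the broker's consumer registry against
// a replayed add with a duplicate ConsumerId.
public class DuplicateConsumerSketch {
    private final Map<String, String> subscriptions = new ConcurrentHashMap<>();

    // Returns false when the ConsumerId is already registered, instead of
    // overwriting (and thereby leaking) the existing subscription state.
    boolean addConsumer(String consumerId, String subscriptionState) {
        return subscriptions.putIfAbsent(consumerId, subscriptionState) == null;
    }

    // Returns false on the "double remove" the report describes, since the
    // first remove already took the entry out.
    boolean removeConsumer(String consumerId) {
        return subscriptions.remove(consumerId) != null;
    }
}
```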

I'll file issues on the other 3 problems if they sound valid to you and aren't already entered.

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators: https://issues.apache.org/activemq/secure/Administrators.jspa
-
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] Resolved: (AMQ-889) Broker 'loses' messages and leaks resources when handling duplicate connections

Posted by "james strachan (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/activemq/browse/AMQ-889?page=all ]

james strachan resolved AMQ-889.
--------------------------------

    Fix Version/s: 4.1
       Resolution: Fixed

Patch applied - many thanks John! I had to make a minor patch to the MBeans to work with this patch (MBeanTest was failing) as we were being naughty and reusing the same consumerId on each createDurableSubscriber() MBean operation.

You're right that 1, 2, 4 are a concern too - any patches in those areas are most welcome :)

For 1) I am thinking that the same IDs should be used (so that a broker is capable of deducing that a new connection is actually for an already existing client/subscription etc.). We want to avoid tearing down and recreating a subscription if possible, as for topics this could lead to message loss.

I do think we need some more logic in the broker so that if it receives a duplicate client, it will first check to see if the old one is dead; it seems quite common to get a duplicate clientID when the client thinks the socket is dead and reconnects before the broker notices that the client is gone. E.g. we should maybe ping the old client and, if that times out, kill the old connection.
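That ping-before-kill policy could look roughly like this. The Connection interface and the decision method are assumptions sketched for illustration, not ActiveMQ's actual broker API:

```java
// Hypothetical sketch: on a duplicate clientID, ping the existing
// connection first; only if the ping times out do we assume the old
// socket is dead and let the new connection take over.
public class DuplicateClientIdSketch {
    interface Connection {
        boolean ping(long timeoutMs); // true if the peer answered in time
        void close();
    }

    // Returns true if the new connection may take over the clientID.
    static boolean handleDuplicate(Connection oldConn, long pingTimeoutMs) {
        if (oldConn.ping(pingTimeoutMs)) {
            return false; // old client is alive: reject the newcomer
        }
        oldConn.close(); // old client is gone: evict it, accept the new one
        return true;
    }
}
```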

For 4) this seems to be a duplicate of AMQ-850, where we should time out inactive consumers.

