Posted to issues@activemq.apache.org by "Gary Tully (JIRA)" <ji...@apache.org> on 2016/09/14 10:04:21 UTC

[jira] [Comment Edited] (AMQ-6429) lost messages

    [ https://issues.apache.org/jira/browse/AMQ-6429?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15490017#comment-15490017 ] 

Gary Tully edited comment on AMQ-6429 at 9/14/16 10:03 AM:
-----------------------------------------------------------

Most likely there are some duplicate sends in the mix: producers that use failover and had an inflight send when they lost their connection will replay that send on reconnect, and the broker can see the replay as a duplicate.
The first approach may be to use maxReconnectAttempts=0 in the failover URLs so that the application sees these connection failures and can deal with them itself, with new messages that won't be seen as duplicates by the broker. A sketch of such a connection follows.
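For illustration, a minimal Java sketch of a producer that fails fast instead of letting the transport replay the send; the broker host/port, queue name and resend handling are placeholders, not taken from this issue:

{code:java}
import javax.jms.Connection;
import javax.jms.JMSException;
import javax.jms.MessageProducer;
import javax.jms.Session;
import org.apache.activemq.ActiveMQConnectionFactory;

public class FailFastSend {
    public static void main(String[] args) throws JMSException {
        // maxReconnectAttempts=0 (5.6+ semantics): the failover transport
        // gives up immediately, so an inflight send surfaces as a
        // JMSException instead of being replayed on reconnect.
        ActiveMQConnectionFactory factory = new ActiveMQConnectionFactory(
                "failover:(tcp://broker-host:61616)?maxReconnectAttempts=0");
        Connection connection = factory.createConnection();
        connection.start();
        Session session = connection.createSession(false, Session.AUTO_ACKNOWLEDGE);
        MessageProducer producer = session.createProducer(
                session.createQueue("SOME.QUEUE"));
        try {
            producer.send(session.createTextMessage("payload"));
        } catch (JMSException e) {
            // the application decides whether and how to resend;
            // a resend here is a new message, not a transport replay
        }
        connection.close();
    }
}
{code}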

The other possibility is some bug in the sync between the cursor and the store; disabling the cursor cache may avoid that scenario (see the sketch below).
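A minimal sketch of disabling the cursor cache via a broker-side destination policy; the same thing can be expressed in activemq.xml with useCache="false" on a policyEntry:

{code:java}
import org.apache.activemq.broker.BrokerService;
import org.apache.activemq.broker.region.policy.PolicyEntry;
import org.apache.activemq.broker.region.policy.PolicyMap;

public class NoCursorCacheBroker {
    public static void main(String[] args) throws Exception {
        PolicyEntry policy = new PolicyEntry();
        policy.setUseCache(false);  // cursor always pages messages in from the store

        PolicyMap policyMap = new PolicyMap();
        policyMap.setDefaultEntry(policy);  // applies to all destinations

        BrokerService broker = new BrokerService();
        broker.setDestinationPolicy(policyMap);
        broker.start();
    }
}
{code}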

There are message audits on the cursors; if they detect a duplicate, they redirect it to the DLQ rather than silently drop it, so that if the duplicate detection is ever wrong no message is lost. From that perspective the DLQ logging looks ok.
However, with 8 duplicates it may be that the cursor audit needs to be configured with larger limits so that it will suppress more duplicates.
See PolicyEntry#setMaxProducersToAudit (the number of concurrent producers to track, default 64) and PolicyEntry#setMaxAuditDepth (the range of message ids to track per producer, which should cover a transaction batch). Most likely setMaxProducersToAudit needs to be larger for your setup; see the sketch below.
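A minimal sketch of widening the audit on such a destination policy; the values are illustrative only and should be sized to your concurrent producer count and batch size:

{code:java}
import org.apache.activemq.broker.region.policy.PolicyEntry;

public class AuditPolicy {
    public static PolicyEntry widerAudit() {
        PolicyEntry policy = new PolicyEntry();
        policy.setEnableAudit(true);
        // illustrative values: track more concurrent producers than the
        // default of 64, and a deeper window of message ids per producer
        policy.setMaxProducersToAudit(1024);
        policy.setMaxAuditDepth(8192);
        return policy;
    }
}
{code}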

To fully understand what is going on, we need a scenario that reproduces the problem, ideally against the master code base, which contains all of the latest fixes.



> lost messages 
> --------------
>
>                 Key: AMQ-6429
>                 URL: https://issues.apache.org/jira/browse/AMQ-6429
>             Project: ActiveMQ
>          Issue Type: Bug
>    Affects Versions: 5.11.4
>            Reporter: Asbjørn Aarrestad
>
> We have experienced a problem during somewhat high load (>500 000 messages over 30 minutes to multiple queues), where 2 messages were both delivered and DLQ’ed, 8 messages were delivered twice, and 7 messages disappeared (though upon inspection 6 of them are present in the AMQ database, somehow without AMQ noticing).
> We are running an ActiveMQ 5.11.4 broker in a JDBC Master/Slave setup with MS SQL Server as the persistent store, with 2 slaves (i.e. hot standbys).
> There was no master-switching (failover) during the incident.
> We have no indication that there were problems on the MS SQL server at the time.
> These are the only log lines in the ActiveMQ log at the time of the incident:
> 2016-08-29 06:03:54,857 [ActiveMQ BrokerService[svg-amq03_61616] Task-351933] WARN  o.a.a.b.r.c.AbstractStoreCursor - org.apache.activemq.broker.region.cursors.QueueStorePrefetch@24b740dc:stow:AgreementService.private.createOrder,batchResetNeeded=false,size=1,cacheEnabled=false,maxBatchSize:1,hasSpace:true,pendingCachedIds.size:0,lastSyncCachedId:null,lastSyncCachedId-seq:null,lastAsyncCachedId:null,lastAsyncCachedId-seq:null,store=stow:AgreementService.private.createOrder,pendingSize:1 - cursor got duplicate from store ID:svg-agreement01-49217-1471461423933-1:15:1:6980:1 seq: 1233709 
> 2016-08-29 06:03:54,857 [ActiveMQ BrokerService[svg-amq03_61616] Task-351933] WARN  o.a.activemq.broker.region.Queue - duplicate message from store ID:svg-agreement01-49217-1471461423933-1:15:1:6980:1, redirecting for dlq processing 
> 2016-08-29 06:03:54,920 [ActiveMQ BrokerService[svg-amq03_61616] Task-351926] WARN  o.a.a.b.r.c.AbstractStoreCursor - org.apache.activemq.broker.region.cursors.QueueStorePrefetch@24b740dc:stow:AgreementService.private.createOrder,batchResetNeeded=false,size=2,cacheEnabled=false,maxBatchSize:2,hasSpace:true,pendingCachedIds.size:0,lastSyncCachedId:null,lastSyncCachedId-seq:null,lastAsyncCachedId:null,lastAsyncCachedId-seq:null,store=stow:AgreementService.private.createOrder,pendingSize:1 - cursor got duplicate from store ID:svg-agreement01-49217-1471461423933-1:15:1:6981:1 seq: 1233746 
> 2016-08-29 06:03:54,920 [ActiveMQ BrokerService[svg-amq03_61616] Task-351926] WARN  o.a.activemq.broker.region.Queue - duplicate message from store ID:svg-agreement01-49217-1471461423933-1:15:1:6981:1, redirecting for dlq processing 
> The two DLQ’ed messages from the log lines were both delivered correctly, and then also DLQ’ed.
> At the same time we got 8 other messages delivered twice, and 7 messages looked like they were gone. When querying the AMQ database, 6 of the 7 lost messages are present in the database, but not present when querying MBeans for the queues – leaving 1 message without a trace. (We know it was sent, due to log lines from the application).
> All of this happened at the same time (during same second), and all of the problematic messages in question were on the same queue.
> Any idea why this happened, and how to avoid it in the future?


