Posted to dev@qpid.apache.org by "Alan Conway (JIRA)" <ji...@apache.org> on 2013/01/17 17:34:12 UTC

[jira] [Resolved] (QPID-4201) Destination cluster de-sync when federation link used for a longer time

     [ https://issues.apache.org/jira/browse/QPID-4201?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Alan Conway resolved QPID-4201.
-------------------------------

       Resolution: Won't Fix
    Fix Version/s:     (was: 0.19)
                   0.20

This issue affects the old cluster which is no longer part of Qpid for the 0.20 release.
                
> Destination cluster de-sync when federation link used for a longer time
> -----------------------------------------------------------------------
>
>                 Key: QPID-4201
>                 URL: https://issues.apache.org/jira/browse/QPID-4201
>             Project: Qpid
>          Issue Type: Bug
>          Components: C++ Clustering
>    Affects Versions: 0.18
>            Reporter: Alan Conway
>            Assignee: Alan Conway
>             Fix For: 0.20
>
>
> (see also  https://bugzilla.redhat.com/show_bug.cgi?id=836141)
> Description of problem:
> Using queue state replication from a broker (possibly clustered - this does not matter) to a cluster of brokers causes a cluster de-sync after a long time:
> 2012-06-28 08:28:30 critical Error delivering frames: local error did not occur on all cluster members : invalid-argument: @QPID.77153a41-7531-47f6-bf55-b30ffed69922: confirmed < (4799+0) but only sent < (4797+0) (qpid/SessionState.cpp:154) (qpid/cluster/ErrorCheck.cpp:89)
> Version-Release number of selected component (if applicable):
> every version checked
> How reproducible:
> Depends on elapsed time, but 10% for the default scenario
> Steps to Reproduce:
> (Ideally, if possible, rebuild qpid after changing cpp/src/qpid/SessionState.cpp to: static const uint32_t SPONTANEOUS_REQUEST_INTERVAL = 64. This speeds up the reproducer very significantly.)
> 1) Have a source broker (or cluster, this does not matter) and dest.cluster, with queue state replication of just one queue from the source to dest.cluster.
> 2) On the federation route, set --ack to some low number (to speed up replication; I used --ack 5).
> 3) Randomly produce and consume messages on the src.broker queue that is replicated. Ideally, alternate enqueues and dequeues as much as possible. I don't know why, but more alternation speeds up the reproducer as well.
> 4) Now, be patient. After SPONTANEOUS_REQUEST_INTERVAL (by default 64k) synchronization messages have been sent _from_ the backup cluster (which requires around 100 times more messages to be enqueued and dequeued on the replicated queue), there is a probability of hitting the bug. Once it was hit on the first attempt (after 2^16 = 64k such synchronization messages), and once after 720896 messages (in the 11th "round" / "trial").
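
The arithmetic behind the "rounds" in step 4 can be checked with a short sketch. It assumes that "64k" means 2^16 = 65536 and that a trial occurs at each multiple of the interval, which is how the report reads but is not confirmed against the qpid source. The function and constant names are hypothetical and are not part of qpid.

```cpp
#include <cassert>
#include <cstdint>

// Default interval as described in the report ("by default 64k"), taken as 2^16.
const uint32_t kDefaultInterval = 1u << 16;
// Reduced value suggested in the report to speed up the reproducer.
const uint32_t kReducedInterval = 64;

// Number of whole intervals ("rounds") completed after `messages`
// synchronization messages. Hypothetical helper, not a qpid function.
uint64_t roundsCompleted(uint64_t messages, uint32_t interval) {
    assert(interval != 0);
    return messages / interval;
}

// How many times sooner each round is reached with the reduced interval,
// assuming the message rate stays the same.
uint32_t roundSpeedup() {
    return kDefaultInterval / kReducedInterval;
}
```

Under these assumptions, 720896 messages correspond to exactly 11 rounds of 65536, matching the "11th round" in the report, and an interval of 64 reaches each round 1024 times sooner.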
>   
> Actual results:
> All brokers in dst.cluster - except the one that has the fed.link established - shut down with log:
> 2012-06-27 15:39:46 critical Error delivering frames: local error did not occur on all cluster members : invalid-argument: @QPID.314e73e8-8bc3-4f5a-b77d-6bdd4ee17e39: confirmed < (720895+0) but only sent < (720893+0) (qpid/SessionState.cpp:154) (qpid/cluster/ErrorCheck.cpp:89)
> Expected results:
> No such cluster de-sync
> Additional info:
> - Interesting fact: I was able to reproduce it only when using queue state replication. Although the bug is in the federation link session, using fed.link without queue state replication did not lead to the bug.
> - The difference comes from the _beginning_ of the session communication. According to some traces, these AMQP messages sent from dst.cluster to the source are _not_ replayed by (not even multicast to) the "other dst.brokers" (those that have the session / connection as shadow, not local). So these messages are not replayed:
> 2012-06-27 07:12:09 trace @QPID.2d7fe3c3-b0de-4f36-a028-23ffaed6e9a5: sent cmd 0: {MessageSubscribeBody: queue=replication-queue; destination=replication-exchange; accept-mode=0; acquire-mode=0; resume-id resume-ttl=0; arguments={qpid.sync_frequency:F4:int32(100)}; }
> 2012-06-27 07:12:09 trace @QPID.2d7fe3c3-b0de-4f36-a028-23ffaed6e9a5: sent cmd 1: {MessageFlowBody: destination=replication-exchange; unit=0; value=4294967295; }
> 2012-06-27 07:12:09 trace @QPID.2d7fe3c3-b0de-4f36-a028-23ffaed6e9a5: sent cmd 2: {MessageFlowBody: destination=replication-exchange; unit=1; value=4294967295; }
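
The logged error text suggests a check that rejects a confirmation point beyond what the session has recorded as sent. The sketch below illustrates how a shadow session that recorded fewer sent commands could fail that check while the local session passes. It is not qpid's SessionState code: the struct, its members, and the helper are hypothetical, sequence-number wraparound is ignored, and the assumption that the link-owning broker had counted 4799 commands is made only for illustration (4797 and 4799 come from the first log line).

```cpp
#include <cassert>
#include <cstdint>
#include <sstream>
#include <stdexcept>
#include <string>

// Hypothetical stand-in for the sending half of a session.
// Not qpid's SessionState; wraparound of command numbers is ignored.
struct SenderSketch {
    std::string id;
    uint32_t sent = 0;       // commands numbered below this were sent
    uint32_t confirmed = 0;  // commands numbered below this were confirmed

    void send() { ++sent; }

    // The peer confirms every command numbered below `point`.
    void confirm(uint32_t point) {
        if (point > sent) {
            std::ostringstream msg;
            msg << "invalid-argument: " << id << ": confirmed < (" << point
                << "+0) but only sent < (" << sent << "+0)";
            throw std::invalid_argument(msg.str());
        }
        if (point > confirmed) confirmed = point;
    }
};

// True if a member that recorded `sentCount` sends rejects a
// confirmation up to `point`.
bool confirmFails(uint32_t sentCount, uint32_t point) {
    SenderSketch s;
    s.id = "@QPID.example";  // placeholder session name
    for (uint32_t i = 0; i < sentCount; ++i) s.send();
    try {
        s.confirm(point);
    } catch (const std::invalid_argument&) {
        return true;
    }
    return false;
}
```

In this sketch, a member that counted 4799 sent commands accepts a confirmation up to 4799, while a member that counted only 4797 rejects it, so the error occurs on some members but not on all.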
> Comment 1 Pavel Moravec 
