You are viewing a plain text version of this content. The canonical link for it is here.

Posted to dev@qpid.apache.org by "Alan Conway (JIRA)" <ji...@apache.org> on 2010/12/21 21:07:01 UTC

[jira] Created: (QPID-2982) Inconsistent management messages in a cluster

Inconsistent management messages in a cluster
---------------------------------------------

                 Key: QPID-2982
                 URL: https://issues.apache.org/jira/browse/QPID-2982
             Project: Qpid
          Issue Type: Bug
          Components: C++ Clustering
    Affects Versions: 0.8
            Reporter: Alan Conway
            Assignee: Alan Conway
             Fix For: 0.9


Management messages on different members of a cluster can be different, which can cause a cluster member to fail with an invalid-arg error.

The cluster_tests.py has a test_management test which exercises management code and collects detailed trace logging. The logs are compared for consistency.

Currently this test fails. All inconsistencies revealed by the test need to be corrected.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
Apache Qpid - AMQP Messaging Implementation
Project:      http://qpid.apache.org
Use/Interact: mailto:dev-subscribe@qpid.apache.org

[jira] Commented: (QPID-2982) Inconsistent management messages in a cluster

Posted by "Alan Conway (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/QPID-2982?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12980848#action_12980848 ] 

Alan Conway commented on QPID-2982:
-----------------------------------

I am able to reproduce this, thanks. I'm investigating now.

> Inconsistent management messages in a cluster
> ---------------------------------------------
>
>                 Key: QPID-2982
>                 URL: https://issues.apache.org/jira/browse/QPID-2982
>             Project: Qpid
>          Issue Type: Bug
>          Components: C++ Clustering
>    Affects Versions: 0.8
>            Reporter: Alan Conway
>            Assignee: Alan Conway
>             Fix For: 0.9
>
>
> Management messages on different members of a cluster can be different, which can cause a cluster member to fail with an invalid-arg error.
> The cluster_tests.py has a test_management test which exercises management code and collects detailed trace logging. The logs are compared for consistency.
> Currently this test fails. All inconsistencies revealed by the test need to be corrected.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
Apache Qpid - AMQP Messaging Implementation
Project:      http://qpid.apache.org
Use/Interact: mailto:dev-subscribe@qpid.apache.org

[jira] Commented: (QPID-2982) Inconsistent management messages in a cluster

Posted by "Alan Conway (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/QPID-2982?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12981925#action_12981925 ] 

Alan Conway commented on QPID-2982:
-----------------------------------

The problem is that the link and bridge objects created by route add are not updated when a new member joins the cluster. Thus the first broker has management objects for link and bridge that the second broker does not. This creates inconsistent management messages when a console like qpid-tool is connected.

> Inconsistent management messages in a cluster
> ---------------------------------------------
>
>                 Key: QPID-2982
>                 URL: https://issues.apache.org/jira/browse/QPID-2982
>             Project: Qpid
>          Issue Type: Bug
>          Components: C++ Clustering
>    Affects Versions: 0.8
>            Reporter: Alan Conway
>            Assignee: Alan Conway
>             Fix For: 0.9
>
>
> Management messages on different members of a cluster can be different, which can cause a cluster member to fail with an invalid-arg error.
> The cluster_tests.py has a test_management test which exercises management code and collects detailed trace logging. The logs are compared for consistency.
> Currently this test fails. All inconsistencies revealed by the test need to be corrected.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
Apache Qpid - AMQP Messaging Implementation
Project:      http://qpid.apache.org
Use/Interact: mailto:dev-subscribe@qpid.apache.org

[jira] Updated: (QPID-2982) Inconsistent management messages in a cluster with federation routes

Posted by "Alan Conway (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/QPID-2982?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Alan Conway updated QPID-2982:
------------------------------

    Summary: Inconsistent management messages in a cluster with federation routes  (was: Inconsistent management messages in a cluster)

> Inconsistent management messages in a cluster with federation routes
> --------------------------------------------------------------------
>
>                 Key: QPID-2982
>                 URL: https://issues.apache.org/jira/browse/QPID-2982
>             Project: Qpid
>          Issue Type: Bug
>          Components: C++ Clustering
>    Affects Versions: 0.8
>            Reporter: Alan Conway
>            Assignee: Alan Conway
>             Fix For: 0.9
>
>
> Management messages on different members of a cluster can be different, which can cause a cluster member to fail with an invalid-arg error.
> The cluster_tests.py has a test_management test which exercises management code and collects detailed trace logging. The logs are compared for consistency.
> Currently this test fails. All inconsistencies revealed by the test need to be corrected.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
Apache Qpid - AMQP Messaging Implementation
Project:      http://qpid.apache.org
Use/Interact: mailto:dev-subscribe@qpid.apache.org

[jira] Commented: (QPID-2982) Inconsistent management messages in a cluster

Posted by "Mark Moseley (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/QPID-2982?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12979045#action_12979045 ] 

Mark Moseley commented on QPID-2982:
------------------------------------

Commenting per Alan's email. Same setup as JIRA 2992. I was getting this with Squeeze corosync (1.2.1) and openais (1.1.2); I upgraded them today (to 1.3.0 and 1.1.4, respectively) and still get this error.

As in 2992, there's two clusters, A (with nodes A1 and A2) and B (with B1 and B2).

I'm doing the same setup as in 2992:


On all boxes in cluster A and B:
* Kill the qpidd if it's running and delete all existing store files, i.e. contents of /var/lib/qpid/

On host A1 in cluster A (I'm leaving out the -a user/test@host stuff):
* Start up qpid
* qpid-config add exchange direct bosmyex1 --durable
* qpid-config add exchange direct walmyex1 --durable
* qpid-config add queue walmyq1 --durable
* qpid-config bind walmyex1 walmyq1 unix.waltham.cust

On host B1 in cluster B:
* qpid-config add exchange direct bosmyex1 --durable
* qpid-config add exchange direct walmyex1 --durable
* qpid-config add queue bosmyq1 --durable
* qpid-config bind bosmyex1 bosmyq1 unix.boston.cust

On cluster A:
* Start other member of cluster, A2
* qpid-route route add amqps://user/pass@HOSTA1:5671 amqps://user/pass@HOSTB1:5671 walmyex1 unix.waltham.cust -d

On cluster B:
* Start other member of cluster, B2
* qpid-route route add amqps://user/pass@HOSTB1:5671 amqps://user/pass@HOSTA1:5671 bosmyex1 unix.boston.cust -d

On either cluster:
* Check "qpid-config exchanges --bindings" to make sure bindings are correct for remote exchanges
* To see correct behaviour, stop cluster in the order B1->B2, or A1->A2, start cluster back up, check bindings.
* To see broken behaviour, stop cluster in the order B2->B1, or A2->A1, start cluster back up, check bindings.


If on A1, I run qpid-tool and do something like 'list' enough times that it returns data (instead of the first few times where it returns nothing), on node A2, I get the following very often This is the log on A2 all the way from startup, which is only 10-15 secs earlier:

2011-01-07 19:35:30 info Loaded Module: /usr/lib/qpid/daemon/watchdog.so
2011-01-07 19:35:30 info Loaded Module: /usr/lib/qpid/daemon/acl.so
2011-01-07 19:35:30 info Loaded Module: /usr/lib/qpid/daemon/xml.so
2011-01-07 19:35:30 info Loaded Module: /usr/lib/qpid/daemon/replicating_listener.so
2011-01-07 19:35:30 info Loaded Module: /usr/lib/qpid/daemon/replication_exchange.so
2011-01-07 19:35:30 info Loaded Module: /usr/lib/qpid/daemon/ssl.so
2011-01-07 19:35:30 info Loaded Module: /usr/lib/qpid/daemon/msgstore.so
2011-01-07 19:35:30 info Loaded Module: /usr/lib/qpid/daemon/cluster.so
2011-01-07 19:35:30 info Management enabled
2011-01-07 19:35:30 info Registered xml exchange
2011-01-07 19:35:30 info Registered replication exchange
2011-01-07 19:35:31 notice Journal "TplStore": Created
2011-01-07 19:35:31 notice Store module initialized; store-dir=/var/lib/qpid
2011-01-07 19:35:31 info > Default files per journal: 8
2011-01-07 19:35:31 info > Default journal file size: 24 (wpgs)
2011-01-07 19:35:31 info > Default write cache page size: 32 (Kib)
2011-01-07 19:35:31 info > Default number of write cache pages: 32
2011-01-07 19:35:31 info > TPL files per journal: 8
2011-01-07 19:35:31 info > TPL journal file size: 24 (wpgs)
2011-01-07 19:35:31 info > TPL write cache page size: 4 (Kib)
2011-01-07 19:35:31 info > TPL number of write cache pages: 64
2011-01-07 19:35:31 notice Initializing CPG
2011-01-07 19:35:31 notice Cluster store state: dirty cluster-id=54ef8a97-b8e8-47a2-935d-e619e4d7c895
2011-01-07 19:35:31 notice cluster(102.0.0.0:13904 PRE_INIT) configuration change: 101.0.0.0:10552 102.0.0.0:13904 
2011-01-07 19:35:31 notice cluster(102.0.0.0:13904 PRE_INIT) Members joined: 102.0.0.0:13904 
2011-01-07 19:35:31 notice Cluster recovery: recovered journal data discarded and journal files pushed down
2011-01-07 19:35:31 notice Store directory /var/lib/qpid/rhm was pushed down (saved) into directory /var/lib/qpid/_cluster.bak.0002.
2011-01-07 19:35:31 notice Journal "TplStore": Created
2011-01-07 19:35:31 info SASL enabled
2011-01-07 19:35:31 notice Listening on TCP port 5672
2011-01-07 19:35:31 notice Read ACL file "/etc/qpid/acl.conf"
2011-01-07 19:35:31 notice Listening for SSL connections on TCP port 5671
2011-01-07 19:35:31 info cluster(102.0.0.0:13904 INIT) not active for links.
2011-01-07 19:35:31 notice cluster(102.0.0.0:13904 JOINER) joining cluster walclust
2011-01-07 19:35:31 notice Broker running
2011-01-07 19:35:31 info cluster(102.0.0.0:13904 JOINER) stall for update offer from 101.0.0.0:10552 to 102.0.0.0:13904
2011-01-07 19:35:31 notice cluster(102.0.0.0:13904 UPDATEE) receiving update from 101.0.0.0:10552
2011-01-07 19:35:31 notice Journal "walmyq1": Created
2011-01-07 19:35:31 info Queue "walmyq1": Policy created: type=reject; maxCount=0; maxSize=104857600
2011-01-07 19:35:31 info Queue "qpid.cluster-update": Policy created: type=reject; maxCount=0; maxSize=104857600
2011-01-07 19:35:31 info cluster(102.0.0.0:13904 UPDATEE) new shadow connection 101.0.0.0:10552-6
2011-01-07 19:35:31 info Queue "bridge_queue_1_d0638d51-8b76-48d0-a232-10e8bbed772d": Policy created: type=reject; maxCount=0; maxSize=104857600
2011-01-07 19:35:31 info cluster(102.0.0.0:13904 UPDATEE) new shadow connection 101.0.0.0:10552-7
2011-01-07 19:35:31 info Queue "reply-HOSTA1.10654.1": Policy created: type=reject; maxCount=0; maxSize=104857600
2011-01-07 19:35:31 info Queue "topic-HOSTA1.10654.1": Policy created: type=ring; maxCount=0; maxSize=104857600
2011-01-07 19:35:31 info Queue "qmfc-v2-HOSTA1.10654.1": Policy created: type=reject; maxCount=0; maxSize=104857600
2011-01-07 19:35:31 info Queue "qmfc-v2-ui-HOSTA1.10654.1": Policy created: type=ring; maxCount=0; maxSize=104857600
2011-01-07 19:35:31 info Queue "qmfc-v2-hb-HOSTA1.10654.1": Policy created: type=ring; maxCount=0; maxSize=104857600
2011-01-07 19:35:32 info cluster(102.0.0.0:13904 UPDATEE) new shadow connection 101.0.0.0:10552-9
2011-01-07 19:35:32 info cluster(102.0.0.0:13904 CATCHUP) member update: 101.0.0.0:10552(member) 102.0.0.0:13904(unknown) frameSeq=2427
2011-01-07 19:35:32 notice cluster(102.0.0.0:13904 CATCHUP) update complete, starting catch-up.
2011-01-07 19:35:32 info cluster(102.0.0.0:13904 CATCHUP) member update: 101.0.0.0:10552(member) 102.0.0.0:13904(member) frameSeq=2427
2011-01-07 19:35:32 notice cluster(102.0.0.0:13904 READY) caught up.
2011-01-07 19:35:32 info Passivating links
2011-01-07 19:35:39 info Queue "reply-HOSTA1.10670.1": Policy created: type=reject; maxCount=0; maxSize=104857600
2011-01-07 19:35:40 info Queue "topic-HOSTA1.10670.1": Policy created: type=ring; maxCount=0; maxSize=104857600
2011-01-07 19:35:40 info Queue "qmfc-v2-HOSTA1.10670.1": Policy created: type=reject; maxCount=0; maxSize=104857600
2011-01-07 19:35:40 info Queue "qmfc-v2-ui-HOSTA1.10670.1": Policy created: type=ring; maxCount=0; maxSize=104857600
2011-01-07 19:35:40 info Queue "qmfc-v2-hb-HOSTA1.10670.1": Policy created: type=ring; maxCount=0; maxSize=104857600
2011-01-07 19:35:51 error Execution exception: invalid-argument: mark@QPID.HOSTA1.10670.1: confirmed < (109+0) but only sent < (108+0) (qpid/SessionState.cpp:154)
2011-01-07 19:35:51 critical cluster(102.0.0.0:13904 READY/error) local error 2770 did not occur on member 101.0.0.0:10552: invalid-argument: mark@QPID.HOSTA1.10670.1: confirmed < (109+0) but only sent < (108+0) (qpid/SessionState.cpp:154)
2011-01-07 19:35:51 critical Error delivering frames: local error did not occur on all cluster members : invalid-argument: mark@QPID.HOSTA1.10670.1: confirmed < (109+0) but only sent < (108+0) (qpid/SessionState.cpp:154) (qpid/cluster/ErrorCheck.cpp:89)
2011-01-07 19:35:51 notice cluster(102.0.0.0:13904 LEFT/error) leaving cluster walclust
2011-01-07 19:35:51 notice Shut down


In the logs above:
* 'mark' is the qpid SASL username I'm using. It's definitely valid on both nodes, as I can use it in qpid-config just fine in A1 and A2. I also use it in qpid-tool successfully.
* Those "Policy created" messages show up when I initially did 'list' in qpid-tool on A1.
* When 'list' returned data on A1, A2 shut down.

I don't get the above error 100% of the time. Sometimes it'll work consistently and then other times it won't, and won't for a while (and A2 will shut back down almost immediately after rejoining the cluster). The only times that I can consistently get it to crash is if I replicate the broken scenario in JIRA 2992. After doing a fail of B2->B1, A2 will crash as soon as qpid-tool starts returning data (but not before), even if I restart A1 and A2 repeatedly.

These are test clusters, so there's nothing else going on on them. As well, I can perform whatever testing would be useful to the Qpid team.

> Inconsistent management messages in a cluster
> ---------------------------------------------
>
>                 Key: QPID-2982
>                 URL: https://issues.apache.org/jira/browse/QPID-2982
>             Project: Qpid
>          Issue Type: Bug
>          Components: C++ Clustering
>    Affects Versions: 0.8
>            Reporter: Alan Conway
>            Assignee: Alan Conway
>             Fix For: 0.9
>
>
> Management messages on different members of a cluster can be different, which can cause a cluster member to fail with an invalid-arg error.
> The cluster_tests.py has a test_management test which exercises management code and collects detailed trace logging. The logs are compared for consistency.
> Currently this test fails. All inconsistencies revealed by the test need to be corrected.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
Apache Qpid - AMQP Messaging Implementation
Project:      http://qpid.apache.org
Use/Interact: mailto:dev-subscribe@qpid.apache.org

[jira] Resolved: (QPID-2982) Inconsistent management messages in a cluster with federation routes

Posted by "Alan Conway (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/QPID-2982?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Alan Conway resolved QPID-2982.
-------------------------------

    Resolution: Fixed

Fix on trunk r1060568

> Inconsistent management messages in a cluster with federation routes
> --------------------------------------------------------------------
>
>                 Key: QPID-2982
>                 URL: https://issues.apache.org/jira/browse/QPID-2982
>             Project: Qpid
>          Issue Type: Bug
>          Components: C++ Clustering
>    Affects Versions: 0.8
>            Reporter: Alan Conway
>            Assignee: Alan Conway
>             Fix For: 0.9
>
>
> Management messages on different members of a cluster can be different, which can cause a cluster member to fail with an invalid-arg error.
> The cluster_tests.py has a test_management test which exercises management code and collects detailed trace logging. The logs are compared for consistency.
> Currently this test fails. All inconsistencies revealed by the test need to be corrected.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
Apache Qpid - AMQP Messaging Implementation
Project:      http://qpid.apache.org
Use/Interact: mailto:dev-subscribe@qpid.apache.org