Posted to users@qpid.apache.org by stoyan <st...@hotmail.com> on 2011/05/18 12:57:13 UTC

cannot restart failed cluster node

hello
i seem to have had a network failure in my cluster of two nodes - main node
A lived on, while on node B qpid quit.
now, there are two queues (Q1, Q2 with same routing key) and after this
incident broker A kept receiving messages to these queues. 
after some time i tried to restart node B and couldn't - first i tried with
its data-dir untouched, then i removed the data dir contents altogether. 
judging by the qpid logs, the B broker joined the cluster and started
receiving state updates; it read all the messages for queue Q1 and then died
when reading the first message for Q2, the last log message is
'qpid.cluster-update: recv cmd 28: content (267 bytes) <?xml version="1.0"
encoding="ut...'

i managed to start B only when i 'drain'ed the contents of Q2

any hints of what i might be doing wrong when starting up the failed node?

thanks!


stoyan


btw: on node A corosync-cpgtool wrongly showed A and B as still being in a
cluster the whole time, while on B it correctly showed A as the lone node in
the cluster, but that's a different matter

c++ qpid 0.8
corosync 1.3.1
rhel5

the initial network error indicator in corosync.log was
corosync[8458]:   [TOTEM ] A processor failed, forming new configuration
later followed by
qpidd[8474]: 2011-05-17 21:44:32 critical Multicast error: Cannot mcast to CPG group QpidCluster: not exist (12)
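
For anyone comparing logs from both nodes, here is a minimal sketch of a scanner that flags the two indicators quoted above. It assumes each log entry sits on a single line in the actual file. The function name scan_log and the labels are made up for illustration and are not part of qpid or corosync.

```python
import re

# Patterns are taken from the two log lines quoted in this message.
# The label names are hypothetical, chosen only for this sketch.
PATTERNS = {
    "totem_reconfig": re.compile(
        r"\[TOTEM\s*\]\s+A processor failed, forming new configuration"
    ),
    "cpg_mcast_error": re.compile(
        r"critical Multicast error: Cannot mcast to\s+CPG group (\S+):"
    ),
}


def scan_log(lines):
    """Return (line_number, label) for every line matching a known pattern."""
    hits = []
    for number, line in enumerate(lines, start=1):
        for label, pattern in PATTERNS.items():
            if pattern.search(line):
                hits.append((number, label))
    return hits
```

Running it over the log of each node (for example scan_log(open(path)), with path pointing at wherever the logs live) gives the line numbers where the membership change and the multicast failure were reported, which makes it easier to line up the timestamps on A and B.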



---------------------------------------------------------------------
Apache Qpid - AMQP Messaging Implementation
Project:      http://qpid.apache.org
Use/Interact: mailto:users-subscribe@qpid.apache.org

Re: cannot restart failed cluster node

Posted by Alan Conway <ac...@redhat.com>.

On 05/18/2011 06:57 AM, stoyan wrote:
> hello
> i seem to have had a network failure in my cluster of two nodes - main node
> A lived on, while on node B qpid quit.
> now, there are two queues (Q1, Q2 with same routing key) and after this
> incident broker A kept receiving messages to these queues.
> after some time i tried to restart node B and couldn't - first i tried with
> its data-dir untouched, then i removed the data dir contents altogether.
> judging by the qpid logs, the B broker joined the cluster and started
> receiving state updates; it read all the messages for queue Q1 and then died
> when reading the first message for Q2, the last log message is
> 'qpid.cluster-update: recv cmd 28: content (267 bytes) <?xml version="1.0"
> encoding="ut...'
>
> i managed to start B only when i 'drain'ed the contents of Q2
>
> any hints of what i might be doing wrong when starting up the failed node?

Were there any core files generated? Send me the logs (of both nodes) and I'll 
take a look.
