
Posted to users@qpid.apache.org by fijany <ja...@emblacom.com> on 2011/09/19 13:05:05 UTC

Cannot restart a failed cluster node (catch-up connection closed prematurely)

Hello,

I have the following scenario, which causes a failure when trying to restart a
failed cluster node:

1) start a cluster on two nodes N1 and N2 (172.16.133.123/172.16.133.120)
2) start consumer C for queue Q
3) start producer P for queue Q sending one text message / sec
4) confirm with tcpdump that N1 is receiving all of the traffic
5) shut down node N1 with 'qpidd --quit'
6) confirm with tcpdump that N2 is receiving all of the traffic (successful
failover)
7) restart node N1 with 'qpidd'
8) check qpidd.log, which shows the error "catch-up connection closed
prematurely"

Any ideas what's going on?

qpidd.conf:
cluster-mechanism=ANONYMOUS
cluster-name=MYCLUSTER
log-to-file=/home/qpid/qpid.log
daemon=yes
no-data-dir=yes
auth=no


qpidd.log (N1)
2011-09-19 13:58:35 notice Initializing CPG
2011-09-19 13:58:35 notice cluster(172.16.133.123:18918 PRE_INIT)
configuration change: 172.16.133.123:18918 
2011-09-19 13:58:35 notice cluster(172.16.133.123:18918 PRE_INIT) Members
joined: 172.16.133.123:18918 
2011-09-19 13:58:35 notice SASL disabled: No Authentication Performed
2011-09-19 13:58:35 notice Listening on TCP port 5672
2011-09-19 13:58:35 notice cluster(172.16.133.123:18918 INIT) cluster-uuid =
7ab02e1b-67dd-4fed-b176-79b567ab699f
2011-09-19 13:58:35 notice cluster(172.16.133.123:18918 READY) joined
cluster EMS_CLUSTER
2011-09-19 13:58:35 notice Broker running
2011-09-19 13:58:41 notice cluster(172.16.133.123:18918 READY) configuration
change: 172.16.133.120:29504 172.16.133.123:18918 
2011-09-19 13:58:41 notice cluster(172.16.133.123:18918 READY) Members
joined: 172.16.133.120:29504 
2011-09-19 13:58:41 notice cluster(172.16.133.123:18918 UPDATER) sending
update to 172.16.133.120:29504 at amqp:tcp:172.16.133.120:5672
2011-09-19 13:58:41 warning Broker closed connection: 200, OK
2011-09-19 13:58:41 notice cluster(172.16.133.123:18918 UPDATER) update sent
2011-09-19 14:00:40 notice Shut down
2011-09-19 14:00:43 notice Initializing CPG
2011-09-19 14:00:43 notice cluster(172.16.133.123:19037 PRE_INIT)
configuration change: 172.16.133.120:29504 172.16.133.123:19037 
2011-09-19 14:00:43 notice cluster(172.16.133.123:19037 PRE_INIT) Members
joined: 172.16.133.123:19037 
2011-09-19 14:00:43 notice SASL disabled: No Authentication Performed
2011-09-19 14:00:43 notice Listening on TCP port 5672
2011-09-19 14:00:43 notice cluster(172.16.133.123:19037 INIT) cluster-uuid =
7ab02e1b-67dd-4fed-b176-79b567ab699f
2011-09-19 14:00:43 notice cluster(172.16.133.123:19037 JOINER) joining
cluster MYCLUSTER
2011-09-19 14:00:43 notice Broker running
2011-09-19 14:00:43 notice cluster(172.16.133.123:19037 UPDATEE) receiving
update from 172.16.133.120:29504
2011-09-19 14:00:43 error deliveryRecord no update message
(qpid/cluster/Connection.cpp:537)
2011-09-19 14:00:43 critical cluster(172.16.133.123:19037 UPDATEE) catch-up
connection closed prematurely
172.16.133.120:5672-172.16.136.143:53170(172.16.133.123:19037-4
local,catchup)
2011-09-19 14:00:43 notice cluster(172.16.133.123:19037 LEFT) leaving
cluster MYCLUSTER
2011-09-19 14:00:43 notice Shut down
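
For anyone triaging a similar log, the following is a minimal Python sketch (not part of Qpid) that re-joins the wrapped records and lists the cluster state changes and the error/critical records. The helper names (join_records, state_transitions, problems) are hypothetical, and the regular expressions assume only the timestamp and "cluster(<member> <STATE>)" layout visible in the excerpt above.

```python
import re

# A record starts with a timestamp and a log level, e.g.
# "2011-09-19 14:00:43 notice ..."; any other non-empty line is treated
# as a wrapped continuation of the previous record (assumption based on
# the excerpt above, not on Qpid documentation).
RECORD = re.compile(r"^\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2} (\w+) ")
# Matches "cluster(172.16.133.123:19037 UPDATEE)": member id and state.
CLUSTER_TAG = re.compile(r"cluster\((\S+) ([A-Z_]+)\)")

SAMPLE = """\
2011-09-19 14:00:43 notice cluster(172.16.133.123:19037 JOINER) joining
cluster MYCLUSTER
2011-09-19 14:00:43 notice Broker running
2011-09-19 14:00:43 notice cluster(172.16.133.123:19037 UPDATEE) receiving
update from 172.16.133.120:29504
2011-09-19 14:00:43 error deliveryRecord no update message
(qpid/cluster/Connection.cpp:537)
2011-09-19 14:00:43 critical cluster(172.16.133.123:19037 UPDATEE) catch-up
connection closed prematurely
172.16.133.120:5672-172.16.136.143:53170(172.16.133.123:19037-4
local,catchup)
2011-09-19 14:00:43 notice cluster(172.16.133.123:19037 LEFT) leaving
cluster MYCLUSTER
2011-09-19 14:00:43 notice Shut down
"""


def join_records(lines):
    """Re-join log records that were wrapped across several lines."""
    records = []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if RECORD.match(line) or not records:
            records.append(line)
        else:
            records[-1] += " " + line
    return records


def state_transitions(records):
    """Return (member, state) pairs, keeping only changes of state."""
    transitions = []
    for record in records:
        m = CLUSTER_TAG.search(record)
        if m and (not transitions or transitions[-1] != m.groups()):
            transitions.append(m.groups())
    return transitions


def problems(records):
    """Return the records logged at 'error' or 'critical' level."""
    found = []
    for record in records:
        m = RECORD.match(record)
        if m and m.group(1) in ("error", "critical"):
            found.append(record)
    return found


if __name__ == "__main__":
    recs = join_records(SAMPLE.splitlines())
    for member, state in state_transitions(recs):
        print(member, state)
    for record in problems(recs):
        print(record)
```

Run against the restart portion of the log, this shows the rejoining member passing through JOINER and UPDATEE and then LEFT, with the "deliveryRecord no update message" error logged immediately before the critical catch-up failure.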




---------------------------------------------------------------------
Apache Qpid - AMQP Messaging Implementation
Project:      http://qpid.apache.org
Use/Interact: mailto:users-subscribe@qpid.apache.org

Re: Cannot restart a failed cluster node (catch-up connection closed prematurely)

Posted by Jaakko Nyman <ja...@emblacom.com>.

I have filed the following JIRA issue, with Java test code that reproduces the
problem:

https://issues.apache.org/jira/browse/QPID-3495

I tried to be as thorough as possible, but I'll help if there is
anything additional you need.



On 19.09.2011 16:50, Alan Conway wrote:
> On 09/19/2011 07:36 AM, fijany wrote:
>> In addition:
>>
>> - this has been tested on both qpid 0.12 and the mrg 1.0 with the same
>> results
>> - we are using QPid Java JMS client
>> - restart seems to work if no messages are produced/sent during the node
>> downtime
>>
>> Does this mean that all of the producers would need to be stopped 
>> during the
>> time nodes are being restarted on a cluster? This would be quite 
>> impractical
>> for us.
>>
>
> You shouldn't need to stop producers for failover to work. Please 
> create a JIRA for this, with detailed instructions on how to 
> reproduce. You can assign it to me.
>
> Thanks,
> Alan.


-- 
Jaakko Nyman | Software Designer
Mobile: +358407088788


EmblaCom Oy
P.O. Box 169 | FI-00211 | Helsinki | Finland
Visiting address Vattuniemenkuja 5
Main: +358 424 10101

Re: Cannot restart a failed cluster node (catch-up connection closed prematurely)

Posted by Alan Conway <ac...@redhat.com>.

On 09/19/2011 07:36 AM, fijany wrote:
> In addition:
>
> - this has been tested on both qpid 0.12 and the mrg 1.0 with the same
> results
> - we are using QPid Java JMS client
> - restart seems to work if no messages are produced/sent during the node
> downtime
>
> Does this mean that all of the producers would need to be stopped during the
> time nodes are being restarted on a cluster? This would be quite impractical
> for us.
>

You shouldn't need to stop producers for failover to work. Please create a JIRA 
for this, with detailed instructions on how to reproduce. You can assign it to me.

Thanks,
Alan.


Re: Cannot restart a failed cluster node (catch-up connection closed prematurely)

Posted by fijany <ja...@emblacom.com>.

In addition:

- this has been tested on both Qpid 0.12 and MRG 1.0 with the same
results
- we are using the Qpid Java JMS client
- the restart seems to work if no messages are produced/sent during the node
downtime

Does this mean that all of the producers would need to be stopped during the
time nodes are being restarted on a cluster? This would be quite impractical
for us.

