
Posted to users@qpid.apache.org by Zhemzhitsky Sergey <Se...@troika.ru> on 2011/08/17 15:14:38 UTC

QPID 0.8 Cluster. Node failure.

Hi there,

I have a two-node cluster which is built from qpid 0.8 and corosync 1.2.3.

From time to time one node of the cluster stops running.
However, there is nothing unusual in the log files of the QPID process.

2011-08-16 09:18:33 warning JournalInactive:TplStore timer woken up 192ms late, overrunning by 192ms [taking 6297ns]
2011-08-16 09:18:33 warning JournalInactive:smx.stdint.finbroker timer woken up 192ms late
2011-08-16 12:12:03 warning JournalInactive:TplStore timer callback overran by 13ms [taking 6107ns]
2011-08-17 12:00:26 warning JournalInactive:TplStore timer callback overran by 3ms [taking 6848ns]
2011-08-17 16:35:13 notice cluster(10.20.3.125:1918 READY) configuration change: 10.20.3.125:1918
2011-08-17 16:35:13 notice cluster(10.20.3.125:1918 READY) Members left: 10.20.3.120:3728
2011-08-17 16:35:13 notice cluster(10.20.3.125:1918 READY)Sole member of cluster, marking store clean.
2011-08-17 16:35:13 notice cluster(10.20.3.125:1918 READY) last broker standing, update queue policies
2011-08-17 16:35:14 notice cluster(10.20.3.125:1918 READY) configuration change:
2011-08-17 16:35:14 notice cluster(10.20.3.125:1918 READY) Members left: 10.20.3.125:1918

At the same time, the corosync log files contain the line "A processor failed, forming new configuration".

Aug 08 11:55:13 corosync [CPG   ] downlist received left_list: 1
Aug 08 11:55:13 corosync [CPG   ] chosen downlist from node r(0) ip(10.20.3.125)
Aug 08 11:55:13 corosync [MAIN  ] Completed service synchronization, ready to provide service.
Aug 17 16:35:12 corosync [TOTEM ] A processor failed, forming new configuration.
Aug 17 16:35:13 corosync [TOTEM ] A processor joined or left the membership and a new membership was formed.
Aug 17 16:35:13 corosync [CPG   ] downlist received left_list: 1
Aug 17 16:35:13 corosync [CPG   ] chosen downlist from node r(0) ip(10.20.3.125)
Aug 17 16:35:13 corosync [MAIN  ] Completed service synchronization, ready to provide service.
Aug 17 16:35:14 corosync [TOTEM ] A processor joined or left the membership and a new membership was formed.
Aug 17 16:35:14 corosync [CPG   ] downlist received left_list: 0
Aug 17 16:35:14 corosync [CPG   ] downlist received left_list: 1
Aug 17 16:35:14 corosync [CPG   ] chosen downlist from node r(0) ip(10.20.3.120)
Aug 17 16:35:14 corosync [MAIN  ] Completed service synchronization, ready to provide service.

As a rule, one node of the QPID cluster becomes unavailable just after "A processor failed, forming new configuration" appears in the corosync log file.

Could you please help me determine why this behavior occurs and how to fix it?


Best Regards,
Sergey Zhemzhitsky
Information Technology Division
Troika Dialog, 4, Romanov lane, Moscow 125009, Russia
Phone. +7 495 2580500 ext. 1246


_______________________________________________________

The information contained in this message may be privileged and confidential and protected from disclosure. If you are not the original intended recipient, you are hereby notified that any review, retransmission, dissemination, or other use of, or taking of any action in reliance upon, this information is prohibited. If you have received this communication in error, please notify the sender immediately by replying to this message and delete it from your computer. Thank you for your cooperation. Troika Dialog, Russia.
If you need assistance please contact our Contact Center  (+7495) 258 0500 or go to www.troika.ru/eng/Contacts/system.wbp

RE: QPID 0.8 Cluster. Node failure.

Posted by Zhemzhitsky Sergey <Se...@troika.ru>.

Hi Alan,

Unfortunately core files were disabled, so I have to enable them and wait for the next failure.

Best Regards,
Sergey Zhemzhitsky

-----Original Message-----
From: Alan Conway [mailto:aconway@redhat.com] 
Sent: Wednesday, August 17, 2011 5:21 PM
To: users@qpid.apache.org
Cc: Zhemzhitsky Sergey
Subject: Re: QPID 0.8 Cluster. Node failure.

On 08/17/2011 09:14 AM, Zhemzhitsky Sergey wrote:
> Hi there,
>
> I have a two-node cluster which is built from qpid 0.8 and corosync 1.2.3.
>
> From time to time one node of the cluster stops running.
> However there are no anything special in the log files of the QPID process.
>
> 2011-08-16 09:18:33 warning JournalInactive:TplStore timer woken up 192ms late, overrunning by 192ms [taking 6297ns]
> 2011-08-16 09:18:33 warning JournalInactive:smx.stdint.finbroker timer woken up 192ms late
> 2011-08-16 12:12:03 warning JournalInactive:TplStore timer callback overran by 13ms [taking 6107ns]
> 2011-08-17 12:00:26 warning JournalInactive:TplStore timer callback overran by 3ms [taking 6848ns]
> 2011-08-17 16:35:13 notice cluster(10.20.3.125:1918 READY) configuration change: 10.20.3.125:1918
> 2011-08-17 16:35:13 notice cluster(10.20.3.125:1918 READY) Members left: 10.20.3.120:3728
> 2011-08-17 16:35:13 notice cluster(10.20.3.125:1918 READY)Sole member of cluster, marking store clean.
> 2011-08-17 16:35:13 notice cluster(10.20.3.125:1918 READY) last broker standing, update queue policies
> 2011-08-17 16:35:14 notice cluster(10.20.3.125:1918 READY) configuration change:
> 2011-08-17 16:35:14 notice cluster(10.20.3.125:1918 READY) Members left: 10.20.3.125:1918
>
> At the same time in the log files of corosync there is a string "A processor failed, forming new configuration"
>
> Aug 08 11:55:13 corosync [CPG   ] downlist received left_list: 1
> Aug 08 11:55:13 corosync [CPG   ] chosen downlist from node r(0) ip(10.20.3.125)
> Aug 08 11:55:13 corosync [MAIN  ] Completed service synchronization, ready to provide service.
> Aug 17 16:35:12 corosync [TOTEM ] A processor failed, forming new configuration.
> Aug 17 16:35:13 corosync [TOTEM ] A processor joined or left the membership and a new membership was formed.
> Aug 17 16:35:13 corosync [CPG   ] downlist received left_list: 1
> Aug 17 16:35:13 corosync [CPG   ] chosen downlist from node r(0) ip(10.20.3.125)
> Aug 17 16:35:13 corosync [MAIN  ] Completed service synchronization, ready to provide service.
> Aug 17 16:35:14 corosync [TOTEM ] A processor joined or left the membership and a new membership was formed.
> Aug 17 16:35:14 corosync [CPG   ] downlist received left_list: 0
> Aug 17 16:35:14 corosync [CPG   ] downlist received left_list: 1
> Aug 17 16:35:14 corosync [CPG   ] chosen downlist from node r(0) ip(10.20.3.120)
> Aug 17 16:35:14 corosync [MAIN  ] Completed service synchronization, ready to provide service.
>
> As a rule one node of the QPID cluster becomes unavailable just after "A processor failed, forming new configuration" occurs in the log file of corosync.

I suspect the qpid broker crashes just *before* the CPG message. The CPG message is informing you that the process failed.

Does the failed process leave a core file? Check that you are allowing core files: that ulimit -c says "unlimited" in the context that is starting the broker. You can verify that you are allowing core files by doing "kill -abrt" on a broker, that should force a core file.

---------------------------------------------------------------------
Apache Qpid - AMQP Messaging Implementation
Project:      http://qpid.apache.org
Use/Interact: mailto:users-subscribe@qpid.apache.org



Re: QPID 0.8 Cluster. Node failure.

Posted by Alan Conway <ac...@redhat.com>.

On 08/17/2011 09:14 AM, Zhemzhitsky Sergey wrote:
> Hi there,
>
> I have a two-node cluster which is built from qpid 0.8 and corosync 1.2.3.
>
> From time to time one node of the cluster stops running.
> However there are no anything special in the log files of the QPID process.
>
> 2011-08-16 09:18:33 warning JournalInactive:TplStore timer woken up 192ms late, overrunning by 192ms [taking 6297ns]
> 2011-08-16 09:18:33 warning JournalInactive:smx.stdint.finbroker timer woken up 192ms late
> 2011-08-16 12:12:03 warning JournalInactive:TplStore timer callback overran by 13ms [taking 6107ns]
> 2011-08-17 12:00:26 warning JournalInactive:TplStore timer callback overran by 3ms [taking 6848ns]
> 2011-08-17 16:35:13 notice cluster(10.20.3.125:1918 READY) configuration change: 10.20.3.125:1918
> 2011-08-17 16:35:13 notice cluster(10.20.3.125:1918 READY) Members left: 10.20.3.120:3728
> 2011-08-17 16:35:13 notice cluster(10.20.3.125:1918 READY)Sole member of cluster, marking store clean.
> 2011-08-17 16:35:13 notice cluster(10.20.3.125:1918 READY) last broker standing, update queue policies
> 2011-08-17 16:35:14 notice cluster(10.20.3.125:1918 READY) configuration change:
> 2011-08-17 16:35:14 notice cluster(10.20.3.125:1918 READY) Members left: 10.20.3.125:1918
>
> At the same time in the log files of corosync there is a string "A processor failed, forming new configuration"
>
> Aug 08 11:55:13 corosync [CPG   ] downlist received left_list: 1
> Aug 08 11:55:13 corosync [CPG   ] chosen downlist from node r(0) ip(10.20.3.125)
> Aug 08 11:55:13 corosync [MAIN  ] Completed service synchronization, ready to provide service.
> Aug 17 16:35:12 corosync [TOTEM ] A processor failed, forming new configuration.
> Aug 17 16:35:13 corosync [TOTEM ] A processor joined or left the membership and a new membership was formed.
> Aug 17 16:35:13 corosync [CPG   ] downlist received left_list: 1
> Aug 17 16:35:13 corosync [CPG   ] chosen downlist from node r(0) ip(10.20.3.125)
> Aug 17 16:35:13 corosync [MAIN  ] Completed service synchronization, ready to provide service.
> Aug 17 16:35:14 corosync [TOTEM ] A processor joined or left the membership and a new membership was formed.
> Aug 17 16:35:14 corosync [CPG   ] downlist received left_list: 0
> Aug 17 16:35:14 corosync [CPG   ] downlist received left_list: 1
> Aug 17 16:35:14 corosync [CPG   ] chosen downlist from node r(0) ip(10.20.3.120)
> Aug 17 16:35:14 corosync [MAIN  ] Completed service synchronization, ready to provide service.
>
> As a rule one node of the QPID cluster becomes unavailable just after "A processor failed, forming new configuration" occurs in the log file of corosync.

I suspect the qpid broker crashes just *before* the CPG message. The CPG message 
is informing you that the process failed.
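
One way to check that ordering is to merge the two logs into a single timeline. The following is only a minimal sketch, not part of qpid or corosync: the function names are hypothetical, it assumes the two timestamp layouts seen in the excerpts above, and it assumes the caller knows the year, since the corosync lines do not carry one.

```python
from datetime import datetime

def parse_qpid(line):
    """Parse a qpidd log line such as '2011-08-17 16:35:13 notice ...'."""
    stamp = datetime.strptime(line[:19], "%Y-%m-%d %H:%M:%S")
    return stamp, "qpidd", line[20:].strip()

def parse_corosync(line, year):
    """Parse a corosync log line such as 'Aug 17 16:35:12 corosync [TOTEM ] ...'.

    The corosync timestamp has no year, so the caller supplies one (assumption).
    """
    stamp = datetime.strptime(f"{year} {line[:15]}", "%Y %b %d %H:%M:%S")
    return stamp, "corosync", line[16:].strip()

def merge_timeline(qpid_lines, corosync_lines, year):
    """Return (timestamp, source, message) tuples from both logs, oldest first."""
    events = [parse_qpid(l) for l in qpid_lines if l.strip()]
    events += [parse_corosync(l, year) for l in corosync_lines if l.strip()]
    # sorted() is stable, so events within the same second keep their input order.
    return sorted(events, key=lambda event: event[0])

# Two lines taken from the excerpts quoted above.
qpid_log = [
    "2011-08-17 16:35:13 notice cluster(10.20.3.125:1918 READY) Members left: 10.20.3.120:3728",
]
corosync_log = [
    "Aug 17 16:35:12 corosync [TOTEM ] A processor failed, forming new configuration.",
]
for stamp, source, message in merge_timeline(qpid_log, corosync_log, 2011):
    print(stamp, source, message)
```

Both logs only have one-second resolution, and lines from different hosts are comparable only if the clocks are synchronized, so a merged timeline can suggest the order of events but cannot prove it.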

Does the failed process leave a core file? Check that core files are allowed: 
ulimit -c should say "unlimited" in the context that starts the broker. You can 
verify this by running "kill -abrt" on a broker, which should force a core file.
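
The same check can be made against a broker that is already running. The sketch below is an illustration under assumptions, not a qpid tool: the helper names are hypothetical, and it relies on /proc/<pid>/limits, which exists only on Linux kernels recent enough to provide it. It reads the soft core-file limit of the running process, which is the value ulimit -c would have reported in the context that started it.

```python
import os
import signal

def core_soft_limit(limits_text):
    """Return the soft 'Max core file size' value from /proc/<pid>/limits text.

    The result is a string such as '0' or 'unlimited', or None if the row is absent.
    """
    label = "Max core file size"
    for line in limits_text.splitlines():
        if line.startswith(label):
            fields = line[len(label):].split()
            return fields[0] if fields else None
    return None

def broker_core_limit(pid):
    """Read the soft core-file limit of a running process (Linux only)."""
    with open(f"/proc/{pid}/limits") as limits_file:
        return core_soft_limit(limits_file.read())

def force_core(pid):
    """Send SIGABRT, as 'kill -abrt' does. This terminates the process."""
    os.kill(pid, signal.SIGABRT)
```

A result of "0" from broker_core_limit(pid) means no core file would be written if that broker crashed. force_core(pid) stops the broker, so it is only appropriate on a node that can be taken out of service.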
