Posted to users@qpid.apache.org by Ilya Haykinson <ha...@gmail.com> on 2011/10/06 07:43:30 UTC

Corosync crashes after a few minutes

I've managed to get qpid's c++ broker (0.12, built from trunk) working in a
clustered configuration on two machines, and it's easy to illustrate that
the cluster is functional (i.e. failover is working, with redelivery etc).

However, invariably, after some minutes (rarely a consistent number, but
usually 2 to 10) one of the brokers leaves the cluster due to a corosync
crash. The only indication of what went wrong is in the corosync log, and
it's the line:

[TOTEM ] totemsrp.c:3481 FAILED TO RECEIVE

Has anyone experienced this issue? If there were no communication between the
brokers at all I'd suspect the network, but they do connect and work at first.

I've tried building corosync 1.4.1 on CentOS 5.6 and using 1.2.0 that comes
with Ubuntu 10.04, to no avail.

-ilya

Re: Corosync crashes after a few minutes

Posted by Alan Conway <ac...@redhat.com>.
On 11/12/2011 02:31 AM, Ilya Haykinson wrote:
> Managed to solve the problem by getting multicast enabled on the router.
>
> What tripped me up was that the brokers would connect (making it seem like
> multicast was working), only to lose connection. I would have imagined that
> with multicast disabled on the router there wouldn't be any way for brokers
> to discover each other. Do they fall back to using broadcast then?

No, clustered brokers need multicast to talk to each other. There is a TCP
connection to new brokers joining to give them an initial 'snapshot' of the
state, but even that needs multicast to find out the IP address to connect to.

---------------------------------------------------------------------
Apache Qpid - AMQP Messaging Implementation
Project:      http://qpid.apache.org
Use/Interact: mailto:users-subscribe@qpid.apache.org


RE: Corosync crashes after a few minutes

Posted by Gordon Irving <go...@sophos.com>.
If the router does not speak multicast or have IGMP snooping enabled, any multicast traffic it receives is commonly treated as broadcast, as a fail-open strategy. It may depend on the switch, though; every switch I have encountered, if incorrectly configured, floods multicast over the entire network.

It is possible to configure totem (the corosync protocol) to use unicast UDP traffic instead of multicast, since multicast can be hard to get right in many environments (a large enterprise with switch equipment from many vendors and of varying ages, as opposed to, say, a lab running off a single managed switch).
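
As a rough illustration of the unicast option, the sketch below renders a totem section that uses `transport: udpu` with an explicit member list. It assumes a corosync release with UDP unicast support (1.3 or later, as far as I know, so probably not the 1.2.0 packaged with Ubuntu 10.04). The function name `render_udpu_totem` and the addresses in the usage note are made up for the example, and the output should be checked against the corosync.conf man page for your version before use.

```python
import ipaddress


def render_udpu_totem(bindnetaddr, members, mcastport=5405):
    """Render a totem section using UDP unicast (hypothetical helper).

    bindnetaddr: network address corosync should bind to.
    members: IP addresses of every node in the cluster.
    mcastport: UDP port; the key keeps this name even for unicast.
    """
    ipaddress.ip_address(bindnetaddr)  # raises ValueError if malformed
    lines = [
        "totem {",
        "    version: 2",
        "    transport: udpu",
        "    interface {",
        "        ringnumber: 0",
        f"        bindnetaddr: {bindnetaddr}",
        f"        mcastport: {mcastport}",
    ]
    for addr in members:
        ipaddress.ip_address(addr)  # raises ValueError if malformed
        lines += [
            "        member {",
            f"            memberaddr: {addr}",
            "        }",
        ]
    lines += ["    }", "}"]
    return "\n".join(lines) + "\n"
```

For a two-node cluster like the one in this thread, `print(render_udpu_totem("192.168.1.0", ["192.168.1.10", "192.168.1.11"]))` prints a fragment that could be merged into corosync.conf on both machines.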

HTH
Gordon

> -----Original Message-----
> From: Alan Conway [mailto:aconway@redhat.com]
> Sent: 21 November 2011 07:06
> To: users@qpid.apache.org
> Cc: Ilya Haykinson
> Subject: Re: Corosync crashes after a few minutes
>
> On 11/12/2011 02:31 AM, Ilya Haykinson wrote:
> > Managed to solve the problem by getting multicast enabled on the router.
> >
> > What tripped me up was that the brokers would connect (making it seem like
> > multicast was working), only to lose connection. I would have imagined that
> > with multicast disabled on the router there wouldn't be any way for brokers
> > to discover each other. Do they fall back to using broadcast then?
>
> No, clustered brokers need multicast to talk to each other. There is a TCP
> connection to new brokers joining to give them an initial 'snapshot' of the
> state, but even that needs multicast to find out the IP address to connect to.


Sophos Limited, The Pentagon, Abingdon Science Park, Abingdon, OX14 3YP, United Kingdom.
Company Reg No 2096520. VAT Reg No GB 991 2418 08.



Re: Corosync crashes after a few minutes

Posted by Ilya Haykinson <ha...@gmail.com>.
Managed to solve the problem by getting multicast enabled on the router.

What tripped me up was that the brokers would connect (making it seem like
multicast was working), only to lose connection. I would have imagined that
with multicast disabled on the router there wouldn't be any way for brokers
to discover each other. Do they fall back to using broadcast then?

-ilya



On Thu, Nov 10, 2011 at 4:26 AM, serbaut <se...@gmail.com> wrote:

> Have you checked that you have a router on your corosync subnet? Otherwise
> the group membership will not be renewed and multicast will stop working
> within a couple of minutes.
>
> --
> View this message in context:
> http://apache-qpid-users.2158936.n2.nabble.com/Corosync-crashes-after-a-few-minutes-tp6864929p6981447.html
> Sent from the Apache Qpid users mailing list archive at Nabble.com.

Re: Corosync crashes after a few minutes

Posted by serbaut <se...@gmail.com>.
Have you checked that you have a router on your corosync subnet? Otherwise the
group membership will not be renewed and multicast will stop working within
a couple of minutes.
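
A failure that only shows up after a few minutes can be checked independently of corosync and qpid. The following is a minimal sketch of such a probe, assuming plain IPv4 multicast on the local subnet: one host sends a numbered datagram every second, the other joins the group and reports any stretch where nothing arrived. The group `239.192.0.99`, port `45405` and all function names are placeholders chosen for this example, not values used by corosync.

```python
import socket
import struct
import time

GROUP = "239.192.0.99"  # hypothetical test group, not corosync's mcastaddr
PORT = 45405            # hypothetical test port


def membership_request(group, iface="0.0.0.0"):
    """Pack the ip_mreq structure passed to IP_ADD_MEMBERSHIP."""
    return struct.pack("4s4s", socket.inet_aton(group), socket.inet_aton(iface))


def find_gaps(arrivals, start, end, interval=1.0, tolerance=3.0):
    """Return (from, to) pairs where no heartbeat arrived for longer than
    interval * tolerance seconds, including silence before `end`."""
    limit = interval * tolerance
    points = [start] + list(arrivals) + [end]
    return [(a, b) for a, b in zip(points, points[1:]) if b - a > limit]


def send_heartbeats(count, group=GROUP, port=PORT, interval=1.0):
    """Run on one host: send one numbered datagram per interval."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM, socket.IPPROTO_UDP)
    # TTL 1 keeps the test traffic on the local subnet.
    sock.setsockopt(socket.IPPROTO_IP, socket.IP_MULTICAST_TTL, 1)
    for seq in range(count):
        sock.sendto(str(seq).encode(), (group, port))
        time.sleep(interval)
    sock.close()


def receive_heartbeats(duration, group=GROUP, port=PORT, interval=1.0):
    """Run on the other host: join the group, listen for `duration`
    seconds and return the gaps found in the arrival times."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM, socket.IPPROTO_UDP)
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    sock.bind(("", port))
    sock.setsockopt(socket.IPPROTO_IP, socket.IP_ADD_MEMBERSHIP,
                    membership_request(group))
    sock.settimeout(interval)
    start = time.monotonic()
    arrivals = []
    while time.monotonic() - start < duration:
        try:
            sock.recvfrom(64)
            arrivals.append(time.monotonic())
        except socket.timeout:
            pass  # nothing arrived in this interval; keep listening
    end = time.monotonic()
    sock.close()
    return find_gaps(arrivals, start, end, interval)
```

Running `send_heartbeats(900)` on one broker host and `receive_heartbeats(900)` on the other covers fifteen minutes, longer than the 2 to 10 minutes reported above. If the returned list contains a gap that begins a few minutes in and never recovers, that would be consistent with the group membership expiring on the switch.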



RE: Corosync crashes after a few minutes

Posted by Gordon Irving <go...@sophos.com>.
> -----Original Message-----
> From: Alan Conway [mailto:aconway@redhat.com]
> Sent: 06 October 2011 06:01
> To: users@qpid.apache.org
> Subject: Re: Corosync crashes after a few minutes
>
> On 10/06/2011 01:43 AM, Ilya Haykinson wrote:
> > I've managed to get qpid's c++ broker (0.12, built from trunk) working in a
> > clustered configuration on two machines, and it's easy to illustrate that
> > the cluster is functional (i.e. failover is working, with redelivery etc).
> >
> > However, invariably, after some minutes (rarely a consistent number, but
> > usually 2 to 10) one of the brokers leaves the cluster due to a corosync
> > crash. The only indication of what went wrong is in the corosync log, and
> > it's the line:
> >
> > [TOTEM ] totemsrp.c:3481 FAILED TO RECEIVE
> >
> I've never seen that. Perhaps you might get something from the mailing list at:
> http://www.corosync.org/doku.php?id=support
>

I had issues with corosync/totem that came down to using the default multicast address and ports in a lab environment. Other people had set up clusters there, also using the defaults, and the resulting cross-talk would crash corosync.
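
To avoid that kind of cross-talk, each cluster on a shared network needs its own `mcastaddr` and `mcastport` in the totem interface section. The sketch below is one hypothetical way to pick them: it hashes a cluster name into the 239.192.0.0/16 block (part of the organization-local multicast scope) and an even offset from a base port. Ports are spaced two apart because the corosync.conf man page notes that corosync also uses `mcastport - 1`. The function names, the base port 5500 and the cluster name are assumptions for the example, and a hash lowers the chance of a clash without ruling it out.

```python
import hashlib
import ipaddress


def cluster_multicast_settings(cluster_name, base_port=5500):
    """Derive a (mcastaddr, mcastport) pair from a cluster name.

    Hypothetical helper: the same name always yields the same pair, so
    every node of one cluster agrees without coordination.
    """
    digest = hashlib.sha256(cluster_name.encode("utf-8")).digest()
    # Two digest bytes select the last two octets inside 239.192.0.0/16.
    offset = (digest[0] << 8) | digest[1]
    addr = ipaddress.IPv4Address("239.192.0.0") + offset
    # A third byte selects the port, in steps of two.
    port = base_port + 2 * digest[2]
    return str(addr), port


def format_interface_lines(cluster_name):
    """Format the two settings as corosync.conf interface lines."""
    addr, port = cluster_multicast_settings(cluster_name)
    return f"mcastaddr: {addr}\nmcastport: {port}\n"
```

Calling `print(format_interface_lines("qpid-lab-a"))` on every node of one cluster gives identical lines to paste into the `interface` block, while a second cluster with a different name will most likely land on a different address and port.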

Cheers
Gordon



Re: Corosync crashes after a few minutes

Posted by Alan Conway <ac...@redhat.com>.
On 10/06/2011 01:43 AM, Ilya Haykinson wrote:
> I've managed to get qpid's c++ broker (0.12, built from trunk) working in a
> clustered configuration on two machines, and it's easy to illustrate that
> the cluster is functional (i.e. failover is working, with redelivery etc).
>
> However, invariably, after some minutes (rarely a consistent number, but
> usually 2 to 10) one of the brokers leaves the cluster due to a corosync
> crash. The only indication of what went wrong is in the corosync log, and
> it's the line:
>
> [TOTEM ] totemsrp.c:3481 FAILED TO RECEIVE
>
I've never seen that. Perhaps you might get something from the mailing list at:
http://www.corosync.org/doku.php?id=support
