Posted to users@qpid.apache.org by Shan Wang <Sh...@igindex.co.uk> on 2009/11/03 12:13:30 UTC

An ill borker brings down the whole cluster

Hi All,

We have two Qpid 0.5 brokers running in cluster mode on two different boxes. The cluster works fine in normal cases, i.e. if broker1 is shut down cleanly, broker2 keeps serving clients. But today we found that one broker had suddenly stopped responding to all connected clients and admin tools. All producer and consumer clients were still connected, but the consumers failed to consume any messages from the queue. The command-line admin tool failed with a timeout error. The only error message we found was in the log of broker1:

2009-oct-31 10:17:49 error 172.27.34.201:9908(READY/error) channel error 157487219 on 172.27.34.201:9908-389(local): transport-busy: Channel 1 already attached to guest@QPID.amq.failover676a76fa-5664-4e49-9bee-0538532fe261 (qpid/amqp_0_10/SessionHandler.cpp:150) (unresolved: 172.27.34.201:9908 172.27.34.202:13287 )

After restarting only broker1, everything started to work again. So, surprisingly, it seems that when one of the brokers in the cluster suffered a problem, the whole cluster stalled, at least from the consumer's point of view. (I can't be sure whether the producer was working during the downtime; once things were back to normal, the consumer did receive messages sent some time earlier.) The consumer program uses FailoverManager and AsyncSession and is not far from the failover example in the Qpid developer doc. Can anyone please tell me what the above error message means, and whether similar cluster problems have been seen before?

Regards,
Shan

________________________________
The information contained in this email is strictly confidential and for the use of the addressee only, unless otherwise indicated. If you are not the intended recipient, please do not read, copy, use or disclose to others this message or any attachment. Please also notify the sender by replying to this email or by telephone (+44 (0)20 7896 0011) and then delete the email and any copies of it. Opinions, conclusions (etc.) that do not relate to the official business of this company shall be understood as neither given nor endorsed by it. IG Index Ltd is a company registered in England and Wales under number 01190902. VAT registration number 761 2978 07. Registered Office: Friars House, 157-168 Blackfriars Road, London SE1 8EZ. Authorised and regulated by the Financial Services Authority. FSA Register number 114059.

RE: An ill borker brings down the whole cluster

Posted by Shan Wang <Sh...@igindex.co.uk>.

OK, I understand the exclusive queue now, but what do you mean by "acquired messages not accepted"? If I don't use exclusive queues or heartbeats, is my client still at risk of being locked up? If so, under what conditions?

Also, does anyone know what the following error means?

2009-oct-31 10:17:49 error 172.27.34.201:9908(READY/error) channel error 157487219 on 172.27.34.201:9908-389(local): transport-busy: Channel 1 already attached to guest@QPID.amq.failover676a76fa-5664-4e49-9bee-0538532fe261 (qpid/amqp_0_10/SessionHandler.cpp:150) (unresolved: 172.27.34.201:9908 172.27.34.202:13287 )

---------------------------------------------------------------------
Apache Qpid - AMQP Messaging Implementation
Project:      http://qpid.apache.org
Use/Interact: mailto:dev-subscribe@qpid.apache.org

Re: An ill borker brings down the whole cluster

Posted by Carl Trieloff <cc...@redhat.com>.

Shan Wang wrote:
> So even if the heartbeat isn't set, the client can still fail over to another broker after some default timeout?
>
> I don't quite understand how the broker could release resources based on the heartbeat. Isn't the heartbeat used by the client to detect a broker outage? Does the broker also listen for a heartbeat sent by the client?
>   


If a session has an exclusive queue, or acquired messages that have not
been accepted, these remain locked until the connection times out or
2x the heartbeat interval elapses; only then are they released for
other clients.
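
As a rough illustration of the "2x heartbeat" rule, here is a toy C++ model. It is not Qpid code: the function names are hypothetical, the heartbeat interval is assumed to be in seconds, and the sketch only covers the heartbeat-based release, not the separate connection timeout.

```cpp
#include <cassert>
#include <cstdint>

// Toy model, not part of Qpid. Given the heartbeat interval configured on a
// connection (assumed to be in seconds), return how long a silent session's
// locks would be held before a heartbeat-based release: twice the interval.
std::uint32_t heartbeatReleaseWindowSeconds(std::uint16_t heartbeatSeconds) {
    // Widen before multiplying so that 2 * 65535 cannot overflow 16 bits.
    return 2u * static_cast<std::uint32_t>(heartbeatSeconds);
}

// True while exclusive queues and acquired-but-unaccepted messages would
// still be locked for a session that has been silent for `idleSeconds`.
bool locksStillHeld(std::uint32_t idleSeconds, std::uint16_t heartbeatSeconds) {
    // The rule only applies when a heartbeat has actually been configured.
    assert(heartbeatSeconds != 0);
    return idleSeconds < heartbeatReleaseWindowSeconds(heartbeatSeconds);
}
```

With a 5 second heartbeat, for example, this model keeps the locks for the first 10 seconds of silence and releases them after that.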


> A more important question to me is: will two qpid-tool instances conflict if each monitors the same queue name on one broker of the cluster?
>
They will not conflict.

Carl

RE: An ill borker brings down the whole cluster

Posted by Shan Wang <Sh...@igindex.co.uk>.

So even if the heartbeat isn't set, the client can still fail over to another broker after some default timeout?

I don't quite understand how the broker could release resources based on the heartbeat. Isn't the heartbeat used by the client to detect a broker outage? Does the broker also listen for a heartbeat sent by the client?

A more important question to me is: will two qpid-tool instances conflict if each monitors the same queue name on one broker of the cluster?

Re: An ill borker brings down the whole cluster

Posted by Carl Trieloff <cc...@redhat.com>.

The main reason for setting a heartbeat is to fail the client over from
one node of the cluster to another faster, and to have the broker
'release' any resources associated with that session for re-use, for
example 'exclusive' queues.

In connection settings:

    uint16_t heartbeat;

http://www.redhat.com/docs/en-US/Red_Hat_Enterprise_MRG/1.1/html/MRG_Messaging_Qpid_C++_API_Reference/a00037.html
or
http://qpid.apache.org/docs/api/cpp/html/a00226.html
 
Carl.



RE: An ill borker brings down the whole cluster

Posted by Shan Wang <Sh...@igindex.co.uk>.

Hmm... so what happens if I don't set it? The C++ API doc from Red Hat doesn't mention this heartbeat anywhere; I only know about it from the Qpid FAQ. I tested the cluster by cleanly shutting down one broker while producing a lot of messages to a queue; the other broker kept working fine and no messages were lost.

Re: An ill borker brings down the whole cluster

Posted by Gordon Sim <gs...@redhat.com>.

On 11/03/2009 02:43 PM, Shan Wang wrote:
> I haven't set any heartbeat in ConnectionSettings explicitly. The cluster worked most of the time, so I guess the default heartbeat interval is fine.

By default heartbeats are not used for the C++ client. You must set the
interval in ConnectionSettings to turn them on.
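
A minimal sketch of what "setting the interval" amounts to is shown below. To keep it compilable without the Qpid headers, it uses a stand-in struct rather than the real qpid::client::ConnectionSettings. Only the `heartbeat` member name comes from this thread; the struct name, the helper functions, the `host` and `port` members, and the assumption that the value is in seconds with 0 meaning "off" are illustrative.

```cpp
#include <cstdint>
#include <string>

// Stand-in for qpid::client::ConnectionSettings so that this sketch compiles
// without the Qpid headers. Only `heartbeat` is taken from the thread; the
// other members and all defaults are illustrative assumptions.
struct SketchConnectionSettings {
    std::string host;
    std::uint16_t port;
    std::uint16_t heartbeat;  // assumed: interval in seconds, 0 = heartbeats off

    SketchConnectionSettings() : host("localhost"), port(5672), heartbeat(0) {}
};

// Build settings with heartbeats explicitly switched on.
SketchConnectionSettings settingsWithHeartbeat(const std::string& host,
                                               std::uint16_t port,
                                               std::uint16_t heartbeatSeconds) {
    SketchConnectionSettings settings;  // heartbeat starts at 0 (off)
    settings.host = host;
    settings.port = port;
    settings.heartbeat = heartbeatSeconds;
    return settings;
}

// True if a non-zero heartbeat interval has been requested.
bool heartbeatsEnabled(const SketchConnectionSettings& settings) {
    return settings.heartbeat != 0;
}
```

With the real client the equivalent step would presumably be to assign the `heartbeat` member of the ConnectionSettings object before it is used to open the connection.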

RE: An ill borker brings down the whole cluster

Posted by Shan Wang <Sh...@igindex.co.uk>.

Hi Carl,

Does that mean that if there is an unresponsive broker in the cluster, all the other brokers have to wait for it? I was under the impression that if a broker in a cluster is not behaving normally, clients should fail over to another broker automatically.

In my case, I haven't set any heartbeat in ConnectionSettings explicitly. The cluster worked most of the time, so I guess the default heartbeat interval is fine. If the heartbeat works, then the ill broker must have passed the "heartbeat check", i.e. it sent out heartbeats normally and may even have received messages from producer clients normally, but somehow some of its threads were deadlocked so that it couldn't respond to consumer and admin requests. Apart from the plug-in, is there anything else I can do to remove unresponsive brokers?

Unfortunately, the channel error I sent in the original mail is the only error log I have (the log level was set to warning+). One factor may have contributed to the problem: we run qpid-tool on both brokers every 5 minutes to collect stats. Because of the cluster, the two qpid-tool processes are actually monitoring the same queues. Is it possible that the two qpid-tool instances conflicted? Is there any known problem with this?

Re: An ill borker brings down the whole cluster

Posted by Carl Trieloff <cc...@redhat.com>.

I don't have enough info to comment on the root cause; maybe Alan can,
based on the log snippet. However, there is a plug-in module that can be
run on the nodes of a cluster that will remove any stalled node from the
cluster so that the rest of the cluster can continue to operate as
normal.

For example, if you SIGSTOP one broker in a cluster, the rest of the
cluster will continue to run, but AIS will cache for the node that is
stopped. That node must be evicted at some point if it does not get a
SIGCONT after a period of time. The watchdog plugin does this for you,
at which point you can rejoin another node.

I.e. running the watchdog would have removed the unresponsive broker in
your example. The second part is to understand why it was
unresponsive.
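
The eviction idea can be sketched as a toy C++ model. This is not the actual Qpid watchdog plugin: the class name, its interface, and the use of plain caller-supplied seconds are all assumptions made purely for illustration.

```cpp
#include <cassert>

// Toy model of watchdog-style eviction; not the actual Qpid watchdog plugin.
// Times are plain seconds supplied by the caller so the logic is easy to test.
class StallWatchdog {
public:
    StallWatchdog(long timeoutSeconds, long nowSeconds)
        : timeout_(timeoutSeconds), lastKick_(nowSeconds) {
        assert(timeoutSeconds > 0);
    }

    // Called whenever the node shows signs of life.
    void kick(long nowSeconds) { lastKick_ = nowSeconds; }

    // True once the node has been silent for longer than the timeout,
    // i.e. the point at which it should be evicted from the cluster.
    bool shouldEvict(long nowSeconds) const {
        return nowSeconds - lastKick_ > timeout_;
    }

private:
    long timeout_;
    long lastKick_;
};
```

In this model a stopped broker simply stops calling `kick`, so `shouldEvict` eventually becomes true; a broker that resumes in time resets the timer and stays in the cluster.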

Carl.


---------------------------------------------------------------------
Apache Qpid - AMQP Messaging Implementation
Project:      http://qpid.apache.org
Use/Interact: mailto:users-subscribe@qpid.apache.org

Re: An ill borker brings down the whole cluster

Posted by Alan Conway <ac...@redhat.com>.

On 11/04/2009 10:36 AM, Shan Wang wrote:
> Hi Alan,
>
> The whole cluster stopped responding, but qpid-tool was still able to connect to broker2 and not to broker1. Based on that, I suppose it was broker1 that became ill, and restarting broker1 cured the whole cluster.
>
> The full log of broker1 from 31-OCT is attached. We have now turned the log level up to info+ and will apply --log-enable=debug+:cluster later.
>
> Before the hang, many clients were sending messages to the cluster. I don't know the exact number of clients, but it is usually between 150 and 200, and the update rate was about 5-10 MB/minute. The receiver was receiving messages OK but suddenly stopped working. I believe the receiver stopped working before the sender, because after things were back to normal we could see very old messages in the receiver's log, but not the relatively recent messages committed after the problem.
>
> The affected system carries pretty serious tasks, so I can't play with it as I wish, nor did I try the sender/receiver example. But as my latest email said, the problem re-occurred this morning, this time with broker2.
>
>
> The given link could be a similar issue, but the question is: what caused the errors in the cluster?

Sorry for taking so long to get back to you.

I think you're seeing a combination of 2 issues:

https://bugzilla.redhat.com/show_bug.cgi?id=529489 could cause the "already 
attached" error if you have a lot of sessions.

https://bugzilla.redhat.com/show_bug.cgi?id=514487 could cause the cluster to 
hang if you get "already attached" errors simultaneously on 2 different cluster 
members.

Both of these are fixed for the next release.


> -----Original Message-----
> From: Alan Conway [mailto:aconway@redhat.com]
> Sent: 04 November 2009 14:10
> To: dev@qpid.apache.org
> Cc: cctrieloff@redhat.com; users@qpid.apache.org
> Subject: Re: An ill borker brings down the whole cluster
>
> On 11/03/2009 04:41 PM, Shan Wang wrote:
>> Client side we are still using 0.4, I'm not sure about the exact version, should be last version before 0.5.
>> Cluster side we are using 0.5.752581-26.el5.
>>
>> Unfortunately I haven't got the environment to build qpid myself so I can't use latest trunk.
>
> I'd like to try and reproduce your issue, but I need some more details:
>
>>> On 11/03/2009 06:13 AM, Shan Wang wrote:
>>>> Hi All,
>>>>
>>>> We have two qpid 0.5 brokers running in cluster mode on two different
>>>> boxes. The cluster works fine in normal cases, ie, if broker1 is
>>>> shutdown cleanly, broker2 will keep on serving clients. But today we
>>>> found one broker suddenly lost response to all connected clients and
>>>> admin tools. All producer and consumer clients are still connected
>>>> but failed to consume any messages from the queue.
>
> Just to clarify: did only one broker become unresponsive or did both of them
> become unresponsive?
>
>>>> The command line
>>>> admin tool failed with a time out error. The only error message we
>>>> found is in the log of broker 1, which said this:
>>>>
>>>> 2009-oct-31 10:17:49 error 172.27.34.201:9908(READY/error) channel
>>>> error 157487219 on 172.27.34.201:9908-389(local): transport-busy:
>>>> Channel 1 already attached to guest@QPID.amq.failover676a76fa-56
>>>> 64-4e49-9bee-0538532fe261 (qpid/amqp_0_10/SessionHandler.cpp:150)
>>>> (unresolved: 172.27.34.201:9908 172.27.34.202:13287 )
>
> Do you still have the full logs of both brokers at the time they were
> unresponsive? Can you run the broker with
>
>    --log-enable=notify+ --log-enable=debug+:cluster
>
> for future runs so we can hopefully get a bit more information about what the
> cluster is doing at the time of the hang?
>
> What are your clients doing? Can you reproduce the problem using the sender and
> receiver examples?
>
> How many clients are running against each broker?
>
> How easy is it to reproduce the problem?
>
>>>>
>>>> After only restarted broker 1, everything starts to work again. So
>>>> surprisingly it seems when one of the brokers in the cluster suffered
>>>> a problem, the whole cluster just stalled, at least from the
>>>> consumer's point of view ( I can't be sure if the producer was
>>>> working during the down time, after back to normal, consumer did
>>>> receive messages sent sometime ago ). Consumer program uses
>>>> FailoverManager and AsyncSession, basically not far from the failover
>>>> example in the qpid developing doc. So can anyone please tell me what
>>>> the above error message means and have we seen similar problems to
>>>> the cluster before?
>
> Yes I've seen similar problems before, but believe them all to be fixed at this
> point on trunk. It might be the issue fixed by
>
> http://svn.apache.org/viewvc?view=revision&revision=799687
>
> If I can reproduce the problem then I can verify if it is fixed on trunk.
>
> Cheers,
> Alan.
>
> ---------------------------------------------------------------------
> Apache Qpid - AMQP Messaging Implementation
> Project:      http://qpid.apache.org
> Use/Interact: mailto:users-subscribe@qpid.apache.org


---------------------------------------------------------------------
Apache Qpid - AMQP Messaging Implementation
Project:      http://qpid.apache.org
Use/Interact: mailto:users-subscribe@qpid.apache.org

RE: An ill borker brings down the whole cluster

Posted by Shan Wang <Sh...@igindex.co.uk>.

Hi Alan,

The whole cluster stopped responding, but qpid-tool could still connect to broker2 and not to broker1. Based on that, I assume it was broker1 that became ill, and restarting broker1 cured the whole cluster.

The full log of broker1 from 31-OCT is attached. We have now turned the log level up to info+ and will apply --log-enable=debug+:cluster later.

Before the hang there were many clients sending messages to the cluster. I don't know the exact number of clients, but it is usually between 150 and 200, and the update rate was about 5-10 MB/minute. The receiver was receiving messages fine but suddenly stopped working. I believe the receiver stopped working before the sender, because after things returned to normal we could see very old messages in the receiver's log, but not the relatively recent messages committed after the problem.

The affected system carries pretty serious tasks, so I can't play with it as I wish, nor did I try the sender/receiver example. But as my latest email said, the problem re-occurred this morning, this time with broker2.

The given link could be a similar issue, but the question is: what caused the errors in the cluster?

Regards,
Shan

-----Original Message-----
From: Alan Conway [mailto:aconway@redhat.com]
Sent: 04 November 2009 14:10
To: dev@qpid.apache.org
Cc: cctrieloff@redhat.com; users@qpid.apache.org
Subject: Re: An ill borker brings down the whole cluster

On 11/03/2009 04:41 PM, Shan Wang wrote:
> Client side we are still using 0.4, I'm not sure about the exact version, should be last version before 0.5.
> Cluster side we are using 0.5.752581-26.el5.
>
> Unfortunately I haven't got the environment to build qpid myself so I can't use latest trunk.

I'd like to try and reproduce your issue, but I need some more details:

>> On 11/03/2009 06:13 AM, Shan Wang wrote:
>>> Hi All,
>>>
>>> We have two qpid 0.5 brokers running in cluster mode on two different
>>> boxes. The cluster works fine in normal cases, ie, if broker1 is
>>> shutdown cleanly, broker2 will keep on serving clients. But today we
>>> found one broker suddenly lost response to all connected clients and
>>> admin tools. All producer and consumer clients are still connected
>>> but failed to consume any messages from the queue.

Just to clarify: did only one broker become unresponsive or did both of them
become unresponsive?

>>> The command line
>>> admin tool failed with a time out error. The only error message we
>>> found is in the log of broker 1, which said this:
>>>
>>> 2009-oct-31 10:17:49 error 172.27.34.201:9908(READY/error) channel
>>> error 157487219 on 172.27.34.201:9908-389(local): transport-busy:
>>> Channel 1 already attached to guest@QPID.amq.failover676a76fa-56
>>> 64-4e49-9bee-0538532fe261 (qpid/amqp_0_10/SessionHandler.cpp:150)
>>> (unresolved: 172.27.34.201:9908 172.27.34.202:13287 )

Do you still have the full logs of both brokers at the time they were
unresponsive? Can you run the broker with

  --log-enable=notify+ --log-enable=debug+:cluster

for future runs so we can hopefully get a bit more information about what the
cluster is doing at the time of the hang?

What are your clients doing? Can you reproduce the problem using the sender and
receiver examples?

How many clients are running against each broker?

How easy is it to reproduce the problem?

>>>
>>> After only restarted broker 1, everything starts to work again. So
>>> surprisingly it seems when one of the brokers in the cluster suffered
>>> a problem, the whole cluster just stalled, at least from the
>>> consumer's point of view ( I can't be sure if the producer was
>>> working during the down time, after back to normal, consumer did
>>> receive messages sent sometime ago ). Consumer program uses
>>> FailoverManager and AsyncSession, basically not far from the failover
>>> example in the qpid developing doc. So can anyone please tell me what
>>> the above error message means and have we seen similar problems to
>>> the cluster before?

Yes I've seen similar problems before, but believe them all to be fixed at this
point on trunk. It might be the issue fixed by

http://svn.apache.org/viewvc?view=revision&revision=799687

If I can reproduce the problem then I can verify if it is fixed on trunk.

Cheers,
Alan.

---------------------------------------------------------------------
Apache Qpid - AMQP Messaging Implementation
Project:      http://qpid.apache.org
Use/Interact: mailto:users-subscribe@qpid.apache.org

Re: An ill borker brings down the whole cluster

Posted by Alan Conway <ac...@redhat.com>.

On 11/03/2009 04:41 PM, Shan Wang wrote:
> Client side we are still using 0.4, I'm not sure about the exact version, should be last version before 0.5.
> Cluster side we are using 0.5.752581-26.el5.
>
> Unfortunately I haven't got the environment to build qpid myself so I can't use latest trunk.

I'd like to try and reproduce your issue, but I need some more details:

>> On 11/03/2009 06:13 AM, Shan Wang wrote:
>>> Hi All,
>>>
>>> We have two qpid 0.5 brokers running in cluster mode on two different
>>> boxes. The cluster works fine in normal cases, ie, if broker1 is
>>> shutdown cleanly, broker2 will keep on serving clients. But today we
>>> found one broker suddenly lost response to all connected clients and
>>> admin tools. All producer and consumer clients are still connected
>>> but failed to consume any messages from the queue.

Just to clarify: did only one broker become unresponsive or did both of them 
become unresponsive?

>>> The command line
>>> admin tool failed with a time out error. The only error message we
>>> found is in the log of broker 1, which said this:
>>>
>>> 2009-oct-31 10:17:49 error 172.27.34.201:9908(READY/error) channel
>>> error 157487219 on 172.27.34.201:9908-389(local): transport-busy:
>>> Channel 1 already attached to guest@QPID.amq.failover676a76fa-56
>>> 64-4e49-9bee-0538532fe261 (qpid/amqp_0_10/SessionHandler.cpp:150)
>>> (unresolved: 172.27.34.201:9908 172.27.34.202:13287 )

Do you still have the full logs of both brokers at the time they were 
unresponsive? Can you run the broker with

  --log-enable=notify+ --log-enable=debug+:cluster

for future runs so we can hopefully get a bit more information about what the 
cluster is doing at the time of the hang?
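
For sifting through the resulting logs afterwards, a small script along these lines might help. It is only a sketch in Python: the line layout (date, time, level, message) is assumed from the single error line quoted above, and the names LOG_LINE, parse_line and already_attached are made up for illustration; none of them are part of qpid.

```python
import re

# Assumed layout, inferred from the one broker log line quoted in this thread:
#   <date> <time> <level> <message>
LOG_LINE = re.compile(
    r"^(?P<date>\d{4}-[a-z]{3}-\d{2}) (?P<time>\d{2}:\d{2}:\d{2}) "
    r"(?P<level>\w+) (?P<message>.*)$"
)


def parse_line(line):
    """Return a dict with date, time, level and message, or None if no match."""
    match = LOG_LINE.match(line.strip())
    return match.groupdict() if match else None


def already_attached(lines):
    """Yield parsed error records that mention an 'already attached' channel."""
    for line in lines:
        record = parse_line(line)
        if (record and record["level"] == "error"
                and "already attached" in record["message"]):
            yield record


if __name__ == "__main__":
    # Start of the error line from the original report (cut at the line break).
    sample = (
        "2009-oct-31 10:17:49 error 172.27.34.201:9908(READY/error) channel "
        "error 157487219 on 172.27.34.201:9908-389(local): transport-busy: "
        "Channel 1 already attached to guest@QPID.amq.failover676a76fa-56"
    )
    for record in already_attached([sample]):
        print(record["date"], record["time"], record["level"])
```

Feeding it the lines of both brokers' log files would show whether the "already attached" errors line up in time on the two members.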

What are your clients doing? Can you reproduce the problem using the sender and 
receiver examples?

How many clients are running against each broker?

How easy is it to reproduce the problem?

>>>
>>> After only restarted broker 1, everything starts to work again. So
>>> surprisingly it seems when one of the brokers in the cluster suffered
>>> a problem, the whole cluster just stalled, at least from the
>>> consumer's point of view ( I can't be sure if the producer was
>>> working during the down time, after back to normal, consumer did
>>> receive messages sent sometime ago ). Consumer program uses
>>> FailoverManager and AsyncSession, basically not far from the failover
>>> example in the qpid developing doc. So can anyone please tell me what
>>> the above error message means and have we seen similar problems to
>>> the cluster before?

Yes I've seen similar problems before, but believe them all to be fixed at this 
point on trunk. It might be the issue fixed by

http://svn.apache.org/viewvc?view=revision&revision=799687

If I can reproduce the problem then I can verify if it is fixed on trunk.

Cheers,
Alan.

---------------------------------------------------------------------
Apache Qpid - AMQP Messaging Implementation
Project:      http://qpid.apache.org
Use/Interact: mailto:dev-subscribe@qpid.apache.org

RE: An ill borker brings down the whole cluster

Posted by Shan Wang <Sh...@igindex.co.uk>.

I'm not sure whether we hit the maximum limit. I'm inclined to believe not, because we've never seen any complaints about the number of connections in the log. And yes, what you described is the desired behaviour.

-----Original Message-----
From: Alan Conway [mailto:aconway@redhat.com]
Sent: 04 November 2009 14:15
To: dev@qpid.apache.org
Subject: Re: An ill borker brings down the whole cluster

On 11/04/2009 07:49 AM, Shan Wang wrote:
> Another question: if there are more client connections than the default limit of 300, how will the broker react? Will it just reject any new connections? Will anything be logged, and will the existing clients be affected?

It looks like the --max-connections option currently has no effect. I'll raise a
JIRA to fix that. Does this sound like the right behaviour to you:

--max-connections: set the maximum number of client connections. If there are
max-connections clients connected, the broker will reject any new connection
attempts and log a warning. The existing connections are not affected.

---------------------------------------------------------------------
Apache Qpid - AMQP Messaging Implementation
Project:      http://qpid.apache.org
Use/Interact: mailto:dev-subscribe@qpid.apache.org

Re: An ill borker brings down the whole cluster

Posted by Alan Conway <ac...@redhat.com>.

On 11/04/2009 07:49 AM, Shan Wang wrote:
> Another question: if there are more client connections than the default limit of 300, how will the broker react? Will it just reject any new connections? Will anything be logged, and will the existing clients be affected?

It looks like the --max-connections option currently has no effect. I'll raise a 
JIRA to fix that. Does this sound like the right behaviour to you:

--max-connections: set the maximum number of client connections. If there are 
max-connections clients connected, the broker will reject any new connection 
attempts and log a warning. The existing connections are not affected.
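
As a rough illustration of the behaviour proposed above (not the broker's actual code), the admission check could be sketched in Python as follows. The ConnectionLimiter class, its method names, and the logger name are hypothetical; only the limit semantics (reject new connections at the limit, log a warning, leave existing connections alone) come from the description.

```python
import logging

log = logging.getLogger("broker.sketch")  # hypothetical logger name


class ConnectionLimiter:
    """Hypothetical admission check modelling the proposed --max-connections behaviour."""

    def __init__(self, max_connections):
        self.max_connections = max_connections
        self.active = set()  # ids of currently connected clients

    def try_accept(self, conn_id):
        """Return True if the connection is admitted, False if it is rejected."""
        if len(self.active) >= self.max_connections:
            # Limit reached: reject the newcomer and log a warning.
            # Existing connections are left untouched.
            log.warning("rejecting connection %s: limit of %d reached",
                        conn_id, self.max_connections)
            return False
        self.active.add(conn_id)
        return True

    def close(self, conn_id):
        """Forget a connection so that its slot can be reused."""
        self.active.discard(conn_id)
```

With a limit of 2, a third try_accept call returns False while the first two connections stay in the active set; once one of them is closed, a new connection is admitted again.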


RE: An ill borker brings down the whole cluster

Posted by Shan Wang <Sh...@igindex.co.uk>.

The same problem happened again this morning, on a different broker but without any errors logged.

Another question: if there are more client connections than the default limit of 300, how will the broker react? Will it just reject any new connections? Will anything be logged, and will the existing clients be affected?

-----Original Message-----
From: Shan Wang [mailto:Shan.Wang@igindex.co.uk]
Sent: 03 November 2009 21:42
To: dev@qpid.apache.org; cctrieloff@redhat.com
Cc: users@qpid.apache.org
Subject: RE: An ill borker brings down the whole cluster

On the client side we are still using 0.4. I'm not sure of the exact version, but it should be the last version before 0.5.
On the cluster side we are using 0.5.752581-26.el5.

Unfortunately I don't have an environment to build qpid myself, so I can't use the latest trunk.

-----Original Message-----
From: Carl Trieloff [mailto:cctrieloff@redhat.com]
Sent: 03 November 2009 20:16
To: dev@qpid.apache.org
Cc: users@qpid.apache.org
Subject: Re: An ill borker brings down the whole cluster

Alan Conway wrote:
> On 11/03/2009 06:13 AM, Shan Wang wrote:
>> Hi All,
>>
>> We have two qpid 0.5 brokers running in cluster mode on two different
>> boxes. The cluster works fine in normal cases, ie, if broker1 is
>> shutdown cleanly, broker2 will keep on serving clients. But today we
>> found one broker suddenly lost response to all connected clients and
>> admin tools. All producer and consumer clients are still connected
>> but failed to consume any messages from the queue. The command line
>> admin tool failed with a time out error. The only error message we
>> found is in the log of broker 1, which said this:
>>
>> 2009-oct-31 10:17:49 error 172.27.34.201:9908(READY/error) channel
>> error 157487219 on 172.27.34.201:9908-389(local): transport-busy:
>> Channel 1 already attached to guest@QPID.amq.failover676a76fa-56
>> 64-4e49-9bee-0538532fe261 (qpid/amqp_0_10/SessionHandler.cpp:150)
>> (unresolved: 172.27.34.201:9908 172.27.34.202:13287 )
>>
>> After only restarted broker 1, everything starts to work again. So
>> surprisingly it seems when one of the brokers in the cluster suffered
>> a problem, the whole cluster just stalled, at least from the
>> consumer's point of view ( I can't be sure if the producer was
>> working during the down time, after back to normal, consumer did
>> receive messages sent sometime ago ). Consumer program uses
>> FailoverManager and AsyncSession, basically not far from the failover
>> example in the qpid developing doc. So can anyone please tell me what
>> the above error message means and have we seen similar problems to
>> the cluster before?
>>
>
> There have been a number of cluster bugs fixed since 0.5, some of
> which had the symptom of a "transport-busy" exception. Can you try a
> trunk build and see if you have the same problems?

or what distro and version of qpid are you running?

Carl.

Re: An ill borker brings down the whole cluster

Posted by Carl Trieloff <cc...@redhat.com>.

yes.

Shan Wang wrote:
> We need to run the clients on redhat4 machines. Does redhat provide prebuilt qpid client libs for redhat4? The only ones I have in hand are built for redhat5.
>
>
>
>
> -----Original Message-----
> From: Carl Trieloff [mailto:cctrieloff@redhat.com]
> Sent: 04 November 2009 15:23
> To: dev@qpid.apache.org
> Cc: users@qpid.apache.org
> Subject: Re: An ill borker brings down the whole cluster
>
>
> Are you able to use the matching client for the broker -  just to rule
> that out?  i.e. make sure we are not chasing something that is fixed, or
> version mismatch related.
>
> Carl.
>
> Shan Wang wrote:
>   
>> Client side we are still using 0.4, I'm not sure about the exact version, should be last version before 0.5.
>> Cluster side we are using 0.5.752581-26.el5.
>>
>> Unfortunately I haven't got the environment to build qpid myself so I can't use latest trunk.
>>
>> -----Original Message-----
>> From: Carl Trieloff [mailto:cctrieloff@redhat.com]
>> Sent: 03 November 2009 20:16
>> To: dev@qpid.apache.org
>> Cc: users@qpid.apache.org
>> Subject: Re: An ill borker brings down the whole cluster
>>
>> Alan Conway wrote:
>>
>>     
>>> On 11/03/2009 06:13 AM, Shan Wang wrote:
>>>
>>>       
>>>> Hi All,
>>>>
>>>> We have two qpid 0.5 brokers running in cluster mode on two different
>>>> boxes. The cluster works fine in normal cases, ie, if broker1 is
>>>> shutdown cleanly, broker2 will keep on serving clients. But today we
>>>> found one broker suddenly lost response to all connected clients and
>>>> admin tools. All producer and consumer clients are still connected
>>>> but failed to consume any messages from the queue. The command line
>>>> admin tool failed with a time out error. The only error message we
>>>> found is in the log of broker 1, which said this:
>>>>
>>>> 2009-oct-31 10:17:49 error 172.27.34.201:9908(READY/error) channel
>>>> error 157487219 on 172.27.34.201:9908-389(local): transport-busy:
>>>> Channel 1 already attached to guest@QPID.amq.failover676a76fa-56
>>>> 64-4e49-9bee-0538532fe261 (qpid/amqp_0_10/SessionHandler.cpp:150)
>>>> (unresolved: 172.27.34.201:9908 172.27.34.202:13287 )
>>>>
>>>> After only restarted broker 1, everything starts to work again. So
>>>> surprisingly it seems when one of the brokers in the cluster suffered
>>>> a problem, the whole cluster just stalled, at least from the
>>>> consumer's point of view ( I can't be sure if the producer was
>>>> working during the down time, after back to normal, consumer did
>>>> receive messages sent sometime ago ). Consumer program uses
>>>> FailoverManager and AsyncSession, basically not far from the failover
>>>> example in the qpid developing doc. So can anyone please tell me what
>>>> the above error message means and have we seen similar problems to
>>>> the cluster before?
>>>>
>>>>
>>>>         
>>> There have been a number of cluster bugs fixed since 0.5, some of
>>> which had the symptom of a "transport-busy" exception. Can you try a
>>> trunk build and see if you have the same problems?
>>>
>>>       
>> or what distro and version of qpid are you running?
>>
>> Carl.

RE: An ill borker brings down the whole cluster

Posted by Shan Wang <Sh...@igindex.co.uk>.

We need to run the clients on redhat4 machines. Does redhat provide prebuilt qpid client libs for redhat4? The only ones I have in hand are built for redhat5.




-----Original Message-----
From: Carl Trieloff [mailto:cctrieloff@redhat.com]
Sent: 04 November 2009 15:23
To: dev@qpid.apache.org
Cc: users@qpid.apache.org
Subject: Re: An ill borker brings down the whole cluster


Are you able to use the matching client for the broker -  just to rule
that out?  i.e. make sure we are not chasing something that is fixed, or
version mismatch related.

Carl.

Shan Wang wrote:
> Client side we are still using 0.4, I'm not sure about the exact version, should be last version before 0.5.
> Cluster side we are using 0.5.752581-26.el5.
>
> Unfortunately I haven't got the environment to build qpid myself so I can't use latest trunk.
>
> -----Original Message-----
> From: Carl Trieloff [mailto:cctrieloff@redhat.com]
> Sent: 03 November 2009 20:16
> To: dev@qpid.apache.org
> Cc: users@qpid.apache.org
> Subject: Re: An ill borker brings down the whole cluster
>
> Alan Conway wrote:
>
>> On 11/03/2009 06:13 AM, Shan Wang wrote:
>>
>>> Hi All,
>>>
>>> We have two qpid 0.5 brokers running in cluster mode on two different
>>> boxes. The cluster works fine in normal cases, ie, if broker1 is
>>> shutdown cleanly, broker2 will keep on serving clients. But today we
>>> found one broker suddenly lost response to all connected clients and
>>> admin tools. All producer and consumer clients are still connected
>>> but failed to consume any messages from the queue. The command line
>>> admin tool failed with a time out error. The only error message we
>>> found is in the log of broker 1, which said this:
>>>
>>> 2009-oct-31 10:17:49 error 172.27.34.201:9908(READY/error) channel
>>> error 157487219 on 172.27.34.201:9908-389(local): transport-busy:
>>> Channel 1 already attached to guest@QPID.amq.failover676a76fa-56
>>> 64-4e49-9bee-0538532fe261 (qpid/amqp_0_10/SessionHandler.cpp:150)
>>> (unresolved: 172.27.34.201:9908 172.27.34.202:13287 )
>>>
>>> After only restarted broker 1, everything starts to work again. So
>>> surprisingly it seems when one of the brokers in the cluster suffered
>>> a problem, the whole cluster just stalled, at least from the
>>> consumer's point of view ( I can't be sure if the producer was
>>> working during the down time, after back to normal, consumer did
>>> receive messages sent sometime ago ). Consumer program uses
>>> FailoverManager and AsyncSession, basically not far from the failover
>>> example in the qpid developing doc. So can anyone please tell me what
>>> the above error message means and have we seen similar problems to
>>> the cluster before?
>>>
>>>
>> There have been a number of cluster bugs fixed since 0.5, some of
>> which had the symptom of a "transport-busy" exception. Can you try a
>> trunk build and see if you have the same problems?
>>
>
> or what distro and version of qpid are you running?
>
> Carl.

Re: An ill borker brings down the whole cluster

Posted by Carl Trieloff <cc...@redhat.com>.

Are you able to use the matching client for the broker, just to rule
that out? I.e. make sure we are not chasing something that is already
fixed, or something related to a version mismatch.

Carl.

Shan Wang wrote:
> Client side we are still using 0.4, I'm not sure about the exact version, should be last version before 0.5.
> Cluster side we are using 0.5.752581-26.el5.
>
> Unfortunately I haven't got the environment to build qpid myself so I can't use latest trunk.
>
> -----Original Message-----
> From: Carl Trieloff [mailto:cctrieloff@redhat.com]
> Sent: 03 November 2009 20:16
> To: dev@qpid.apache.org
> Cc: users@qpid.apache.org
> Subject: Re: An ill borker brings down the whole cluster
>
> Alan Conway wrote:
>   
>> On 11/03/2009 06:13 AM, Shan Wang wrote:
>>     
>>> Hi All,
>>>
>>> We have two qpid 0.5 brokers running in cluster mode on two different
>>> boxes. The cluster works fine in normal cases, ie, if broker1 is
>>> shutdown cleanly, broker2 will keep on serving clients. But today we
>>> found one broker suddenly lost response to all connected clients and
>>> admin tools. All producer and consumer clients are still connected
>>> but failed to consume any messages from the queue. The command line
>>> admin tool failed with a time out error. The only error message we
>>> found is in the log of broker 1, which said this:
>>>
>>> 2009-oct-31 10:17:49 error 172.27.34.201:9908(READY/error) channel
>>> error 157487219 on 172.27.34.201:9908-389(local): transport-busy:
>>> Channel 1 already attached to guest@QPID.amq.failover676a76fa-56
>>> 64-4e49-9bee-0538532fe261 (qpid/amqp_0_10/SessionHandler.cpp:150)
>>> (unresolved: 172.27.34.201:9908 172.27.34.202:13287 )
>>>
>>> After only restarted broker 1, everything starts to work again. So
>>> surprisingly it seems when one of the brokers in the cluster suffered
>>> a problem, the whole cluster just stalled, at least from the
>>> consumer's point of view ( I can't be sure if the producer was
>>> working during the down time, after back to normal, consumer did
>>> receive messages sent sometime ago ). Consumer program uses
>>> FailoverManager and AsyncSession, basically not far from the failover
>>> example in the qpid developing doc. So can anyone please tell me what
>>> the above error message means and have we seen similar problems to
>>> the cluster before?
>>>
>>>       
>> There have been a number of cluster bugs fixed since 0.5, some of
>> which had the symptom of a "transport-busy" exception. Can you try a
>> trunk build and see if you have the same problems?
>>     
>
> or what distro and version of qpid are you running?
>
> Carl.

RE: An ill borker brings down the whole cluster

Posted by Shan Wang <Sh...@igindex.co.uk>.

On the client side we are still using 0.4; I'm not sure of the exact version, but it should be the last release before 0.5.
On the cluster side we are using 0.5.752581-26.el5.

Unfortunately I don't have an environment to build qpid myself, so I can't use the latest trunk.
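
To make the client/broker mismatch explicit, here is a minimal sketch that compares the release series of the two version strings given above. The helper names (release_series, same_series) are hypothetical and are not part of qpid; the only assumption is that a version string starts with "major.minor".

```python
import re


def release_series(version):
    """Return (major, minor) from strings such as '0.4' or '0.5.752581-26.el5'."""
    match = re.match(r"(\d+)\.(\d+)", version)
    if match is None:
        raise ValueError(f"unrecognised version string: {version!r}")
    return int(match.group(1)), int(match.group(2))


def same_series(client_version, broker_version):
    """True when client and broker come from the same major.minor series."""
    return release_series(client_version) == release_series(broker_version)


# Versions as reported in this thread: a 0.4 client against a 0.5 broker.
print(same_series("0.4", "0.5.752581-26.el5"))
```

With the versions reported here the check reports a mismatch, which is the situation Carl asked to rule out.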

-----Original Message-----
From: Carl Trieloff [mailto:cctrieloff@redhat.com]
Sent: 03 November 2009 20:16
To: dev@qpid.apache.org
Cc: users@qpid.apache.org
Subject: Re: An ill borker brings down the whole cluster

Alan Conway wrote:
> On 11/03/2009 06:13 AM, Shan Wang wrote:
>> Hi All,
>>
>> We have two qpid 0.5 brokers running in cluster mode on two different
>> boxes. The cluster works fine in normal cases, ie, if broker1 is
>> shutdown cleanly, broker2 will keep on serving clients. But today we
>> found one broker suddenly lost response to all connected clients and
>> admin tools. All producer and consumer clients are still connected
>> but failed to consume any messages from the queue. The command line
>> admin tool failed with a time out error. The only error message we
>> found is in the log of broker 1, which said this:
>>
>> 2009-oct-31 10:17:49 error 172.27.34.201:9908(READY/error) channel
>> error 157487219 on 172.27.34.201:9908-389(local): transport-busy:
>> Channel 1 already attached to guest@QPID.amq.failover676a76fa-56
>> 64-4e49-9bee-0538532fe261 (qpid/amqp_0_10/SessionHandler.cpp:150)
>> (unresolved: 172.27.34.201:9908 172.27.34.202:13287 )
>>
>> After only restarted broker 1, everything starts to work again. So
>> surprisingly it seems when one of the brokers in the cluster suffered
>> a problem, the whole cluster just stalled, at least from the
>> consumer's point of view ( I can't be sure if the producer was
>> working during the down time, after back to normal, consumer did
>> receive messages sent sometime ago ). Consumer program uses
>> FailoverManager and AsyncSession, basically not far from the failover
>> example in the qpid developing doc. So can anyone please tell me what
>> the above error message means and have we seen similar problems to
>> the cluster before?
>>
>
> There have been a number of cluster bugs fixed since 0.5, some of
> which had the symptom of a "transport-busy" exception. Can you try a
> trunk build and see if you have the same problems?

or what distro and version of qpid are you running?

Carl.

Re: An ill borker brings down the whole cluster

Posted by Carl Trieloff <cc...@redhat.com>.

Alan Conway wrote:
> On 11/03/2009 06:13 AM, Shan Wang wrote:
>> Hi All,
>>
>> We have two qpid 0.5 brokers running in cluster mode on two different 
>> boxes. The cluster works fine in normal cases, ie, if broker1 is 
>> shutdown cleanly, broker2 will keep on serving clients. But today we 
>> found one broker suddenly lost response to all connected clients and 
>> admin tools. All producer and consumer clients are still connected 
>> but failed to consume any messages from the queue. The command line 
>> admin tool failed with a time out error. The only error message we 
>> found is in the log of broker 1, which said this:
>>
>> 2009-oct-31 10:17:49 error 172.27.34.201:9908(READY/error) channel 
>> error 157487219 on 172.27.34.201:9908-389(local): transport-busy: 
>> Channel 1 already attached to guest@QPID.amq.failover676a76fa-56
>> 64-4e49-9bee-0538532fe261 (qpid/amqp_0_10/SessionHandler.cpp:150) 
>> (unresolved: 172.27.34.201:9908 172.27.34.202:13287 )
>>
>> After only restarted broker 1, everything starts to work again. So 
>> surprisingly it seems when one of the brokers in the cluster suffered 
>> a problem, the whole cluster just stalled, at least from the 
>> consumer's point of view ( I can't be sure if the producer was 
>> working during the down time, after back to normal, consumer did 
>> receive messages sent sometime ago ). Consumer program uses 
>> FailoverManager and AsyncSession, basically not far from the failover 
>> example in the qpid developing doc. So can anyone please tell me what 
>> the above error message means and have we seen similar problems to 
>> the cluster before?
>>
>
> There have been a number of cluster bugs fixed since 0.5, some of 
> which had the symptom of a "transport-busy" exception. Can you try a 
> trunk build and see if you have the same problems? 

or what distro and version of qpid are you running?

Carl.

Re: An ill borker brings down the whole cluster

Posted by Alan Conway <ac...@redhat.com>.

On 11/03/2009 06:13 AM, Shan Wang wrote:
> Hi All,
>
> We have two qpid 0.5 brokers running in cluster mode on two different boxes. The cluster works fine in normal cases, ie, if broker1 is shutdown cleanly, broker2 will keep on serving clients. But today we found one broker suddenly lost response to all connected clients and admin tools. All producer and consumer clients are still connected but failed to consume any messages from the queue. The command line admin tool failed with a time out error. The only error message we found is in the log of broker 1, which said this:
>
> 2009-oct-31 10:17:49 error 172.27.34.201:9908(READY/error) channel error 157487219 on 172.27.34.201:9908-389(local): transport-busy: Channel 1 already attached to guest@QPID.amq.failover676a76fa-56
> 64-4e49-9bee-0538532fe261 (qpid/amqp_0_10/SessionHandler.cpp:150) (unresolved: 172.27.34.201:9908 172.27.34.202:13287 )
>
> After only restarted broker 1, everything starts to work again. So surprisingly it seems when one of the brokers in the cluster suffered a problem, the whole cluster just stalled, at least from the consumer's point of view ( I can't be sure if the producer was working during the down time, after back to normal, consumer did receive messages sent sometime ago ). Consumer program uses FailoverManager and AsyncSession, basically not far from the failover example in the qpid developing doc. So can anyone please tell me what the above error message means and have we seen similar problems to the cluster before?
>

There have been a number of cluster bugs fixed since 0.5, some of which had the 
symptom of a "transport-busy" exception. Can you try a trunk build and see if 
you have the same problems?
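
When checking whether a newer build still shows the symptom, it can help to scan the broker log for channel errors with the "transport-busy" condition. The sketch below is an illustration only: the regular expression is inferred from the single log line quoted in this thread (its first physical line is used as the sample), so the layout is an assumption and may differ in other qpid versions. The function name is hypothetical.

```python
import re

# Layout inferred from the one channel-error line quoted in this thread.
CHANNEL_ERROR = re.compile(
    r"^(?P<date>\S+) (?P<time>\S+) error (?P<member>\S+)\((?P<state>[^)]*)\) "
    r"channel error (?P<code>\d+) on (?P<connection>\S+): "
    r"(?P<condition>[\w-]+): (?P<detail>.*)$"
)


def find_channel_errors(lines, condition="transport-busy"):
    """Return the parsed fields of every channel error with the given condition."""
    hits = []
    for line in lines:
        match = CHANNEL_ERROR.match(line.strip())
        if match and match.group("condition") == condition:
            hits.append(match.groupdict())
    return hits


sample = (
    "2009-oct-31 10:17:49 error 172.27.34.201:9908(READY/error) channel error "
    "157487219 on 172.27.34.201:9908-389(local): transport-busy: Channel 1 "
    "already attached to guest@QPID.amq.failover676a76fa-56"
)

for hit in find_channel_errors([sample, "some unrelated log line"]):
    print(hit["date"], hit["time"], hit["member"], hit["condition"])
```

In practice the list of lines would come from the broker's log file; the path depends on how qpidd was started and is not given in this thread.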

Re: An ill borker brings down the whole cluster

Posted by Carl Trieloff <cc...@redhat.com>.

I don't have enough info to comment on the root cause; maybe Alan can,
based on the log snippet. However, there is a plug-in module that can be
run on nodes in a cluster that will remove any stalled node from the
cluster so that the rest of the cluster can continue to operate as normal.

For example, if you sig-stop one broker in a cluster, then the rest of
the cluster will continue to run, but AIS will cache for the node that
is stopped. It is required that the node be evicted at some point if it
does not get a sig-cont after a period of time. The watchdog plugin does
this for you, at which point you can rejoin another node.

I.e. running the watchdog would have removed the unresponsive broker in
your example below. The second part is to understand why it was
unresponsive.

Carl.
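
The general idea of evicting a member that stops responding can be sketched as a toy heartbeat model. This is not how the Qpid watchdog plugin is implemented and it uses no Qpid API; the class name, the timeout and the injectable clock are all assumptions made for illustration.

```python
import time


class ClusterWatchdog:
    """Toy model: evict members whose last heartbeat is older than a timeout."""

    def __init__(self, timeout_seconds, clock=time.monotonic):
        self.timeout = timeout_seconds
        self.clock = clock          # injectable so the model can be tested
        self.last_seen = {}         # member name -> time of last heartbeat

    def heartbeat(self, member):
        """Record that a member is alive right now."""
        self.last_seen[member] = self.clock()

    def evict_stalled(self):
        """Remove and return members that have been silent longer than the timeout."""
        now = self.clock()
        stalled = [m for m, seen in self.last_seen.items()
                   if now - seen > self.timeout]
        for member in stalled:
            del self.last_seen[member]
        return sorted(stalled)

    def members(self):
        """Return the members still considered part of the cluster."""
        return sorted(self.last_seen)
```

In this model a broker that is stopped simply stops sending heartbeats, so once the timeout passes it is dropped and the remaining members carry on, which mirrors the behaviour described above.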


Shan Wang wrote:
> Hi All,
>
> We have two qpid 0.5 brokers running in cluster mode on two different boxes. The cluster works fine in normal cases, ie, if broker1 is shutdown cleanly, broker2 will keep on serving clients. But today we found one broker suddenly lost response to all connected clients and admin tools. All producer and consumer clients are still connected but failed to consume any messages from the queue. The command line admin tool failed with a time out error. The only error message we found is in the log of broker 1, which said this:
>
> 2009-oct-31 10:17:49 error 172.27.34.201:9908(READY/error) channel error 157487219 on 172.27.34.201:9908-389(local): transport-busy: Channel 1 already attached to guest@QPID.amq.failover676a76fa-56
> 64-4e49-9bee-0538532fe261 (qpid/amqp_0_10/SessionHandler.cpp:150) (unresolved: 172.27.34.201:9908 172.27.34.202:13287 )
>
> After only restarted broker 1, everything starts to work again. So surprisingly it seems when one of the brokers in the cluster suffered a problem, the whole cluster just stalled, at least from the consumer's point of view ( I can't be sure if the producer was working during the down time, after back to normal, consumer did receive messages sent sometime ago ). Consumer program uses FailoverManager and AsyncSession, basically not far from the failover example in the qpid developing doc. So can anyone please tell me what the above error message means and have we seen similar problems to the cluster before?
>
>
> Regards,
> Shan