You are viewing a plain text version of this content. The canonical link for it is here.

Posted to users@qpid.apache.org by Tom M <td...@gmail.com> on 2011/12/07 02:55:11 UTC

Re: problem with qpid heartbeats when sending msgs with size over 1KB

Hello,

we are having a problem with our MRG (qpid) system:

* when sending messages with size of 1600bytes, a connection (used for
sending from client) does not detect the host connection is lost via
heartbeat timeout.

+ we are using C++ qpid client 0.7 and qpidd 0.7 (linux 2.6 x86_64 on both
client and broker hosts)

and Ethernet connection (TCP/IP) between hosts

    + for this connection we have: ConnectionSettings
connectionSettings.heartbeat = 8

    + simulating a system failure by pulling the ethernet cable to the
broker host

    + the connection close Exception is caught by the client after many
minutes (6 to 20mins), I'm guessing this is due to the TCP timeout and not
the missed heartbeats.

    + with the same exact application (for our client), if sending messages
of 200bytes, we do get the qpid exception indicating the Connection closed
(catch TransportFailure Exception: connection closed) within 16 seconds.
For this testing, there were no other changes between the 2 cases, other
than the size of the messages sent from the client (only expanded the size
of the string in the body of the message) (1 message sent per second in
both cases).

* is this a known problem with qpid 0.7?

* is there patch to fix this for qpid 0.7?

* has this problem already been fixed in later releases?

NOTE: we have already deployed qpid 0.7 in our system, and we will not be
able to upgrade to a newer full release for many months.

I'm wondering if the problem is that the connection gets blocked with the
first TCP packet of a multiple packet message, such that the heartbeat
detection is disabled until the full message is sent. But, if the
multi-packet message can not complete (since socket is broken), the
heartbeat logic is held disabled until the multi-packet message can
complete (which in this case it can not).

thanks,

Tom Maggio

Re: problem with qpid heartbeats when sending msgs with size over 1KB

Posted by Tom M <td...@gmail.com>.

Attached is code for this test client.
thanks,
Tom

On Fri, Jan 6, 2012 at 7:26 PM, Tom M <td...@gmail.com> wrote:

>
> I�ve created a simpler test client (based on our deployed application) to
> test this problem.
>
> It opens a connection (with connSettings.heartbeat = 8) and session and
> then starts sending messages (once a second), using  MessageReplayTracker::send(
> ).  The size of the message is selectable by the user (messages of  6
> bytes or 1600bytes).
>
> After about 30 seconds, I pulled the network cable.
>
> * If sending msgs of only a 6 bytes, a qpid::Exception is thrown
> indicating the connection is closed (about 16 seconds after the cable was
> pulled).
>
> * If sending msgs of 1600 bytes, the client continued sending without
> detecting the connection closed (continued sending for about another
> 88seconds, then apparently stopped due to an out queue full, but never
> reporting an exception during this time).   From previous testing, it
> does eventually timeout (probably from the TCP timeout) and throws an
> exception, but not until 6 to 20mins later.
>
>
>
> I�ve attached the results from 4 runs (for each one, the network
> connection pulled at 30seconds into the xfers).
>
> Two runs sending msgs of only a 6bytes (good cases) and 2 runs sending
> msgs of 1600bytes (problem cases).
>
> For each, I�m including the logged output and pstack dumps.
>
> Note in particular trace logging for the problem case: 01_03d.   It may
> be notable that the Frame SEND�s stopped before the heartbeat timeout
> should have occurred (see details below).
>
>
>
> To help look for differences, I took pstack dumps at several points in
> time during the runs (included sequentially in the pstack.*.log files for
> each run, time is printed just above each block).  Primarily:
>
>    * pstack before connection pulled (20 seconds into xfers)
>
>    * pstack 10 seconds after connection pulled (a few seconds prior to 2x
> heartbeat rate)
>
>    * for cases sending 1600byte msgs, several more pstack�s after the
> point that heartbeat timeout should have occurred (after pulled connection,
> by:  20, 30, 40 seconds)
>
>         Then one more pstack after the sends stopped (due to output full).
>
>
>
> For each of the msg sizes, I did the first run with default logging (to
> show app logging) and then the second with QPID_LOG set to �trace+�
>
>
>
> Details about the 4 runs:
>
> * 01_03a:   sent msg�s size =   6 bytes  (successful case,  app only
> logging)
>
>        about 06:20:29 cable pulled at (about at sent msg:  30)
>
>        06:20:45 see qpid::Exception  Connection closed
>
>        Only app logging:     qpidConnTst.msgSmall.01_03a.log
>
>        pstack.msgSmall.01_03a.log
>
>
>
> * 01_03b:   sent msg�s size =   6 bytes  (successful case,  trace logging)
>
>           06:55:55  last  Heartbeat RECV & SENT
>
>        about 06:56:01  cable pulled at (about at sent msg:  30)
>
>        06:56:17  see qpid::Exception  Connection closed
>
>        Trace  logging:     qpidConnTst.msgSmall.01_03b.log
>
>        pstack.msgSmall.01_03b.log
>
>
>
> * 01_03c:   sent msg�s size = 1600 bytes  (problem case,  app only
> logging)
>
>        about 07:06:15  cable pulled at (about at sent msg:  30)
>
>          Note:  client continued sending without receiving an exception
>
>             Continued to send until  msg:  118  (additional 88 seconds
> after cable pulled),
>
>                 then blocked.
>
>             I terminated more than 4 mins later, still had not received
> exception
>
>
>
> * 01_03d:   sent msg�s size = 1600 bytes  (problem case,  app only
> logging)
>
>           07:18:27 last  Heartbeat RECV & SENT
>
>        about 07:18:28  cable pulled at (about at sent msg:  30)
>
>          Note:  client continued sending 11 more msgs (11 more seconds).
>
>             Then the client app continued to make calls  MessageReplayTracker::send(
> )
>
>                  without receiving an exception, but the traces stopped
> for Frame SENT�s
>
>                  NOTE:  this is before the  heartbeat timeout should have
> occurred.
>
>          07:18:43 debug Traffic timeout, but still no exception caught by
> app.
>
>          The app continued to make calls  MessageReplayTracker::send( )
>
>                  without receiving an exception.
>
>             Continued to send until  msg:  118  (additional 88 seconds
> after cable pulled),
>
>                 then app was blocked.
>
>             I terminated more than 4 mins later, still had not received
> exception
>
>
>
>
>
> Some additional notes about this client:
>
> * This client (as with some of the instances in our deployed application)
> is a message sender only.
>
>        * since not intending on receiving messages on this client, we
> did NOT derive this client from qpid::client::MessageListener
>
>    * sending using:   qpid::client::MessageReplayTracker::send( )
>
>    * using async sending (for speed), not calling  Session::sync( ).
>
>         However, I tried including adding  sync�s, and found that, when
> sending the 1600 byte msgs, the client would only block on the sync and
> still not get an exception for the heartbeat timeouts.
>
>     * I added a call to  Connection::isOpen( ) in the loop to see if this
> would trigger the exception, but it did not.
>
> + we are using C++ qpid client 0.7 and qpidd 0.7 (linux 2.6 x86_64 on both
> client and broker hosts) .   In our deployed systems, we will not be able
> to upgrade qpid versions for at least several months.
>
> I will post the test code in another message.
>
> Please let us know what you can find.
>
> Thanks,
>
> Tom
>
>
>
> On Thu, Dec 8, 2011 at 7:37 AM, Tom M <td...@gmail.com> wrote:
>
>> Gordon,
>> thank you for your response.
>>
>> Yes, the test with larger message has failed every time.
>> I will look at sending the trace logs and a thread dump from the client
>> from our system.
>>
>> thanks,
>> Tom Maggio
>> Raytheon Co.
>> Dallas, TX
>> 972/205-4377
>>
>>   On Wed, Dec 7, 2011 at 8:37 AM, Gordon Sim <gs...@redhat.com> wrote:
>>
>>> On 12/07/2011 01:55 AM, Tom M wrote:
>>>
>>>> Hello,
>>>>
>>>> we are having a problem with our MRG (qpid) system:
>>>>
>>>> * when sending messages with size of 1600bytes, a connection (used for
>>>> sending from client) does not detect the host connection is lost via
>>>> heartbeat timeout.
>>>>
>>>> + we are using C++ qpid client 0.7 and qpidd 0.7 (linux 2.6 x86_64 on
>>>> both
>>>> client and broker hosts)
>>>>
>>>> and Ethernet connection (TCP/IP) between hosts
>>>>
>>>>     + for this connection we have: ConnectionSettings
>>>> connectionSettings.heartbeat = 8
>>>>
>>>>     + simulating a system failure by pulling the ethernet cable to the
>>>> broker host
>>>>
>>>>     + the connection close Exception is caught by the client after many
>>>> minutes (6 to 20mins), I'm guessing this is due to the TCP timeout and
>>>> not
>>>> the missed heartbeats.
>>>>
>>>>     + with the same exact application (for our client), if sending
>>>> messages
>>>> of 200bytes, we do get the qpid exception indicating the Connection
>>>> closed
>>>> (catch TransportFailure Exception: connection closed) within 16 seconds.
>>>> For this testing, there were no other changes between the 2 cases, other
>>>> than the size of the messages sent from the client (only expanded the
>>>> size
>>>> of the string in the body of the message) (1 message sent per second in
>>>> both cases).
>>>>
>>>> * is this a known problem with qpid 0.7?
>>>>
>>>
>>> No, i don't think this is a known issue.
>>>
>>>
>>> * is there patch to fix this for qpid 0.7?
>>>>
>>>> * has this problem already been fixed in later releases?
>>>>
>>>> NOTE: we have already deployed qpid 0.7 in our system, and we will not
>>>> be
>>>> able to upgrade to a newer full release for many months.
>>>>
>>>> I'm wondering if the problem is that the connection gets blocked with
>>>> the
>>>> first TCP packet of a multiple packet message, such that the heartbeat
>>>> detection is disabled until the full message is sent. But, if the
>>>> multi-packet message can not complete (since socket is broken), the
>>>> heartbeat logic is held disabled until the multi-packet message can
>>>> complete (which in this case it can not).
>>>>
>>>
>>> There is nothing that directly (intentionally) does anything like this.
>>> However it may be possible that there is some deadlock or liveness issue
>>> that prevents correct function in some cases.
>>>
>>> Is the test always failing with the larger message size? There is
>>> actually no difference in the AMQP framing for a 200 byte v a 1600 byte
>>> message. It may just be that the different timing of the larger write
>>> somehow triggers the issue.
>>>
>>> Can you get trace level logs and a thread dump from the client for a
>>> failed case?
>>>
>>> ------------------------------**------------------------------**
>>> ---------
>>> Apache Qpid - AMQP Messaging Implementation
>>> Project:      http://qpid.apache.org
>>> Use/Interact: mailto:users-subscribe@qpid.**apache.org<us...@qpid.apache.org>
>>>
>>>
>>
>

Re: problem with qpid heartbeats when sending msgs with size over 1KB

Posted by Gordon Sim <gs...@redhat.com>.

On 02/10/2012 01:54 PM, Gordon Sim wrote:
> On 02/08/2012 06:01 PM, Tom M wrote:
>> Hi Gordon (and team),
>> have you been able to try this yet?
>
> No, sorry... I have been on holiday and in meetings this week with
> almost no time to look at it. I will endeavour to reproduce this weekend
> and let you know by monday.
>
>> Or, should I go ahead and submit this as an bug/issue to be worked?
>
> Sure, by all means raise a JIRA to track it. You can assign to me in the
> first instance.

Fyi: I've confirmed and diagnosed the issue and raised a JIRA to track 
it (https://issues.apache.org/jira/browse/QPID-3828).

---------------------------------------------------------------------
Apache Qpid - AMQP Messaging Implementation
Project:      http://qpid.apache.org
Use/Interact: mailto:users-subscribe@qpid.apache.org

Re: problem with qpid heartbeats when sending msgs with size over 1KB

Posted by Gordon Sim <gs...@redhat.com>.

On 02/08/2012 06:01 PM, Tom M wrote:
> Hi Gordon (and team),
> have you been able to try this yet?

No, sorry... I have been on holiday and in meetings this week with 
almost no time to look at it. I will endeavour to reproduce this weekend 
and let you know by monday.

> Or, should I go ahead and submit this as an bug/issue to be worked?

Sure, by all means raise a JIRA to track it. You can assign to me in the 
first instance.

---------------------------------------------------------------------
Apache Qpid - AMQP Messaging Implementation
Project:      http://qpid.apache.org
Use/Interact: mailto:users-subscribe@qpid.apache.org

Re: problem with qpid heartbeats when sending msgs with size over 1KB

Posted by Tom M <td...@gmail.com>.

Hi Gordon (and team),
have you been able to try this yet?
Or, should I go ahead and submit this as an bug/issue to be worked?

thanks,
Tom

On Tue, Jan 31, 2012 at 4:03 PM, Tom M <td...@gmail.com> wrote:

> Hi Gordon (and team),
> Have you had a chance to look at this more?  (in  particular, running the
> test while pulling the network cable (or abruptly shutting down the broker
> host)).
> thanks,
> Tom
>
>
>
> On Sat, Jan 14, 2012 at 6:07 PM, Tom M <td...@gmail.com> wrote:
>
>> I don't have root privages on this system to run the script from Alan.
>> But, I did again verify that I get the same results just disconnecting
>> the network cable for the client host (while running my test client).
>> (Incidentlly, I don' t remember if I mentioned, we first detected this
>> problem when our broker host crash (so the OS, NIC, nor switch link had
>> opportunity to close gracefully), which, I assume, is a case that is
>> intended to be covered, but just pulling the network cable appears to give
>> the same result (yet, obviously, a less radical test to work with)).
>>
>>  Also, a few more notes from the orig trace logs (which you may have
>> already noticed).
>> * As seen in the failed case 01_03d  (which actually had trace logging (I
>> miss labeled in my email description)),
>> with the larger sent message size, the underlining qpid client had
>> stopped sending msg's before the heartbeat timeout would have occurred.
>> The heartbeat rate was set to 8.
>> The cable was pulled at about "msg: 30", and the last "trace SENT" was
>> for msg:41 (one msg per sec).
>> Then, 4 more messages were "sent" by the application (via
>> MessageReplayTracker) but there are no traces from the qpid client code.
>> Then, almost 16 seconds after the net disconnect, there is an indication
>> of "Traffic timeout", but there is no action by the client code after
>> this.
>>
>> In the good cases (ie. case 01_03b with the smaller msg, again net cable
>> pulled at about msg:30), we continue to see qpid client performing "SENT"s
>>  for all messages up to the "Traffic timeout".  Then the timeout occurs,
>> which is followed by the "Exception constructed" for the close (which does
>> not happen with the failed case).
>>
>> I'm wondering if an outgoing send buffering filling up is somehow
>> blocking the logic to act on the timout.
>>
>> I don't know if this is somehow affected by the actual link level getting
>> disconnected (which, if somehow related, might help to explain the
>> differing results when doing a STOP on the broker as opposed to the broker
>> host going down or net pulled.
>> Also, I wondering if just dropping the network packets will have the same
>> or different affect (particularly if the condition requires the send path
>> backing up).
>>
>> thanks,
>> Tom
>>
>>
>>   On Fri, Jan 13, 2012 at 8:34 AM, Alan Conway <ac...@redhat.com>wrote:
>>
>>>  On 01/13/2012 09:07 AM, Tom M wrote:
>>>
>>>> Hi Gordon,
>>>> sorry I didn't get back to you yesterday, but I needed to get an
>>>> opportunity to start a separate broker that I could stop on one of our
>>>> systems.
>>>>
>>>> We had seen these different results with our deployed system, and I
>>>> want to
>>>> make sure that this test client acted the same way.
>>>> (on our deployed sytem, most of failover testing had been done with a
>>>> kill
>>>> on the broker and we would see the client detect the lost connection.
>>>>  But
>>>> later, when a host died, we found this other condition, where the client
>>>> did not detect it)
>>>>
>>>> So, I ran the test:
>>>> * started a separate broker (see below)
>>>> * started the same test client, as before, with the larger msg size
>>>> * did a kill on the broker:    kill -STOP<pid>
>>>> * saw the same results as you did, the client detected the loss
>>>> connection
>>>> in about 2x heartbeat rate
>>>>
>>>> Then, to verify my earlier results, I ran the same exact test, except
>>>> this
>>>> time pulled the network cable:
>>>> * started a separate broker (same as previous run above)
>>>> * started the same test client, as before, with the larger msg size
>>>> * pulled network cable
>>>> * saw same results as my previous tests: client continued to "send" well
>>>> past the heartbeat timeout should have been (seeing same trace
>>>> messages),
>>>> until about 80 seconds later, locked up.
>>>>
>>>> Note, I've noted this result (client sending the larger msg missing the
>>>> detection of the lost connection) also happens if the broker host
>>>> abruptly
>>>> dies (which is how we first detected the problem).
>>>>
>>>>
>>>> Note: I used the following to start broker:
>>>> /usr/sbin/qpidd -p 18102 --log-to-syslog no --log-to-file
>>>> /export/hps/dda/qpidd_x/log/**qpidd_x.log --worker-threads 3 --data-dir
>>>> /export/hps/dda/qpidd_x/data --pid-dir /export/hps/dda/qpidd_x/pid-**
>>>> dir
>>>> --auth no --config /dev/null
>>>>
>>>> So, please let me know if you can run the test again, but pulling the
>>>> network cable (I'm pulling net between broker and switch, but, I'm
>>>> pretty
>>>> sure I've seen the same when pulling net between switch and client).
>>>> thanks,
>>>> Tom
>>>>
>>>
>>> You can simulate a network cable pull by telling iptables to drop
>>> packets. Attached an old script, no warranty.
>>> WARNING: if you've got remote access only to the machine in question be
>>> careful you don't shut yourself out! The attached script only drops
>>> corosync/openais packets, so you can still ssh etc.
>>>
>>
>>
>

Re: problem with qpid heartbeats when sending msgs with size over 1KB

Posted by Tom M <td...@gmail.com>.

Hi Gordon (and team),
Have you had a chance to look at this more?  (in  particular, running the
test while pulling the network cable (or abruptly shutting down the broker
host)).
thanks,
Tom



On Sat, Jan 14, 2012 at 6:07 PM, Tom M <td...@gmail.com> wrote:

> I don't have root privages on this system to run the script from Alan.
> But, I did again verify that I get the same results just disconnecting the
> network cable for the client host (while running my test client).
> (Incidentlly, I don' t remember if I mentioned, we first detected this
> problem when our broker host crash (so the OS, NIC, nor switch link had
> opportunity to close gracefully), which, I assume, is a case that is
> intended to be covered, but just pulling the network cable appears to give
> the same result (yet, obviously, a less radical test to work with)).
>
>  Also, a few more notes from the orig trace logs (which you may have
> already noticed).
> * As seen in the failed case 01_03d  (which actually had trace logging (I
> miss labeled in my email description)),
> with the larger sent message size, the underlining qpid client had stopped
> sending msg's before the heartbeat timeout would have occurred.
> The heartbeat rate was set to 8.
> The cable was pulled at about "msg: 30", and the last "trace SENT" was for
> msg:41 (one msg per sec).
> Then, 4 more messages were "sent" by the application (via
> MessageReplayTracker) but there are no traces from the qpid client code.
> Then, almost 16 seconds after the net disconnect, there is an indication
> of "Traffic timeout", but there is no action by the client code after
> this.
>
> In the good cases (ie. case 01_03b with the smaller msg, again net cable
> pulled at about msg:30), we continue to see qpid client performing "SENT"s
>  for all messages up to the "Traffic timeout".  Then the timeout occurs,
> which is followed by the "Exception constructed" for the close (which does
> not happen with the failed case).
>
> I'm wondering if an outgoing send buffering filling up is somehow blocking
> the logic to act on the timout.
>
> I don't know if this is somehow affected by the actual link level getting
> disconnected (which, if somehow related, might help to explain the
> differing results when doing a STOP on the broker as opposed to the broker
> host going down or net pulled.
> Also, I wondering if just dropping the network packets will have the same
> or different affect (particularly if the condition requires the send path
> backing up).
>
> thanks,
> Tom
>
>
>   On Fri, Jan 13, 2012 at 8:34 AM, Alan Conway <ac...@redhat.com> wrote:
>
>>  On 01/13/2012 09:07 AM, Tom M wrote:
>>
>>> Hi Gordon,
>>> sorry I didn't get back to you yesterday, but I needed to get an
>>> opportunity to start a separate broker that I could stop on one of our
>>> systems.
>>>
>>> We had seen these different results with our deployed system, and I want
>>> to
>>> make sure that this test client acted the same way.
>>> (on our deployed sytem, most of failover testing had been done with a
>>> kill
>>> on the broker and we would see the client detect the lost connection.
>>>  But
>>> later, when a host died, we found this other condition, where the client
>>> did not detect it)
>>>
>>> So, I ran the test:
>>> * started a separate broker (see below)
>>> * started the same test client, as before, with the larger msg size
>>> * did a kill on the broker:    kill -STOP<pid>
>>> * saw the same results as you did, the client detected the loss
>>> connection
>>> in about 2x heartbeat rate
>>>
>>> Then, to verify my earlier results, I ran the same exact test, except
>>> this
>>> time pulled the network cable:
>>> * started a separate broker (same as previous run above)
>>> * started the same test client, as before, with the larger msg size
>>> * pulled network cable
>>> * saw same results as my previous tests: client continued to "send" well
>>> past the heartbeat timeout should have been (seeing same trace messages),
>>> until about 80 seconds later, locked up.
>>>
>>> Note, I've noted this result (client sending the larger msg missing the
>>> detection of the lost connection) also happens if the broker host
>>> abruptly
>>> dies (which is how we first detected the problem).
>>>
>>>
>>> Note: I used the following to start broker:
>>> /usr/sbin/qpidd -p 18102 --log-to-syslog no --log-to-file
>>> /export/hps/dda/qpidd_x/log/**qpidd_x.log --worker-threads 3 --data-dir
>>> /export/hps/dda/qpidd_x/data --pid-dir /export/hps/dda/qpidd_x/pid-**dir
>>> --auth no --config /dev/null
>>>
>>> So, please let me know if you can run the test again, but pulling the
>>> network cable (I'm pulling net between broker and switch, but, I'm pretty
>>> sure I've seen the same when pulling net between switch and client).
>>> thanks,
>>> Tom
>>>
>>
>> You can simulate a network cable pull by telling iptables to drop
>> packets. Attached an old script, no warranty.
>> WARNING: if you've got remote access only to the machine in question be
>> careful you don't shut yourself out! The attached script only drops
>> corosync/openais packets, so you can still ssh etc.
>>
>
>

Re: problem with qpid heartbeats when sending msgs with size over 1KB

Posted by Tom M <td...@gmail.com>.

I don't have root privages on this system to run the script from Alan.
But, I did again verify that I get the same results just disconnecting the
network cable for the client host (while running my test client).
(Incidentlly, I don' t remember if I mentioned, we first detected this
problem when our broker host crash (so the OS, NIC, nor switch link had
opportunity to close gracefully), which, I assume, is a case that is
intended to be covered, but just pulling the network cable appears to give
the same result (yet, obviously, a less radical test to work with)).

 Also, a few more notes from the orig trace logs (which you may have
already noticed).
* As seen in the failed case 01_03d  (which actually had trace logging (I
miss labeled in my email description)),
with the larger sent message size, the underlining qpid client had stopped
sending msg's before the heartbeat timeout would have occurred.
The heartbeat rate was set to 8.
The cable was pulled at about "msg: 30", and the last "trace SENT" was for
msg:41 (one msg per sec).
Then, 4 more messages were "sent" by the application (via
MessageReplayTracker) but there are no traces from the qpid client code.
Then, almost 16 seconds after the net disconnect, there is an indication
of "Traffic timeout", but there is no action by the client code after
this.

In the good cases (ie. case 01_03b with the smaller msg, again net cable
pulled at about msg:30), we continue to see qpid client performing "SENT"s
 for all messages up to the "Traffic timeout".  Then the timeout occurs,
which is followed by the "Exception constructed" for the close (which does
not happen with the failed case).

I'm wondering if an outgoing send buffering filling up is somehow blocking
the logic to act on the timout.

I don't know if this is somehow affected by the actual link level getting
disconnected (which, if somehow related, might help to explain the
differing results when doing a STOP on the broker as opposed to the broker
host going down or net pulled.
Also, I wondering if just dropping the network packets will have the same
or different affect (particularly if the condition requires the send path
backing up).

thanks,
Tom

On Fri, Jan 13, 2012 at 8:34 AM, Alan Conway <ac...@redhat.com> wrote:

>  On 01/13/2012 09:07 AM, Tom M wrote:
>
>> Hi Gordon,
>> sorry I didn't get back to you yesterday, but I needed to get an
>> opportunity to start a separate broker that I could stop on one of our
>> systems.
>>
>> We had seen these different results with our deployed system, and I want
>> to
>> make sure that this test client acted the same way.
>> (on our deployed sytem, most of failover testing had been done with a kill
>> on the broker and we would see the client detect the lost connection.  But
>> later, when a host died, we found this other condition, where the client
>> did not detect it)
>>
>> So, I ran the test:
>> * started a separate broker (see below)
>> * started the same test client, as before, with the larger msg size
>> * did a kill on the broker:    kill -STOP<pid>
>> * saw the same results as you did, the client detected the loss connection
>> in about 2x heartbeat rate
>>
>> Then, to verify my earlier results, I ran the same exact test, except this
>> time pulled the network cable:
>> * started a separate broker (same as previous run above)
>> * started the same test client, as before, with the larger msg size
>> * pulled network cable
>> * saw same results as my previous tests: client continued to "send" well
>> past the heartbeat timeout should have been (seeing same trace messages),
>> until about 80 seconds later, locked up.
>>
>> Note, I've noted this result (client sending the larger msg missing the
>> detection of the lost connection) also happens if the broker host abruptly
>> dies (which is how we first detected the problem).
>>
>>
>> Note: I used the following to start broker:
>> /usr/sbin/qpidd -p 18102 --log-to-syslog no --log-to-file
>> /export/hps/dda/qpidd_x/log/**qpidd_x.log --worker-threads 3 --data-dir
>> /export/hps/dda/qpidd_x/data --pid-dir /export/hps/dda/qpidd_x/pid-**dir
>> --auth no --config /dev/null
>>
>> So, please let me know if you can run the test again, but pulling the
>> network cable (I'm pulling net between broker and switch, but, I'm pretty
>> sure I've seen the same when pulling net between switch and client).
>> thanks,
>> Tom
>>
>
> You can simulate a network cable pull by telling iptables to drop packets.
> Attached an old script, no warranty.
> WARNING: if you've got remote access only to the machine in question be
> careful you don't shut yourself out! The attached script only drops
> corosync/openais packets, so you can still ssh etc.
>

Re: problem with qpid heartbeats when sending msgs with size over 1KB

Posted by Alan Conway <ac...@redhat.com>.

On 01/13/2012 09:07 AM, Tom M wrote:
> Hi Gordon,
> sorry I didn't get back to you yesterday, but I needed to get an
> opportunity to start a separate broker that I could stop on one of our
> systems.
>
> We had seen these different results with our deployed system, and I want to
> make sure that this test client acted the same way.
> (on our deployed sytem, most of failover testing had been done with a kill
> on the broker and we would see the client detect the lost connection.  But
> later, when a host died, we found this other condition, where the client
> did not detect it)
>
> So, I ran the test:
> * started a separate broker (see below)
> * started the same test client, as before, with the larger msg size
> * did a kill on the broker:    kill -STOP<pid>
> * saw the same results as you did, the client detected the loss connection
> in about 2x heartbeat rate
>
> Then, to verify my earlier results, I ran the same exact test, except this
> time pulled the network cable:
> * started a separate broker (same as previous run above)
> * started the same test client, as before, with the larger msg size
> * pulled network cable
> * saw same results as my previous tests: client continued to "send" well
> past the heartbeat timeout should have been (seeing same trace messages),
> until about 80 seconds later, locked up.
>
> Note, I've noted this result (client sending the larger msg missing the
> detection of the lost connection) also happens if the broker host abruptly
> dies (which is how we first detected the problem).
>
>
> Note: I used the following to start broker:
> /usr/sbin/qpidd -p 18102 --log-to-syslog no --log-to-file
> /export/hps/dda/qpidd_x/log/qpidd_x.log --worker-threads 3 --data-dir
> /export/hps/dda/qpidd_x/data --pid-dir /export/hps/dda/qpidd_x/pid-dir
> --auth no --config /dev/null
>
> So, please let me know if you can run the test again, but pulling the
> network cable (I'm pulling net between broker and switch, but, I'm pretty
> sure I've seen the same when pulling net between switch and client).
> thanks,
> Tom

You can simulate a network cable pull by telling iptables to drop packets. 
Attached an old script, no warranty.
WARNING: if you've got remote access only to the machine in question be careful 
you don't shut yourself out! The attached script only drops corosync/openais 
packets, so you can still ssh etc.

Re: problem with qpid heartbeats when sending msgs with size over 1KB

Posted by Tom M <td...@gmail.com>.

Hi Gordon,
sorry I didn't get back to you yesterday, but I needed to get an
opportunity to start a separate broker that I could stop on one of our
systems.

We had seen these different results with our deployed system, and I want to
make sure that this test client acted the same way.
(on our deployed sytem, most of failover testing had been done with a kill
on the broker and we would see the client detect the lost connection.  But
later, when a host died, we found this other condition, where the client
did not detect it)

So, I ran the test:
* started a separate broker (see below)
* started the same test client, as before, with the larger msg size
* did a kill on the broker:    kill -STOP <pid>
* saw the same results as you did, the client detected the loss connection
in about 2x heartbeat rate

Then, to verify my earlier results, I ran the same exact test, except this
time pulled the network cable:
* started a separate broker (same as previous run above)
* started the same test client, as before, with the larger msg size
* pulled network cable
* saw same results as my previous tests: client continued to "send" well
past the heartbeat timeout should have been (seeing same trace messages),
until about 80 seconds later, locked up.

Note, I've noted this result (client sending the larger msg missing the
detection of the lost connection) also happens if the broker host abruptly
dies (which is how we first detected the problem).

Note: I used the following to start broker:
/usr/sbin/qpidd -p 18102 --log-to-syslog no --log-to-file
/export/hps/dda/qpidd_x/log/qpidd_x.log --worker-threads 3 --data-dir
/export/hps/dda/qpidd_x/data --pid-dir /export/hps/dda/qpidd_x/pid-dir
--auth no --config /dev/null

So, please let me know if you can run the test again, but pulling the
network cable (I'm pulling net between broker and switch, but, I'm pretty
sure I've seen the same when pulling net between switch and client).
thanks,
Tom

On Thu, Jan 12, 2012 at 10:05 AM, Gordon Sim <gs...@redhat.com> wrote:

> On 01/07/2012 01:26 AM, Tom M wrote:
>
>> I’ve created a simpler test client (based on our deployed application)
>> to test this problem.
>>
>
> I ran your test client against the same package versions you listed for
> qpidd and the client lib. I didn't pull a cable (as I was testing on remote
> boxes) but instead issues a kill -STOP against the broker which should be
> similar from the perspective of the client (i.e. it will miss two
> heartbeats and abort the connection).
>
> However in all my attempts it did correctly detect the closed connection
> and issue an exception. The output was of the following form:
>
>  01_12 15:51:34  TstConn:  sending msg: 45
>>  01_12 15:51:34  TstConn:      msg sent
>>  01_12 15:51:35  TstConn:  sending msg: 46
>>  01_12 15:51:35  TstConn:      msg sent
>> 2012-01-12 10:51:35 warning Connection [42787 mrg11:5672] closed
>>
>>    01_12 15:51:36 TstConn:  connection_.isOpen() detected lost connection
>>       note:  detected with isOpen() call,  not an exception....
>>
>>  01_12 15:51:36  TstConn:  sending msg: 47
>>  01_12 15:51:36TstConn: qpid::Exception: Connection [42787 mrg11:5672]
>> closed
>>
>>
>>   ...waiting on user (allow user to reconnect cable,  so can attempt to
>> close connection on broker)
>>
>>  To continue shutdown, enter:  1
>>
>
> This is the same whichever size I choose. Would you mind verifying if you
> see your problem with a kill -STOP in place of the network cable removal?
> If not, I'll try and get a setup to test on where i can pull a physical
> cable; if you can then it confirms there is something else different in our
> test setup.
>
>
>
> ------------------------------**------------------------------**---------
> Apache Qpid - AMQP Messaging Implementation
> Project:      http://qpid.apache.org
> Use/Interact: mailto:users-subscribe@qpid.**apache.org<us...@qpid.apache.org>
>
>

Re: problem with qpid heartbeats when sending msgs with size over 1KB

Posted by Gordon Sim <gs...@redhat.com>.

On 01/07/2012 01:26 AM, Tom M wrote:
> I’ve created a simpler test client (based on our deployed application)
> to test this problem.

I ran your test client against the same package versions you listed for 
qpidd and the client lib. I didn't pull a cable (as I was testing on 
remote boxes) but instead issues a kill -STOP against the broker which 
should be similar from the perspective of the client (i.e. it will miss 
two heartbeats and abort the connection).

However in all my attempts it did correctly detect the closed connection 
and issue an exception. The output was of the following form:

>   01_12 15:51:34  TstConn:  sending msg: 45
>   01_12 15:51:34  TstConn:      msg sent
>   01_12 15:51:35  TstConn:  sending msg: 46
>   01_12 15:51:35  TstConn:      msg sent
> 2012-01-12 10:51:35 warning Connection [42787 mrg11:5672] closed
>
>     01_12 15:51:36 TstConn:  connection_.isOpen() detected lost connection
>        note:  detected with isOpen() call,  not an exception....
>
>   01_12 15:51:36  TstConn:  sending msg: 47
>   01_12 15:51:36TstConn: qpid::Exception: Connection [42787 mrg11:5672] closed
>
>
>    ...waiting on user (allow user to reconnect cable,  so can attempt to close connection on broker)
>
>  To continue shutdown, enter:  1

This is the same whichever size I choose. Would you mind verifying if 
you see your problem with a kill -STOP in place of the network cable 
removal? If not, I'll try and get a setup to test on where i can pull a 
physical cable; if you can then it confirms there is something else 
different in our test setup.

---------------------------------------------------------------------
Apache Qpid - AMQP Messaging Implementation
Project:      http://qpid.apache.org
Use/Interact: mailto:users-subscribe@qpid.apache.org

Re: problem with qpid heartbeats when sending msgs with size over 1KB

Posted by Tom M <td...@gmail.com>.

I�ve created a simpler test client (based on our deployed application) to
test this problem.

It opens a connection (with connSettings.heartbeat = 8) and session and
then starts sending messages (once a second), using
MessageReplayTracker::send(
).  The size of the message is selectable by the user (messages of  6 bytes
or 1600bytes).

After about 30 seconds, I pulled the network cable.

* If sending msgs of only a 6 bytes, a qpid::Exception is thrown indicating
the connection is closed (about 16 seconds after the cable was pulled).

* If sending msgs of 1600 bytes, the client continued sending without
detecting the connection closed (continued sending for about another
88seconds, then apparently stopped due to an out queue full, but never
reporting an exception during this time).   From previous testing, it does
eventually timeout (probably from the TCP timeout) and throws an exception,
but not until 6 to 20mins later.



I�ve attached the results from 4 runs (for each one, the network connection
pulled at 30seconds into the xfers).

Two runs sending msgs of only a 6bytes (good cases) and 2 runs sending msgs
of 1600bytes (problem cases).

For each, I�m including the logged output and pstack dumps.

Note in particular trace logging for the problem case: 01_03d.   It may be
notable that the Frame SEND�s stopped before the heartbeat timeout should
have occurred (see details below).



To help look for differences, I took pstack dumps at several points in time
during the runs (included sequentially in the pstack.*.log files for each
run, time is printed just above each block).  Primarily:

   * pstack before connection pulled (20 seconds into xfers)

   * pstack 10 seconds after connection pulled (a few seconds prior to 2x
heartbeat rate)

   * for cases sending 1600byte msgs, several more pstack�s after the point
that heartbeat timeout should have occurred (after pulled connection, by:  20,
30, 40 seconds)

        Then one more pstack after the sends stopped (due to output full).



For each of the msg sizes, I did the first run with default logging (to
show app logging) and then the second with QPID_LOG set to �trace+�



Details about the 4 runs:

* 01_03a:   sent msg�s size =   6 bytes  (successful case,  app only
logging)

       about 06:20:29 cable pulled at (about at sent msg:  30)

       06:20:45 see qpid::Exception  Connection closed

       Only app logging:     qpidConnTst.msgSmall.01_03a.log

       pstack.msgSmall.01_03a.log



* 01_03b:   sent msg�s size =   6 bytes  (successful case,  trace logging)

          06:55:55  last  Heartbeat RECV & SENT

       about 06:56:01  cable pulled at (about at sent msg:  30)

       06:56:17  see qpid::Exception  Connection closed

       Trace  logging:     qpidConnTst.msgSmall.01_03b.log

       pstack.msgSmall.01_03b.log



* 01_03c:   sent msg�s size = 1600 bytes  (problem case,  app only logging)

       about 07:06:15  cable pulled at (about at sent msg:  30)

         Note:  client continued sending without receiving an exception

            Continued to send until  msg:  118  (additional 88 seconds
after cable pulled),

                then blocked.

            I terminated more than 4 mins later, still had not received
exception



* 01_03d:   sent msg�s size = 1600 bytes  (problem case,  app only logging)

          07:18:27 last  Heartbeat RECV & SENT

       about 07:18:28  cable pulled at (about at sent msg:  30)

         Note:  client continued sending 11 more msgs (11 more seconds).

            Then the client app continued to make calls
MessageReplayTracker::send(
)

                 without receiving an exception, but the traces stopped for
Frame SENT�s

                 NOTE:  this is before the  heartbeat timeout should have
occurred.

         07:18:43 debug Traffic timeout, but still no exception caught by
app.

         The app continued to make calls  MessageReplayTracker::send( )

                 without receiving an exception.

            Continued to send until  msg:  118  (additional 88 seconds
after cable pulled),

                then app was blocked.

            I terminated more than 4 mins later, still had not received
exception





Some additional notes about this client:

* This client (as with some of the instances in our deployed application)
is a message sender only.

       * since not intending on receiving messages on this client, we
did NOT derive this client from qpid::client::MessageListener

   * sending using:   qpid::client::MessageReplayTracker::send( )

   * using async sending (for speed), not calling  Session::sync( ).

        However, I tried including adding  sync�s, and found that, when
sending the 1600 byte msgs, the client would only block on the sync and
still not get an exception for the heartbeat timeouts.

    * I added a call to  Connection::isOpen( ) in the loop to see if this
would trigger the exception, but it did not.

+ we are using C++ qpid client 0.7 and qpidd 0.7 (linux 2.6 x86_64 on both
client and broker hosts) .   In our deployed systems, we will not be able
to upgrade qpid versions for at least several months.

I will post the test code in another message.

Please let us know what you can find.

Thanks,

Tom



On Thu, Dec 8, 2011 at 7:37 AM, Tom M <td...@gmail.com> wrote:

> Gordon,
> thank you for your response.
>
> Yes, the test with larger message has failed every time.
> I will look at sending the trace logs and a thread dump from the client
> from our system.
>
> thanks,
> Tom Maggio
> Raytheon Co.
> Dallas, TX
> 972/205-4377
>
>   On Wed, Dec 7, 2011 at 8:37 AM, Gordon Sim <gs...@redhat.com> wrote:
>
>> On 12/07/2011 01:55 AM, Tom M wrote:
>>
>>> Hello,
>>>
>>> we are having a problem with our MRG (qpid) system:
>>>
>>> * when sending messages with size of 1600bytes, a connection (used for
>>> sending from client) does not detect the host connection is lost via
>>> heartbeat timeout.
>>>
>>> + we are using C++ qpid client 0.7 and qpidd 0.7 (linux 2.6 x86_64 on
>>> both
>>> client and broker hosts)
>>>
>>> and Ethernet connection (TCP/IP) between hosts
>>>
>>>     + for this connection we have: ConnectionSettings
>>> connectionSettings.heartbeat = 8
>>>
>>>     + simulating a system failure by pulling the ethernet cable to the
>>> broker host
>>>
>>>     + the connection close Exception is caught by the client after many
>>> minutes (6 to 20mins), I'm guessing this is due to the TCP timeout and
>>> not
>>> the missed heartbeats.
>>>
>>>     + with the same exact application (for our client), if sending
>>> messages
>>> of 200bytes, we do get the qpid exception indicating the Connection
>>> closed
>>> (catch TransportFailure Exception: connection closed) within 16 seconds.
>>> For this testing, there were no other changes between the 2 cases, other
>>> than the size of the messages sent from the client (only expanded the
>>> size
>>> of the string in the body of the message) (1 message sent per second in
>>> both cases).
>>>
>>> * is this a known problem with qpid 0.7?
>>>
>>
>> No, i don't think this is a known issue.
>>
>>
>> * is there patch to fix this for qpid 0.7?
>>>
>>> * has this problem already been fixed in later releases?
>>>
>>> NOTE: we have already deployed qpid 0.7 in our system, and we will not be
>>> able to upgrade to a newer full release for many months.
>>>
>>> I'm wondering if the problem is that the connection gets blocked with the
>>> first TCP packet of a multiple packet message, such that the heartbeat
>>> detection is disabled until the full message is sent. But, if the
>>> multi-packet message can not complete (since socket is broken), the
>>> heartbeat logic is held disabled until the multi-packet message can
>>> complete (which in this case it can not).
>>>
>>
>> There is nothing that directly (intentionally) does anything like this.
>> However it may be possible that there is some deadlock or liveness issue
>> that prevents correct function in some cases.
>>
>> Is the test always failing with the larger message size? There is
>> actually no difference in the AMQP framing for a 200 byte v a 1600 byte
>> message. It may just be that the different timing of the larger write
>> somehow triggers the issue.
>>
>> Can you get trace level logs and a thread dump from the client for a
>> failed case?
>>
>> ------------------------------**------------------------------**---------
>> Apache Qpid - AMQP Messaging Implementation
>> Project:      http://qpid.apache.org
>> Use/Interact: mailto:users-subscribe@qpid.**apache.org<us...@qpid.apache.org>
>>
>>
>

Re: problem with qpid heartbeats when sending msgs with size over 1KB

Posted by Tom M <td...@gmail.com>.

Gordon,
thank you for your response.

Yes, the test with larger message has failed every time.
I will look at sending the trace logs and a thread dump from the client
from our system.

thanks,
Tom Maggio
Raytheon Co.
Dallas, TX
972/205-4377

On Wed, Dec 7, 2011 at 8:37 AM, Gordon Sim <gs...@redhat.com> wrote:

> On 12/07/2011 01:55 AM, Tom M wrote:
>
>> Hello,
>>
>> we are having a problem with our MRG (qpid) system:
>>
>> * when sending messages with size of 1600bytes, a connection (used for
>> sending from client) does not detect the host connection is lost via
>> heartbeat timeout.
>>
>> + we are using C++ qpid client 0.7 and qpidd 0.7 (linux 2.6 x86_64 on both
>> client and broker hosts)
>>
>> and Ethernet connection (TCP/IP) between hosts
>>
>>     + for this connection we have: ConnectionSettings
>> connectionSettings.heartbeat = 8
>>
>>     + simulating a system failure by pulling the ethernet cable to the
>> broker host
>>
>>     + the connection close Exception is caught by the client after many
>> minutes (6 to 20mins), I'm guessing this is due to the TCP timeout and not
>> the missed heartbeats.
>>
>>     + with the same exact application (for our client), if sending
>> messages
>> of 200bytes, we do get the qpid exception indicating the Connection closed
>> (catch TransportFailure Exception: connection closed) within 16 seconds.
>> For this testing, there were no other changes between the 2 cases, other
>> than the size of the messages sent from the client (only expanded the size
>> of the string in the body of the message) (1 message sent per second in
>> both cases).
>>
>> * is this a known problem with qpid 0.7?
>>
>
> No, i don't think this is a known issue.
>
>
> * is there patch to fix this for qpid 0.7?
>>
>> * has this problem already been fixed in later releases?
>>
>> NOTE: we have already deployed qpid 0.7 in our system, and we will not be
>> able to upgrade to a newer full release for many months.
>>
>> I'm wondering if the problem is that the connection gets blocked with the
>> first TCP packet of a multiple packet message, such that the heartbeat
>> detection is disabled until the full message is sent. But, if the
>> multi-packet message can not complete (since socket is broken), the
>> heartbeat logic is held disabled until the multi-packet message can
>> complete (which in this case it can not).
>>
>
> There is nothing that directly (intentionally) does anything like this.
> However it may be possible that there is some deadlock or liveness issue
> that prevents correct function in some cases.
>
> Is the test always failing with the larger message size? There is actually
> no difference in the AMQP framing for a 200 byte v a 1600 byte message. It
> may just be that the different timing of the larger write somehow triggers
> the issue.
>
> Can you get trace level logs and a thread dump from the client for a
> failed case?
>
> ------------------------------**------------------------------**---------
> Apache Qpid - AMQP Messaging Implementation
> Project:      http://qpid.apache.org
> Use/Interact: mailto:users-subscribe@qpid.**apache.org<us...@qpid.apache.org>
>
>

Re: problem with qpid heartbeats when sending msgs with size over 1KB

Posted by Gordon Sim <gs...@redhat.com>.

On 12/07/2011 01:55 AM, Tom M wrote:
> Hello,
>
> we are having a problem with our MRG (qpid) system:
>
> * when sending messages with size of 1600bytes, a connection (used for
> sending from client) does not detect the host connection is lost via
> heartbeat timeout.
>
> + we are using C++ qpid client 0.7 and qpidd 0.7 (linux 2.6 x86_64 on both
> client and broker hosts)
>
> and Ethernet connection (TCP/IP) between hosts
>
>      + for this connection we have: ConnectionSettings
> connectionSettings.heartbeat = 8
>
>      + simulating a system failure by pulling the ethernet cable to the
> broker host
>
>      + the connection close Exception is caught by the client after many
> minutes (6 to 20mins), I'm guessing this is due to the TCP timeout and not
> the missed heartbeats.
>
>      + with the same exact application (for our client), if sending messages
> of 200bytes, we do get the qpid exception indicating the Connection closed
> (catch TransportFailure Exception: connection closed) within 16 seconds.
> For this testing, there were no other changes between the 2 cases, other
> than the size of the messages sent from the client (only expanded the size
> of the string in the body of the message) (1 message sent per second in
> both cases).
>
> * is this a known problem with qpid 0.7?

No, i don't think this is a known issue.

> * is there patch to fix this for qpid 0.7?
>
> * has this problem already been fixed in later releases?
>
> NOTE: we have already deployed qpid 0.7 in our system, and we will not be
> able to upgrade to a newer full release for many months.
>
> I'm wondering if the problem is that the connection gets blocked with the
> first TCP packet of a multiple packet message, such that the heartbeat
> detection is disabled until the full message is sent. But, if the
> multi-packet message can not complete (since socket is broken), the
> heartbeat logic is held disabled until the multi-packet message can
> complete (which in this case it can not).

There is nothing that directly (intentionally) does anything like this. 
However it may be possible that there is some deadlock or liveness issue 
that prevents correct function in some cases.

Is the test always failing with the larger message size? There is 
actually no difference in the AMQP framing for a 200 byte v a 1600 byte 
message. It may just be that the different timing of the larger write 
somehow triggers the issue.

Can you get trace level logs and a thread dump from the client for a 
failed case?

---------------------------------------------------------------------
Apache Qpid - AMQP Messaging Implementation
Project:      http://qpid.apache.org
Use/Interact: mailto:users-subscribe@qpid.apache.org