Posted to users@qpid.apache.org by Charles Woerner <CW...@demandbase.com> on 2010/03/12 01:30:22 UTC
store and forward queue recovery problem
Hello,
I'm unable to re-establish message forwarding across a queue route
once the destination queue reaches its max-queue-size limit. The
steps to reproduce are as follows:
1) establish a src-local queue route between a web server node and a
backend queue server
On the web server (www-1)...
qpid-config add queue hits_local --durable --file-count 32 --file-size 5120 --limit-policy flow-to-disk
qpid-config bind amq.direct hits_local /hits
qpid-route -d -s queue add queue-1:5672 localhost:5672 amq.direct hits_local
On the queue server (queue-1)...
qpid-config add queue hits --durable --file-count 64 --file-size 20480 --max-queue-size 1048576000 --limit-policy flow-to-disk
qpid-config bind amq.direct hits /hits
2) Enqueue 1048576000 or more bytes of data into the local queue on
www-1, watch it flow across the federation link to the hits queue on
queue-1.
3) Continue loading data and see (through qpid-tool) that messages are
no longer being forwarded along the federation link; instead they are
being retained on the hits_local queue on www-1.
4) Purge the hits queue on queue-1 using qpid-tool (call <id> purge 0)
At this point I would expect enqueued messages on www-1 to begin to
flow across the federation link to the queue-1 server, but they do
not. Similarly, new enqueues to hits_local on www-1 are retained on
www-1 rather than being forwarded across the link to queue-1.
I had to destroy the local queue, queue route, store, and link on
www-1 and recreate them in order to induce message flow across the
link again. Is that expected? What can we do differently to avoid
data loss in a recovery scenario such as this?
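For a rough sense of scale in step 2, the message count needed to cross the limit can be estimated with plain Python. The ~842-byte average message size is an assumption derived from the queue stats quoted later in this thread (byteTotalEnqueues / msgTotalEnqueues), not a measured constant:

```python
# Rough sizing sketch, not part of the original repro steps.
# AVG_MSG_BYTES is an assumption taken from stats reported later
# in this thread (byteTotalEnqueues // msgTotalEnqueues).
LIMIT_BYTES = 1048576000                 # --max-queue-size on the hits queue
AVG_MSG_BYTES = 2024285350 // 2402709    # ~842 bytes per message

# Smallest message count whose total payload exceeds the limit.
messages_needed = LIMIT_BYTES // AVG_MSG_BYTES + 1
print(AVG_MSG_BYTES, messages_needed)    # 842 1245340
```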
__
Charles Woerner | cwoerner@demandbase.com | demandbase
Re: store and forward queue recovery problem
Posted by Charles Woerner <cw...@demandbase.com>.
If you give me a patched build in RPM form I'll test it for you.
On Mar 12, 2010, at 11:50 AM, Kim van der Riet wrote:
> [...]
Re: store and forward queue recovery problem
Posted by Kim van der Riet <ki...@redhat.com>.
Thanks for the detail. I had thought that you had suffered a recovery
failure in which phase 1 of recovery had failed, i.e. the store was
unable to analyze the stored messages on disk owing to some sort of
disk corruption or similar. But much of the detail here is as you have
already described it - sorry for being vague in my request.
I have looked closely at the code, and believe that there may be a bug
in the recovery section of the code. The recovery code does not appear
to enforce queue policy during the recovery, allowing (as I believe you
have observed) the loading of messages to exceed policy. I already have
a code fix for this - all I lack at the moment is a test case. The
problem is that I can't run huge tests like yours; I need to expose this
behavior using a more modest case, and that cannot be done from the
client alone. Whether or not message content is in fact being
released/discarded from memory requires broker access to ascertain. Perhaps
I can find another way to test that this works from the client alone.
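The suspected behavior can be sketched in a toy model (purely illustrative Python; the class, field names, and sizes below are invented for the sketch and are not the broker's actual code or data structures): the normal enqueue path enforces the size policy by flowing excess content to disk, while the recovery path loads everything into memory unchecked.

```python
# Toy model of the suspected bug: normal enqueues enforce the
# queue-size policy (flow-to-disk), but recovery bypasses it.
# Illustrative sketch only -- not the actual qpid broker code.

class Queue:
    def __init__(self, max_size):
        self.max_size = max_size   # qpid.max_size analogue, in bytes
        self.in_memory = 0         # bytes of message content held in RAM
        self.on_disk = 0           # bytes flowed to disk by the policy

    def enqueue(self, size):
        # Normal path: enforce flow-to-disk once the limit is reached.
        if self.in_memory + size > self.max_size:
            self.on_disk += size
        else:
            self.in_memory += size

    def recover_buggy(self, sizes):
        # Suspected bug: the policy is never consulted during recovery.
        for size in sizes:
            self.in_memory += size

    def recover_fixed(self, sizes):
        # Fix: route recovered messages through the normal policy check.
        for size in sizes:
            self.enqueue(size)

journal = [1000] * 2000                 # 2 MB of persisted messages
buggy = Queue(max_size=1_000_000)
buggy.recover_buggy(journal)
print(buggy.in_memory)                  # 2000000 -- twice the limit

fixed = Queue(max_size=1_000_000)
fixed.recover_fixed(journal)
print(fixed.in_memory, fixed.on_disk)   # 1000000 1000000
```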
I plan to open a Jira for this on Monday.
Thanks once again for your help.
On Fri, 2010-03-12 at 10:41 -0800, Charles Woerner wrote:
> [...]
---------------------------------------------------------------------
Apache Qpid - AMQP Messaging Implementation
Project: http://qpid.apache.org
Use/Interact: mailto:users-subscribe@qpid.apache.org
Re: store and forward queue recovery problem
Posted by Charles Woerner <cw...@demandbase.com>.
Sure. As I mentioned, I was running a test where I shut down the
consumers and enqueued large amounts of data. I was running these
tests in ec2 in a store and forward (src-local queue route) topology.
The local s&f broker had a small-ish (10 GB) store and a single
durable queue with a default max-queue-size limit and a flow-to-disk
policy. The s&f broker had a queue route to a durable queue on the
central broker with a 100 MB store with a max-queue-size limit of 1 GB
and a flow-to-disk policy. Although the max-queue-size was only 1 GB,
qpid continued to accept messages and acquire memory beyond the
physical memory limit and into swap - at this point it died. So I
tried to restart it, but qpid kept complaining that it lost contact
with a child process. So I allocated more swap and tried to restart
it and, although this helped me overcome the initial critical error,
after 2 hours it was still unresponsive (and deep into swap) and had
not yet bound to the amqp port. I then killed the process with a
"kill <pid>". I then detached the storage device, attached it to a
larger machine, and restarted qpid, and it was able to start up cleanly.
However, the messages which had begun to build up on the s&f broker's
queue while the central broker was down were not automatically
delivered.
I don't claim that there is nothing I could have done to re-
establish the message flow, but it did not appear to re-establish
itself on its own. I did try deleting the link and re-creating it
but this did not work. I also tried purging the central broker's
queue using qpid-tool thinking that maybe there was a corrupt message
keeping it from being able to accept new ones, but this did not work.
For what it's worth, the ip address of the central broker changed
between the initial small central broker and the upgraded larger
central broker, so I imagine this didn't help things. But that's why
I destroyed the link and re-created it.
When I cleared the slate and re-ran the enqueue tests between the s&f
broker and the larger central broker, at some point long into the
process (when about 2x the policy limit had been reached and messages
had been "flowing-to-disk" for some time) the central broker crashed
without any error messages. Again messages started building up on the
s&f queue, but this time when I restarted qpid on the central broker
the link was automatically re-established and messages from the s&f
queue were transferred to the central broker properly.
<shrugs>
On Mar 12, 2010, at 5:26 AM, Kim van der Riet wrote:
> [...]
Re: store and forward queue recovery problem
Posted by Kim van der Riet <ki...@redhat.com>.
On Thu, 2010-03-11 at 18:53 -0800, Charles Woerner wrote:
> [...]
Can you provide further details on the store unrecoverability you
encountered?
Re: store and forward queue recovery problem
Posted by Charles Woerner <cw...@demandbase.com>.
Wow, and nevermind. As I wrote that the queue stats updated and
apparently the link was re-established and the entire contents of the
store and forward queue were now flushed to the destination broker
properly. Seems to work as designed! The only real problem I can
report is that when a destination broker dies due to memory starvation
the store may be left in a state which makes it subsequently
unrecoverable. But I am unable to reproduce on this new larger
machine. Case closed. Thanks, and sorry for the noise.
Re: store and forward queue recovery problem
Posted by Charles Woerner <cw...@demandbase.com>.
Oops, those are the stats for the store-and-forward broker... the
destination broker stats (after restart) are:
qpid: show queue 111
Object of type org.apache.qpid.broker:queue: (last sample time: 02:42:59)
Type Element 111
====================================
property vhostRef 103
property name hits
property durable True
property autoDelete False
property exclusive False
property arguments {u'qpid.max_size': 1048576000L, u'qpid.file_size': 20480L, u'qpid.file_count': 64L, u'qpid.policy_type': u'flow_to_disk'}
statistic msgTotalEnqueues 2402709 messages
statistic msgTotalDequeues 0
statistic msgTxnEnqueues 0
statistic msgTxnDequeues 0
statistic msgPersistEnqueues 2402709
statistic msgPersistDequeues 0
statistic msgDepth 2402709
statistic byteDepth 2024285350 octets
statistic byteTotalEnqueues 2024285350
statistic byteTotalDequeues 0
statistic byteTxnEnqueues 0
statistic byteTxnDequeues 0
statistic bytePersistEnqueues 2024285350
statistic bytePersistDequeues 0
statistic consumerCount 0 consumers
statistic consumerCountHigh 0
statistic consumerCountLow 0
statistic bindingCount 2 bindings
statistic bindingCountHigh 2
statistic bindingCountLow 2
statistic unackedMessages 0 messages
statistic unackedMessagesHigh 0
statistic unackedMessagesLow 0
statistic messageLatencySamples 0
statistic messageLatencyMin 0
statistic messageLatencyMax 0
statistic messageLatencyAverage 0
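A quick arithmetic check on the stats above (plain Python, using only the numbers as reported): the recovered byteDepth is nearly double the configured qpid.max_size, consistent with the policy not being enforced during recovery.

```python
# Sanity check on the recovered destination-queue stats quoted above.
MAX_SIZE   = 1048576000   # qpid.max_size from the queue arguments
BYTE_DEPTH = 2024285350   # byteDepth reported after recovery
MSG_DEPTH  = 2402709      # msgDepth reported after recovery

overshoot = BYTE_DEPTH / MAX_SIZE     # how far past the limit the queue got
avg_msg   = BYTE_DEPTH // MSG_DEPTH   # average message size in bytes
print(round(overshoot, 2), avg_msg)   # 1.93 842
```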
Re: store and forward queue recovery problem
Posted by Charles Woerner <cw...@demandbase.com>.
To follow up...
> 2) Enqueue 1048576000 or more bytes of data into the local queue on
> www-1, watch it flow across the federation link to the hits queue on
> queue-1.
Actually, the byteDepth of the destination queue ends up exceeding the
configured max-queue-size limit by a great deal - then the server dies,
and when it is restarted the store-and-forward brokers are never
able to forward messages over the federation link again. For
instance, at the time that I brought the destination broker back up
the recovered queue stats were as follows:
qpid: show queue 113
Object of type org.apache.qpid.broker:queue: (last sample time: 02:43:01)
Type Element 113
====================================
property vhostRef 105
property name hits_local
property durable True
property autoDelete False
property exclusive False
property arguments {u'qpid.file_size': 5120L, u'qpid.file_count': 32L, u'qpid.policy_type': u'flow_to_disk'}
statistic msgTotalEnqueues 3309228 messages
statistic msgTotalDequeues 3309228
statistic msgTxnEnqueues 0
statistic msgTxnDequeues 0
statistic msgPersistEnqueues 3309228
statistic msgPersistDequeues 3309228
statistic msgDepth 0
statistic byteDepth 0 octets
statistic byteTotalEnqueues 2788025056
statistic byteTotalDequeues 2788025056
statistic byteTxnEnqueues 0
statistic byteTxnDequeues 0
statistic bytePersistEnqueues 2788025056
statistic bytePersistDequeues 2788025056
statistic consumerCount 1 consumer
statistic consumerCountHigh 1
statistic consumerCountLow 1
statistic bindingCount 2 bindings
statistic bindingCountHigh 2
statistic bindingCountLow 2
statistic unackedMessages 0 messages
statistic unackedMessagesHigh 0
statistic unackedMessagesLow 0
statistic messageLatencySamples 0
statistic messageLatencyMin 0
statistic messageLatencyMax 0
statistic messageLatencyAverage 0