Posted to users@qpid.apache.org by Charles Woerner <CW...@demandbase.com> on 2010/03/12 01:30:22 UTC

store and forward queue recovery problem

Hello,

I'm unable to re-establish message forwarding across a queue route
once the destination queue reaches its max-queue-size limit.  The
steps to reproduce are as follows:

1) establish a src-local queue route between a web server node and a  
backend queue server

	On the web server (www-1)...

	qpid-config add queue hits_local --durable --file-count 32 --file-size 5120 --limit-policy flow-to-disk
	qpid-config bind amq.direct hits_local /hits
	qpid-route -d -s queue add queue-1:5672 localhost:5672 amq.direct hits_local

	On the queue server (queue-1)...

	qpid-config add queue hits --durable --file-count 64 --file-size 20480 --max-queue-size 1048576000 --limit-policy flow-to-disk
	qpid-config bind amq.direct hits /hits
2) Enqueue 1048576000 or more bytes of data into the local queue on  
www-1, watch it flow across the federation link to the hits queue on  
queue-1.
3) Continue loading data and see (through qpid-tool) that messages are  
no longer being forwarded along the federation link, instead they are  
being retained on the hits_local queue on www-1.
4) Purge the hits queue on queue-1 using qpid-tool (call <id> purge 0).
A rough sketch of driving steps 2 and 3 follows.
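
(For reference: qpid-send and qpid-stat below come from the qpid tools
packages; the flags and addresses are illustrative rather than the
exact commands I ran.)

	# enqueue ~1 GB of 1 KB messages into the local queue on www-1
	qpid-send -b localhost:5672 -a hits_local -m 1100000 --content-size 1024

	# watch both ends; forwarding has stalled once msgDepth keeps
	# growing on hits_local while it stops moving on hits
	qpid-stat -q localhost:5672    # on www-1: hits_local
	qpid-stat -q queue-1:5672      # destination: hits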

At this point I would expect enqueued messages on www-1 to begin to  
flow across the federation link to the queue-1 server, but they do  
not.  Similarly, new enqueues to hits_local on www-1 are retained on  
www-1 rather than being forwarded across the link to queue-1.

I had to destroy the local queue, queue route, store, and link on  
www-1 and recreate them in order to induce message flow across the  
link again.  Is that expected?  What can we do differently to avoid  
data loss in a recovery scenario such as this?
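
(For completeness, the teardown and recreate were roughly as follows -
the command forms are from memory and the store path is illustrative,
so adjust both for your installation:)

	# on www-1: remove the queue route and the federation link
	qpid-route -s queue del queue-1:5672 localhost:5672 amq.direct hits_local
	qpid-route link del queue-1:5672 localhost:5672

	# delete the local queue, then stop the broker and wipe its store
	qpid-config del queue hits_local --force
	service qpidd stop
	rm -rf /var/lib/qpidd/rhm/*    # store location varies by build
	service qpidd start

	# finally, recreate the queue, binding, and route as in step 1
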
__

Charles Woerner  | cwoerner@demandbase.com |   demandbase







Re: store and forward queue recovery problem

Posted by Charles Woerner <cw...@demandbase.com>.
If you give me a patched build in RPM form, I'll test it for you.

On Mar 12, 2010, at 11:50 AM, Kim van der Riet wrote:

> Thanks for the detail. I had thought that you had suffered a recovery
> failure in which phase 1 recovery had failed - i.e. the ability of the
> store to analyze the stored messages from the disk owing to some  
> sort of
> disk corruption or similar. But much of the detail here is as you have
> already described it - sorry for being vague in my request.
>
> I have looked closely at the code, and believe that there may be a bug
> in the recovery section of the code. The recovery code does not appear
> to enforce queue policy during the recovery, allowing (as I believe  
> you
> have observed) the loading of messages to exceed policy. I already  
> have
> a code fix for this - all I lack at the moment is a test case. The
> problem is that I can't run huge tests like yours, I need to expose  
> this
> behavior using a more modest case, and that cannot be done from the
> client alone. Whether or not message content is in fact being
> released/discarded from memory requires broker access to ascertain.
> Perhaps I can find another way to test that this works from the
> client alone.
>
> I plan to open a Jira for this on Monday.
>
> Thanks once again for your help.
>
> On Fri, 2010-03-12 at 10:41 -0800, Charles Woerner wrote:
>> Sure.  As I mentioned, I was running a test where I shut down the
>> consumers and enqueued large amounts of data.  I was running these
>> tests in ec2 in a store and forward (src-local queue route) topology.
>> The local s&f broker had a small-ish (10 GB) store and a single
>> durable queue with a default max-queue-size limit and a flow-to-disk
>> policy.  The s&f broker had a queue route to a durable queue on the
>> central broker with a 100 MB store with a max-queue-size limit of 1  
>> GB
>> and a flow-to-disk policy.  Although the max-queue-size was only 1  
>> GB,
>> qpid continued to accept messages and acquire memory beyond the
>> physical memory limit and into swap - at this point it died.  So I
>> tried to restart it, but qpid kept complaining that it lost contact
>> with a child process.  So I allocated more swap and tried to restart
>> it and, although this helped me overcome the initial critical error,
>> after 2 hours it was still unresponsive (and deep into swap) and had
>> not yet bound to the amqp port.  I then killed the process with a
>> "kill <pid>".  I then detached the storage device, attached it to a
>> larger machine and restarted qpid, and it was able to start up cleanly.
>> However, the messages which had begun to build up on the s&f broker's
>> queue while the central broker was down were not automatically
>> delivered.
>>
>> I don't claim that there was nothing I could have done to
>> re-establish the message flow, but it did not appear to re-establish
>> itself on its own.  I did try deleting the link and re-creating it
>> but this did not work.  I also tried purging the central broker's
>> queue using qpid-tool thinking that maybe there was a corrupt message
>> keeping it from being able to accept new ones, but this did not work.
>> For what it's worth, the IP address of the central broker changed
>> between the initial small central broker and the upgraded larger
>> central broker, so I imagine this didn't help things.  But that's why
>> I destroyed the link and re-created it.
>>
>> When I cleared the slate and re-ran the enqueue tests between the s&f
>> broker and the larger central broker again at some point long into  
>> the
>> process (when about 2x the policy limit had been reached and messages
>> had been "flowing-to-disk" for some time) the central broker crashed
>> without any error messages.  Again messages started building up on  
>> the
>> s&f queue, but this time when I restarted qpid on the central broker
>> the link was automatically re-established and messages from the s&f
>> queue were transferred to the central broker properly.
>>
>> <shrugs>
>>
>> On Mar 12, 2010, at 5:26 AM, Kim van der Riet wrote:
>>
>>> On Thu, 2010-03-11 at 18:53 -0800, Charles Woerner wrote:
>>>> Wow, and never mind.  As I wrote that, the queue stats updated and
>>>> apparently the link was re-established and the entire contents of
>>>> the store and forward queue were now flushed to the destination
>>>> broker properly.  Seems to work as designed!  The only real problem
>>>> I can report is that when a destination broker dies due to memory
>>>> starvation the store may be left in a state which makes it
>>>> subsequently unrecoverable.  But I am unable to reproduce on this
>>>> new larger machine.  Case closed.  Thanks, and sorry for the noise.
>>>
>>> Can you provide further details on the store unrecoverability you
>>> encountered?
>>>
>>>
>>> ---------------------------------------------------------------------
>>> Apache Qpid - AMQP Messaging Implementation
>>> Project:      http://qpid.apache.org
>>> Use/Interact: mailto:users-subscribe@qpid.apache.org
>>>
>>
>> __
>>
>> Charles Woerner  | cwoerner@demandbase.com |   demandbase   |
>> 415.683.2669
>>
>>
>>
>>
>>
>>
>
>
>
> ---------------------------------------------------------------------
> Apache Qpid - AMQP Messaging Implementation
> Project:      http://qpid.apache.org
> Use/Interact: mailto:users-subscribe@qpid.apache.org
>

__

Charles Woerner  | cwoerner@demandbase.com |   demandbase   |   
415.683.2669







Re: store and forward queue recovery problem

Posted by Kim van der Riet <ki...@redhat.com>.
Thanks for the detail. I had thought that you had suffered a recovery
failure in which phase 1 recovery had failed - i.e. the ability of the
store to analyze the stored messages from the disk owing to some sort of
disk corruption or similar. But much of the detail here is as you have
already described it - sorry for being vague in my request.

I have looked closely at the code, and believe that there may be a bug
in the recovery section of the code. The recovery code does not appear
to enforce queue policy during the recovery, allowing (as I believe you
have observed) the loading of messages to exceed policy. I already have
a code fix for this - all I lack at the moment is a test case. The
problem is that I can't run huge tests like yours, I need to expose this
behavior using a more modest case, and that cannot be done from the
client alone. Whether or not message content is in fact being
released/discarded from memory requires broker access to ascertain.
Perhaps I can find another way to test that this works from the client
alone.
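
For what it's worth, a modest approximation might look something like
this (tool flags are illustrative and from memory, and qpid-send in
particular may differ between versions):

	# a small durable queue with a deliberately tiny limit
	qpid-config add queue policy_test --durable --max-queue-size 102400 --limit-policy flow-to-disk

	# load several times the limit so that content flows to disk
	qpid-send -b localhost:5672 -a policy_test -m 1000 --content-size 1024

	# restart the broker, then watch the recovered depth and (crudely)
	# the broker's memory; if recovery ignores policy, all of the
	# content is reloaded into memory instead of staying on disk
	qpid-stat -q localhost:5672
	ps -o rss= -C qpidd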

I plan to open a Jira for this on Monday.

Thanks once again for your help.

On Fri, 2010-03-12 at 10:41 -0800, Charles Woerner wrote:
> Sure.  As I mentioned, I was running a test where I shut down the  
> consumers and enqueued large amounts of data.  I was running these  
> tests in ec2 in a store and forward (src-local queue route) topology.   
> The local s&f broker had a small-ish (10 GB) store and a single  
> durable queue with a default max-queue-size limit and a flow-to-disk  
> policy.  The s&f broker had a queue route to a durable queue on the
> central broker with a 100 MB store with a max-queue-size limit of 1 GB  
> and a flow-to-disk policy.  Although the max-queue-size was only 1 GB,  
> qpid continued to accept messages and acquire memory beyond the  
> physical memory limit and into swap - at this point it died.  So I  
> tried to restart it, but qpid kept complaining that it lost contact  
> with a child process.  So I allocated more swap and tried to restart  
> it and, although this helped me overcome the initial critical error,
> after 2 hours it was still unresponsive (and deep into swap) and had  
> not yet bound to the amqp port.  I then killed the process with a  
> "kill <pid>".  I then detached the storage device, attached it to a  
> larger machine and restarted qpid, and it was able to start up cleanly.
> However, the messages which had begun to build up on the s&f broker's  
> queue while the central broker was down were not automatically  
> delivered.
> 
> I don't claim that there was nothing I could have done to
> re-establish the message flow, but it did not appear to re-establish
> itself on its own.  I did try deleting the link and re-creating it
> but this did not work.  I also tried purging the central broker's  
> queue using qpid-tool thinking that maybe there was a corrupt message  
> keeping it from being able to accept new ones, but this did not work.   
> For what it's worth, the IP address of the central broker changed
> between the initial small central broker and the upgraded larger  
> central broker, so I imagine this didn't help things.  But that's why  
> I destroyed the link and re-created it.
> 
> When I cleared the slate and re-ran the enqueue tests between the s&f  
> broker and the larger central broker again at some point long into the  
> process (when about 2x the policy limit had been reached and messages  
> had been "flowing-to-disk" for some time) the central broker crashed  
> without any error messages.  Again messages started building up on the  
> s&f queue, but this time when I restarted qpid on the central broker  
> the link was automatically re-established and messages from the s&f  
> queue were transferred to the central broker properly.
> 
> <shrugs>
> 
> On Mar 12, 2010, at 5:26 AM, Kim van der Riet wrote:
> 
> > On Thu, 2010-03-11 at 18:53 -0800, Charles Woerner wrote:
> >> Wow, and never mind.  As I wrote that, the queue stats updated and
> >> apparently the link was re-established and the entire contents of the
> >> store and forward queue were now flushed to the destination broker
> >> properly.  Seems to work as designed!  The only real problem I can
> >> report is that when a destination broker dies due to memory  
> >> starvation
> >> the store may be left in a state which makes it subsequently
> >> unrecoverable.  But I am unable to reproduce on this new larger
> >> machine.  Case closed.  Thanks, and sorry for the noise.
> >
> > Can you provide further details on the store unrecoverability you
> > encountered?
> >
> >
> > ---------------------------------------------------------------------
> > Apache Qpid - AMQP Messaging Implementation
> > Project:      http://qpid.apache.org
> > Use/Interact: mailto:users-subscribe@qpid.apache.org
> >
> 
> __
> 
> Charles Woerner  | cwoerner@demandbase.com |   demandbase   |   
> 415.683.2669
> 
> 
> 
> 
> 
> 



---------------------------------------------------------------------
Apache Qpid - AMQP Messaging Implementation
Project:      http://qpid.apache.org
Use/Interact: mailto:users-subscribe@qpid.apache.org


Re: store and forward queue recovery problem

Posted by Charles Woerner <cw...@demandbase.com>.
Sure.  As I mentioned, I was running a test where I shut down the  
consumers and enqueued large amounts of data.  I was running these  
tests in ec2 in a store and forward (src-local queue route) topology.   
The local s&f broker had a small-ish (10 GB) store and a single  
durable queue with a default max-queue-size limit and a flow-to-disk  
policy.  The s&f broker had a queue route to a durable queue on the
central broker with a 100 MB store with a max-queue-size limit of 1 GB  
and a flow-to-disk policy.  Although the max-queue-size was only 1 GB,  
qpid continued to accept messages and acquire memory beyond the  
physical memory limit and into swap - at this point it died.  So I  
tried to restart it, but qpid kept complaining that it lost contact  
with a child process.  So I allocated more swap and tried to restart  
it and, although this helped me overcome the initial critical error,
after 2 hours it was still unresponsive (and deep into swap) and had  
not yet bound to the amqp port.  I then killed the process with a  
"kill <pid>".  I then detached the storage device, attached it to a  
larger machine and restarted qpid, and it was able to start up cleanly.
However, the messages which had begun to build up on the s&f broker's  
queue while the central broker was down were not automatically  
delivered.

I don't claim that there was nothing I could have done to
re-establish the message flow, but it did not appear to re-establish
itself on its own.  I did try deleting the link and re-creating it
but this did not work.  I also tried purging the central broker's  
queue using qpid-tool thinking that maybe there was a corrupt message  
keeping it from being able to accept new ones, but this did not work.   
For what it's worth, the IP address of the central broker changed
between the initial small central broker and the upgraded larger  
central broker, so I imagine this didn't help things.  But that's why  
I destroyed the link and re-created it.
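
(The purge attempt, for reference, was along these lines in qpid-tool,
with <id> being whatever object id "list queue" reports for the hits
queue:)

	$ qpid-tool queue-1:5672
	qpid: list queue
	qpid: call <id> purge 0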

When I cleared the slate and re-ran the enqueue tests between the s&f  
broker and the larger central broker again at some point long into the  
process (when about 2x the policy limit had been reached and messages  
had been "flowing-to-disk" for some time) the central broker crashed  
without any error messages.  Again messages started building up on the  
s&f queue, but this time when I restarted qpid on the central broker  
the link was automatically re-established and messages from the s&f  
queue were transferred to the central broker properly.

<shrugs>

On Mar 12, 2010, at 5:26 AM, Kim van der Riet wrote:

> On Thu, 2010-03-11 at 18:53 -0800, Charles Woerner wrote:
>> Wow, and never mind.  As I wrote that, the queue stats updated and
>> apparently the link was re-established and the entire contents of the
>> store and forward queue were now flushed to the destination broker
>> properly.  Seems to work as designed!  The only real problem I can
>> report is that when a destination broker dies due to memory  
>> starvation
>> the store may be left in a state which makes it subsequently
>> unrecoverable.  But I am unable to reproduce on this new larger
>> machine.  Case closed.  Thanks, and sorry for the noise.
>
> Can you provide further details on the store unrecoverability you
> encountered?
>
>
> ---------------------------------------------------------------------
> Apache Qpid - AMQP Messaging Implementation
> Project:      http://qpid.apache.org
> Use/Interact: mailto:users-subscribe@qpid.apache.org
>

__

Charles Woerner  | cwoerner@demandbase.com |   demandbase   |   
415.683.2669







Re: store and forward queue recovery problem

Posted by Kim van der Riet <ki...@redhat.com>.
On Thu, 2010-03-11 at 18:53 -0800, Charles Woerner wrote:
> Wow, and never mind.  As I wrote that, the queue stats updated and
> apparently the link was re-established and the entire contents of the  
> store and forward queue were now flushed to the destination broker  
> properly.  Seems to work as designed!  The only real problem I can  
> report is that when a destination broker dies due to memory starvation  
> the store may be left in a state which makes it subsequently
> unrecoverable.  But I am unable to reproduce on this new larger  
> machine.  Case closed.  Thanks, and sorry for the noise.

Can you provide further details on the store unrecoverability you
encountered?


---------------------------------------------------------------------
Apache Qpid - AMQP Messaging Implementation
Project:      http://qpid.apache.org
Use/Interact: mailto:users-subscribe@qpid.apache.org


Re: store and forward queue recovery problem

Posted by Charles Woerner <cw...@demandbase.com>.
Wow, and never mind.  As I wrote that, the queue stats updated and
apparently the link was re-established and the entire contents of the  
store and forward queue were now flushed to the destination broker  
properly.  Seems to work as designed!  The only real problem I can  
report is that when a destination broker dies due to memory starvation  
the store may be left in a state which makes it subsequently
unrecoverable.  But I am unable to reproduce on this new larger  
machine.  Case closed.  Thanks, and sorry for the noise.

__

Charles Woerner  | cwoerner@demandbase.com |   demandbase







Re: store and forward queue recovery problem

Posted by Charles Woerner <cw...@demandbase.com>.
Oops, those are the stats for the store-and-forward broker...  the  
destination broker stats (after restart) are:

qpid: show queue 111
Object of type org.apache.qpid.broker:queue: (last sample time:  
02:42:59)
     Type       Element                111
     ====================================
     property   vhostRef               103
     property   name                   hits
     property   durable                True
     property   autoDelete             False
     property   exclusive              False
     property   arguments              {u'qpid.max_size': 1048576000L,  
u'qpid.file_size': 20480L, u'qpid.file_count': 64L,  
u'qpid.policy_type': u'flow_to_disk'}
     statistic  msgTotalEnqueues       2402709 messages
     statistic  msgTotalDequeues       0
     statistic  msgTxnEnqueues         0
     statistic  msgTxnDequeues         0
     statistic  msgPersistEnqueues     2402709
     statistic  msgPersistDequeues     0
     statistic  msgDepth               2402709
     statistic  byteDepth              2024285350 octets
     statistic  byteTotalEnqueues      2024285350
     statistic  byteTotalDequeues      0
     statistic  byteTxnEnqueues        0
     statistic  byteTxnDequeues        0
     statistic  bytePersistEnqueues    2024285350
     statistic  bytePersistDequeues    0
     statistic  consumerCount          0 consumers
     statistic  consumerCountHigh      0
     statistic  consumerCountLow       0
     statistic  bindingCount           2 bindings
     statistic  bindingCountHigh       2
     statistic  bindingCountLow        2
     statistic  unackedMessages        0 messages
     statistic  unackedMessagesHigh    0
     statistic  unackedMessagesLow     0
     statistic  messageLatencySamples  0
     statistic  messageLatencyMin      0
     statistic  messageLatencyMax      0
     statistic  messageLatencyAverage  0
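
Note that byteDepth (2,024,285,350 octets) is nearly double the
configured qpid.max_size of 1,048,576,000 octets - about 1.93x the
policy limit.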

__

Charles Woerner  | cwoerner@demandbase.com |   demandbase







Re: store and forward queue recovery problem

Posted by Charles Woerner <cw...@demandbase.com>.
To follow up...

> 2) Enqueue 1048576000 or more bytes of data into the local queue on
> www-1, watch it flow across the federation link to the hits queue on
> queue-1.

Actually, the byteDepth of the destination queue ends up exceeding the
configured max-queue-size limit by a great deal - then the server dies,
and when it is restarted the store-and-forward brokers are never able
to forward messages over the federation link again.  For instance, at
the time I brought the destination broker back up, the recovered queue
stats were as follows:

qpid: show queue 113
Object of type org.apache.qpid.broker:queue: (last sample time:  
02:43:01)
     Type       Element                113
     ====================================
     property   vhostRef               105
     property   name                   hits_local
     property   durable                True
     property   autoDelete             False
     property   exclusive              False
     property   arguments              {u'qpid.file_size': 5120L,  
u'qpid.file_count': 32L, u'qpid.policy_type': u'flow_to_disk'}
     statistic  msgTotalEnqueues       3309228 messages
     statistic  msgTotalDequeues       3309228
     statistic  msgTxnEnqueues         0
     statistic  msgTxnDequeues         0
     statistic  msgPersistEnqueues     3309228
     statistic  msgPersistDequeues     3309228
     statistic  msgDepth               0
     statistic  byteDepth              0 octets
     statistic  byteTotalEnqueues      2788025056
     statistic  byteTotalDequeues      2788025056
     statistic  byteTxnEnqueues        0
     statistic  byteTxnDequeues        0
     statistic  bytePersistEnqueues    2788025056
     statistic  bytePersistDequeues    2788025056
     statistic  consumerCount          1 consumer
     statistic  consumerCountHigh      1
     statistic  consumerCountLow       1
     statistic  bindingCount           2 bindings
     statistic  bindingCountHigh       2
     statistic  bindingCountLow        2
     statistic  unackedMessages        0 messages
     statistic  unackedMessagesHigh    0
     statistic  unackedMessagesLow     0
     statistic  messageLatencySamples  0
     statistic  messageLatencyMin      0
     statistic  messageLatencyMax      0
     statistic  messageLatencyAverage  0


__

Charles Woerner  | cwoerner@demandbase.com |   demandbase