Posted to dev@qpid.apache.org by Martin Ritchie <ri...@apache.org> on 2007/09/20 14:27:18 UTC

Intermittent Test Failures [was: Re: M2 - let us try another "final" build]

On 19/09/2007, Robert Greig <ro...@gmail.com> wrote:
> On 19/09/2007, Carl Trieloff <cc...@redhat.com> wrote:
>
> > We test on list of platforms but it fails consistently (every time) on
> > the quad core RHEL5 64bit
> > Woodcrest machine
>
> I looked at the svn logs more carefully and there was a subsequent
> change by Martin on the M2.1 branch for QPID-572. I merged that this
> morning but on our continuous build machine this fails too.
>
> We are actually seeing some odd failures so we are actively
> investigating if there may be multiple problems.
>
> RG

Hi, It would be most helpful in resolving these intermittent failures
if people seeing these problems could post the Surefire reports.
Ideally the reports would be collected on the relevant JIRA, but at the
very least please email the list with the details. As I said earlier in
the week, I haven't seen the TopicSessionTest fail since I committed the
patch to the M2.1 branch. If any of you who have pointed out that
TopicSessionTest still fails have details, I would really appreciate you
sharing them so that we can work to resolve the outstanding issues.

-- 
Martin Ritchie

Re: Intermittent Test Failures [was: Re: M2 - let us try another "final" build]

Posted by Martin Ritchie <ri...@apache.org>.
On 20/09/2007, Carl Trieloff <cc...@redhat.com> wrote:
>
> > Hi, It would be most helpful in resolving these intermittent failures
> > if people seeing these problems could post the Surefire reports.
>
> I need to wait for Nuno to get something on my user-id fixed, at which
> point I will post more info.
>
> This is the current failure.
>
> Running org.apache.qpid.test.client.QueueBrowserTest
> Tests run: 2, Failures: 0, Errors: 1, Skipped: 0, Time elapsed: 60.084
> sec <<< FAILURE!

Yes, thanks Carl. If your user-id has rights to look in
qpid/java/client/target/surefire-reports, you will find a report for
the QueueBrowserTest that should have more useful details. It should at
least tell you which test in QueueBrowserTest failed.



-- 
Martin Ritchie

Re: Intermittent Test Failures [was: Re: M2 - let us try another "final" build]

Posted by Rajith Attapattu <ra...@gmail.com>.
Carl,

The "Time elapsed: 10.024" suggests that it was hanging (you have this
problem on the trunk as well). Try doing a kill -3 on the hanging JVM and
see if there is a deadlock?
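
For what it's worth, the same information a kill -3 dump gives you can
also be pulled from inside the JVM; a small helper along these lines,
called from a watchdog thread in the hanging test process, will name the
threads involved in a monitor deadlock (just a sketch, not something in
the tree):

    import java.lang.management.ManagementFactory;
    import java.lang.management.ThreadInfo;
    import java.lang.management.ThreadMXBean;

    // Reports monitor deadlocks in the JVM it runs in -- roughly the
    // "Found one Java-level deadlock" section of a kill -3 thread dump.
    public class DeadlockCheck {
        public static void report() {
            ThreadMXBean threads = ManagementFactory.getThreadMXBean();
            long[] ids = threads.findMonitorDeadlockedThreads();
            if (ids == null) {
                System.out.println("No monitor deadlock detected");
                return;
            }
            for (ThreadInfo info : threads.getThreadInfo(ids)) {
                System.out.println(info.getThreadName()
                        + " is waiting on " + info.getLockName()
                        + " held by " + info.getLockOwnerName());
            }
        }
    }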

Regards,

Rajith

On 9/21/07, Carl Trieloff <cc...@redhat.com> wrote:
>
>
> >
> > This should be fixed now.
> >
> > However we are still seeing some other occasional failures on our
> > continuous build so this isn't over yet...
> >
> >
>
> this is our current failure...
>
> Running org.apache.qpid.test.unit.client.forwardall.CombinedTest
> Tests run: 1, Failures: 0, Errors: 1, Skipped: 0, Time elapsed: 10.024
> sec <<< FAILURE!
>
> If it is different for you I will post the log.
>
> Carl.
>
>
>
>

Re: Intermittent Test Failures [was: Re: M2 - let us try another "final" build]

Posted by Robert Greig <ro...@gmail.com>.
On 24/09/2007, Aidan Skinner <ai...@gmail.com> wrote:
> On 9/24/07, Rupert Smith <ru...@googlemail.com> wrote:
>
> > Thanks for your patches. Does this mean that the tests in the client module
> > are now passing consistently, or do we still have more to do?
>
> I'm still seeing the CombinedTest error (see QPID-589) in the same way;
> I'm still looking into exactly what's going on there.

On our build box I haven't seen CombinedTest fail, but overnight the
build did hang again when running the client tests. I'm investigating
that now.

RG

Re: Intermittent Test Failures [was: Re: M2 - let us try another "final" build]

Posted by Aidan Skinner <ai...@gmail.com>.
On 9/24/07, Rupert Smith <ru...@googlemail.com> wrote:

> Thanks for your patches. Does this mean that the tests in the client module
> are now passing consistently, or do we still have more to do?

I'm still seeing the CombinedTest error (see QPID-589) in the same way;
I'm still looking into exactly what's going on there.

- Aidan

Re: Intermittent Test Failures [was: Re: M2 - let us try another "final" build]

Posted by Rupert Smith <ru...@googlemail.com>.
Hi Robert,

Thanks for your patches. Does this mean that the tests in the client module
are now passing consistently, or do we still have more to do?

On 23/09/2007, Robert Greig <ro...@gmail.com> wrote:
>
> On 22/09/2007, Robert Greig <ro...@gmail.com> wrote:
> > I managed to get a threaddump and it shows yet another deadlock
> > involving the dispatcher. Details are attached to QPID-589.
>
> Looking at the deadlock, it occurs because during session close, it
> sends Basic.Cancel for each consumer, and the Basic.Cancel-Ok handler
> (on a separate thread) calls Dispatcher.rejectPending which in turn
> tries to acquire the dispatcher lock. Sadly the dispatcher lock is
> already held by dispatcher.run(). Dispatcher.run is trying to acquire
> the messageDeliveryLock, which is already held by the close method in
> AMQSession.
>
> I couldn't spot an obvious solution involving reordering of locks.
> However it did occur to me that it was not necessary to send a
> Basic.Cancel where we are about to close the entire session (AMQP
> channel).
>
> Does anyone disagree and think we have to send Basic.Cancel?
>
> I have committed a change to the M2 branch so that it does not send
> Basic.Cancel where the session is closing and so far on our continuous
> build there have been no test failures or deadlocks. If it turns out
> that someone knows why we must send Basic.Cancel then I will obviously
> back out that change.
>
> RG
>

Re: Intermittent Test Failures [was: Re: M2 - let us try another "final" build]

Posted by Rafael Schloming <ra...@redhat.com>.
Martin Ritchie wrote:
> On 24/09/2007, Rafael Schloming <ra...@redhat.com> wrote:
>> Robert Greig wrote:
>>> What do you mean by "add the release method from 0-10"?
>>>
>>>> I do agree that processing prefetched messages is not the ideal
>>>> behavior, however it is the only one available if you want to strictly
>>>> adhere to AMQP semantics, and I would expect it to also comply with JMS
>>>> semantics presuming that you block the session.close() call until
>>>> processing of prefetched messages is complete.
>>> Is it not really up to the client developer not to use prefetch if he
>>> wants to use NO_ACK, not call consumer.close() and still quiesce the
>>> session? For example, how do you process prefetched messages for
>>> consumers that do not have a message listener, i.e. where messages are
>>> processed using receive()?
>> In 0-8 both prefetch limits are ignored when no-ack is true, so there is
>> no way to turn off prefetch when you're using no-ack. In 0-10 it is
>> possible to use flow control with no-ack since the flow control dialog
>> has been decoupled from message acknowledgment, however it would
>> entirely defeat the purpose of no-ack to not be able to use prefetch.
>>
>> Also it doesn't really matter whether you use prefetch since even if you
>> explicitly request each message to be sent prior to processing, close()
>> could still be called after that request is sent. In other words not
>> using prefetch is the same as having a prefetch of 1 which doesn't
>> eliminate the problem it simply reduces the impact to a single message
>> at the cost of reasonable performance.
>>
>> Regarding prefetching for consumers that do not have a message listener,
>> you're right that it is inherently unsafe without the ability to release
>> messages.
>>
>> --Rafael
> 
> Is it not possible to use message.reject in 0-8 to return messages?
> IIRC the strict AMQP 0-8 reject causes that consumer never to see the
> message again. Not a huge problem as the consumer is closing.

I don't think this would be interoperable. According to the spec 
language "A rejected message MAY be discarded or dead-lettered, not 
necessarily passed to another client."

--Rafael

Re: Intermittent Test Failures [was: Re: M2 - let us try another "final" build]

Posted by Martin Ritchie <ri...@apache.org>.
On 24/09/2007, Rafael Schloming <ra...@redhat.com> wrote:
> Robert Greig wrote:
> > What do you mean by "add the release method from 0-10"?
> >
> >> I do agree that processing prefetched messages is not the ideal
> >> behavior, however it is the only one available if you want to strictly
> >> adhere to AMQP semantics, and I would expect it to also comply with JMS
> >> semantics presuming that you block the session.close() call until
> >> processing of prefetched messages is complete.
> >
> > Is it not really up to the client developer not to use prefetch if he
> > wants to use NO_ACK, not call consumer.close() and still quiesce the
> > session? For example, how do you process prefetched messages for
> > consumers that do not have a message listener, i.e. where messages are
> > processed using receive()?
>
> In 0-8 both prefetch limits are ignored when no-ack is true, so there is
> no way to turn off prefetch when you're using no-ack. In 0-10 it is
> possible to use flow control with no-ack since the flow control dialog
> has been decoupled from message acknowledgment, however it would
> entirely defeat the purpose of no-ack to not be able to use prefetch.
>
> Also it doesn't really matter whether you use prefetch since even if you
> explicitly request each message to be sent prior to processing, close()
> could still be called after that request is sent. In other words not
> using prefetch is the same as having a prefetch of 1 which doesn't
> eliminate the problem it simply reduces the impact to a single message
> at the cost of reasonable performance.
>
> Regarding prefetching for consumers that do not have a message listener,
> you're right that it is inherently unsafe without the ability to release
> messages.
>
> --Rafael

Is it not possible to use message.reject in 0-8 to return messages?
IIRC the strict AMQP 0-8 reject causes that consumer never to see the
message again. Not a huge problem as the consumer is closing.

-- 
Martin Ritchie

Re: Intermittent Test Failures [was: Re: M2 - let us try another "final" build]

Posted by Rafael Schloming <ra...@redhat.com>.
Robert Greig wrote:
> What do you mean by "add the release method from 0-10"?
> 
>> I do agree that processing prefetched messages is not the ideal
>> behavior, however it is the only one available if you want to strictly
>> adhere to AMQP semantics, and I would expect it to also comply with JMS
>> semantics presuming that you block the session.close() call until
>> processing of prefetched messages is complete.
> 
> Is it not really up to the client developer not to use prefetch if he
> wants to use NO_ACK, not call consumer.close() and still quiesce the
> session? For example, how do you process prefetched messages for
> consumers that do not have a message listener, i.e. where messages are
> processed using receive()?

In 0-8 both prefetch limits are ignored when no-ack is true, so there is 
no way to turn off prefetch when you're using no-ack. In 0-10 it is 
possible to use flow control with no-ack since the flow control dialog 
has been decoupled from message acknowledgment, however it would 
entirely defeat the purpose of no-ack to not be able to use prefetch.

Also it doesn't really matter whether you use prefetch since even if you 
explicitly request each message to be sent prior to processing, close() 
could still be called after that request is sent. In other words not 
using prefetch is the same as having a prefetch of 1 which doesn't 
eliminate the problem it simply reduces the impact to a single message 
at the cost of reasonable performance.

Regarding prefetching for consumers that do not have a message listener, 
you're right that it is inherently unsafe without the ability to release 
messages.

--Rafael

Re: Intermittent Test Failures [was: Re: M2 - let us try another "final" build]

Posted by Robert Greig <ro...@gmail.com>.
What do you mean by "add the release method from 0-10"?

> I do agree that processing prefetched messages is not the ideal
> behavior, however it is the only one available if you want to strictly
> adhere to AMQP semantics, and I would expect it to also comply with JMS
> semantics presuming that you block the session.close() call until
> processing of prefetched messages is complete.

Is it not really up to the client developer not to use prefetch if he
wants to use NO_ACK, not call consumer.close() and still quiesce the
session? For example, how do you process prefetched messages for
consumers that do not have a message listener, i.e. where messages are
processed using receive()?

RG

Re: Intermittent Test Failures [was: Re: M2 - let us try another "final" build]

Posted by Rafael Schloming <ra...@redhat.com>.
Robert Greig wrote:
> On 25/09/2007, Rafael Schloming <ra...@redhat.com> wrote:
> 
>> The JMS semantics are pretty clear. Close at the consumer or session
>> level is supposed to block until any in-progress message listeners are
>> finished. If there are prefetched messages remaining after the
>> in-progress listeners are finished, they either need to be returned to
>> the server (i.e. option 2 except without abusing reject), or processed
>> (option 3).
> 
> Does the JMS spec have any concept of prefetch? I will have to check.
> In progress message listeners to me means processing of a single
> message.
> 
>> Option 3 seems like a reasonable extension of JMS semantics in the
>> presence of prefetch. Option 2 (without abusing reject) seems the most
>> correct. I'm not sure why you'd ever want to do option (1). It is
>> basically the same as option (2) except any messages prefetched for that
>> consumer are now stranded for the duration of the session. This doesn't
>> seem very friendly, and certainly wouldn't be a good default.
> 
> I think 1 and 2 are different. To me, 1 would be the same as the
> ctrl-c behaviour i.e. there would be no acks so the messages would be
> requeued. 2 is the client saying "I don't want these messages".
> Ironically, I would say that 2 is the one I would probably never want
> to use since I would probably only want to reject messages I have had
> a look at.
> 
> I am not sure what you mean by "stranded for the duration of the
> session". The messages would be requeued and if there was another
> consumer they would be delivered to that consumer, no?

They can't be requeued if the client can still process them.

> In fact now that I look at what I've written I am shocked we do (2).
> It seems very wrong to me.

By option 2 (without abusing reject) I meant releasing the messages, 
i.e. sending an indicator to the broker that I'm never going to ack, 
reject, or process the given messages and it is safe to deliver them to 
any client without indicating that they may already have been processed.

This is actually fairly close to how the java broker interprets reject, 
even though it is not compliant with the spec definition of reject.

--Rafael

Re: Intermittent Test Failures [was: Re: M2 - let us try another "final" build]

Posted by Robert Greig <ro...@gmail.com>.
On 25/09/2007, Rafael Schloming <ra...@redhat.com> wrote:

> The JMS semantics are pretty clear. Close at the consumer or session
> level is supposed to block until any in-progress message listeners are
> finished. If there are prefetched messages remaining after the
> in-progress listeners are finished, they either need to be returned to
> the server (i.e. option 2 except without abusing reject), or processed
> (option 3).

Does the JMS spec have any concept of prefetch? I will have to check.
In progress message listeners to me means processing of a single
message.

> Option 3 seems like a reasonable extension of JMS semantics in the
> presence of prefetch. Option 2 (without abusing reject) seems the most
> correct. I'm not sure why you'd ever want to do option (1). It is
> basically the same as option (2) except any messages prefetched for that
> consumer are now stranded for the duration of the session. This doesn't
> seem very friendly, and certainly wouldn't be a good default.

I think 1 and 2 are different. To me, 1 would be the same as the
ctrl-c behaviour i.e. there would be no acks so the messages would be
requeued. 2 is the client saying "I don't want these messages".
Ironically, I would say that 2 is the one I would probably never want
to use since I would probably only want to reject messages I have had
a look at.

I am not sure what you mean by "stranded for the duration of the
session". The messages would be requeued and if there was another
consumer they would be delivered to that consumer, no?

In fact now that I look at what I've written I am shocked we do (2).
It seems very wrong to me.

RG

Re: Intermittent Test Failures [was: Re: M2 - let us try another "final" build]

Posted by Rafael Schloming <ra...@redhat.com>.
Robert Greig wrote:
> On 25/09/2007, Rafael Schloming <ra...@redhat.com> wrote:
> 
>> It's not arbitrary. An ack informs you that a message has been
>> processed, but you can't infer one way or another from the absence of an
>> ack, therefore you *have* to deal with the possibility that these
>> messages have been processed already regardless of whether you do it by
>> setting the redelivered flag or by DLQing the message. Either way I
>> don't think it's acceptable for a routine close of a consumer to cause
>> redelivery of a slew of messages that may already have been processed.
>> It would, for example, be unacceptable to any application that requires
>> human intervention to deal with redelivered messages.
> 
> I think it is wrong to say you can DLQ a message because you have not
> received an ack. A DLQ is for cases where the client has rejected a
> message explicitly or you cannot deliver a message.

That's not what I said. What I said was the broker must have the option 
to DLQ a message if the client repeatedly terminates without 
acknowledging or releasing the message. This is something that could 
easily happen if normal termination results in unacked messages the same 
way crashing does.

In other words, what I'm saying is that it is a bad thing if the broker 
can't tell the difference between normal termination and a crash.
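
To make that concrete, the kind of broker-side policy I have in mind is
roughly the following; the class and the per-message delivery counting
are hypothetical, not something either qpid broker does today:

    import java.util.Map;
    import java.util.concurrent.ConcurrentHashMap;

    // Hypothetical broker-side policy: an unacked message that keeps coming
    // back after abnormal client terminations is eventually dead-lettered
    // instead of being requeued (with redelivered=true) indefinitely.
    public class RedeliveryPolicy {
        private final int maxDeliveries;
        private final Map<String, Integer> deliveryCounts =
                new ConcurrentHashMap<String, Integer>();

        public RedeliveryPolicy(int maxDeliveries) {
            this.maxDeliveries = maxDeliveries;
        }

        /** Returns true if the message should go to the DLQ, false to requeue it. */
        public boolean shouldDeadLetter(String messageId) {
            Integer seen = deliveryCounts.get(messageId);
            int deliveries = (seen == null) ? 1 : seen + 1;
            deliveryCounts.put(messageId, deliveries);
            return deliveries > maxDeliveries;
        }
    }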

> DLQing a message because of lack of ack hugely complicates recovery
> from the application's perspective. Consider the case of an app that
> crashes for some reason during processing and does not send an ack for
> a message.
> 
> If that message were DLQ'd then what would the app do upon startup? It
> would have to know to check a DLQ for messages before consuming from
> the normal queue, or it would require operator intervention to move
> the messages from the DLQ back onto the normal queue. Certainly in the
> environment that I work in, that would be unacceptable to most
> applications since it would lengthen and complicate the recovery
> process hugely.

How exactly an application wants to deal with recovery probably depends 
on the application. For some it may be more convenient for messages to 
be on the same queue with a flag set, for others it may be more 
convenient to automatically route them to a different queue. I don't 
think the difference is material to my argument.

> To me an ack is a lower level concern - did you get the message, not
> "I can't process the message".

I'm not sure I understand this. A message level ack means that the 
message was processed, not that the message was received. Repeated 
crashing of a client is what means "I can't process the message."

--Rafael

Re: Intermittent Test Failures [was: Re: M2 - let us try another "final" build]

Posted by Robert Greig <ro...@gmail.com>.
On 25/09/2007, Rafael Schloming <ra...@redhat.com> wrote:

> It's not arbitrary. An ack informs you that a message has been
> processed, but you can't infer one way or another from the absence of an
> ack, therefore you *have* to deal with the possibility that these
> messages have been processed already regardless of whether you do it by
> setting the redelivered flag or by DLQing the message. Either way I
> don't think it's acceptable for a routine close of a consumer to cause
> redelivery of a slew of messages that may already have been processed.
> It would, for example, be unacceptable to any application that requires
> human intervention to deal with redelivered messages.

I think it is wrong to say you can DLQ a message because you have not
received an ack. A DLQ is for cases where the client has rejected a
message explicitly or you cannot deliver a message.

DLQing a message because of lack of ack hugely complicates recovery
from the application's perspective. Consider the case of an app that
crashes for some reason during processing and does not send an ack for
a message.

If that message were DLQ'd then what would the app do upon startup? It
would have to know to check a DLQ for messages before consuming from
the normal queue, or it would require operator intervention to move
the messages from the DLQ back onto the normal queue. Certainly in the
environment that I work in, that would be unacceptable to most
applications since it would lengthen and complicate the recovery
process hugely.

To me an ack is a lower level concern - did you get the message, not
"I can't process the message".

RG

Re: Intermittent Test Failures [was: Re: M2 - let us try another "final" build]

Posted by Rafael Schloming <ra...@redhat.com>.
Robert Greig wrote:
> On 25/09/2007, Gordon Sim <gs...@redhat.com> wrote:
> 
>> I think by closing the session the application is saying it wants to
>> quit. Perhaps the close on the MessageConsumer could do something like
>> this... i.e. don't return from that close until all the messages have
>> been pumped through the listener?
> 
> I agree. Currently if I understand the code correctly (Martin correct
> me if I am wrong) a close() on the message consumer will reject all
> prefetched messages. We could perhaps have some extended JMS methods
> on the consumer to:
> 
> 1) closeImmediately (maybe closeJFDI() :-))
> 2) closeRejecting (current behaviour)
> 3) closeAfterProcessing - close after processing any prefetched messages
> 
> I personally think that the default close() should be (1) and that
> closing a session should do (1) on any unclosed consumers.

The JMS semantics are pretty clear. Close at the consumer or session 
level is supposed to block until any in-progress message listeners are 
finished. If there are prefetched messages remaining after the 
in-progress listeners are finished, they either need to be returned to 
the server (i.e. option 2 except without abusing reject), or processed 
(option 3).

Option 3 seems like a reasonable extension of JMS semantics in the 
presence of prefetch. Option 2 (without abusing reject) seems the most 
correct. I'm not sure why you'd ever want to do option (1). It is 
basically the same as option (2) except any messages prefetched for that 
consumer are now stranded for the duration of the session. This doesn't 
seem very friendly, and certainly wouldn't be a good default.

--Rafael


Re: Intermittent Test Failures [was: Re: M2 - let us try another "final" build]

Posted by Robert Greig <ro...@gmail.com>.
On 25/09/2007, Gordon Sim <gs...@redhat.com> wrote:

> I think by closing the session the application is saying it wants to
> quit. Perhaps the close on the MessageConsumer could do something like
> this... i.e. don't return from that close until all the messages have
> been pumped through the listener?

I agree. Currently if I understand the code correctly (Martin correct
me if I am wrong) a close() on the message consumer will reject all
prefetched messages. We could perhaps have some extended JMS methods
on the consumer to:

1) closeImmediately (maybe closeJFDI() :-))
2) closeRejecting (current behaviour)
3) closeAfterProcessing - close after processing any prefetched messages

I personally think that the default close() should be (1) and that
closing a session should do (1) on any unclosed consumers.
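
For illustration, the three variants might look something like this on an
extended consumer interface (purely a sketch; none of these methods exist
on the client today):

    import javax.jms.JMSException;
    import javax.jms.MessageConsumer;

    // Sketch of the extended close() variants listed above; illustrative only.
    public interface ExtendedMessageConsumer extends MessageConsumer {
        /** (1) Stop delivery at once, leaving prefetched messages unacked. */
        void closeImmediately() throws JMSException;

        /** (2) Current behaviour: reject any prefetched messages back to the broker. */
        void closeRejecting() throws JMSException;

        /** (3) Block until prefetched messages have been pushed through the listener. */
        void closeAfterProcessing() throws JMSException;
    }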

RG

Re: Intermittent Test Failures [was: Re: M2 - let us try another "final" build]

Posted by Robert Greig <ro...@gmail.com>.
On 26/09/2007, Gordon Sim <gs...@redhat.com> wrote:

> I think I would find it odd if an implementation kept pumping through
> messages after I called close, particularly calling it on a Session.
> That should at least be an option; after all perhaps the session is
> being closed because processing has to be interrupted due to
> unavailability of some resource, or due to a user action.

I agree. Having been thinking about this, my issue is that although
from a protocol perspective prefetched messages are "delivered", I
view prefetching as an optimisation, and from the client developer's
perspective those messages can really be thought of as being on the broker.
Therefore if you still get messages when you close() that is really
like continuing to deliver messages from the broker, at least that is
how it appears to the client developer.

RG

Re: Intermittent Test Failures [was: Re: M2 - let us try another "final" build]

Posted by Rafael Schloming <ra...@redhat.com>.
Gordon Sim wrote:
>> The idea is to fail fast rather than fail subtly by using reject in a 
>> non standard way. For interoperability I think that continuing to 
>> process prefetched messages is the way to go.
> 
> I think I would find it odd if an implementation kept pumping through 
> messages after I called close, particularly calling it on a Session. 
> That should at least be an option; after all perhaps the session is 
> being closed because processing has to be interrupted due to 
> unavailability of some resource, or due to a user action.

We could always indicate to the listener that the session/consumer is 
closing, e.g. set a header indicating that the message passed to the 
listener was prefetched. That way the application itself could decide 
how urgent the need to close is and whether throwing away the message is 
warranted or not. I think we'd have to do something like this for no-ack 
since, as you pointed out, there is no way to release those messages 
back to the broker.
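
Something like the following is what I mean from the application's side;
the property name here is made up, just to show the shape of it:

    import javax.jms.JMSException;
    import javax.jms.Message;
    import javax.jms.MessageListener;

    // Sketch of a listener that checks a (made-up) header the client could
    // set on messages delivered while the session/consumer is closing.
    public class ClosingAwareListener implements MessageListener {
        public void onMessage(Message message) {
            try {
                if (message.propertyExists("QPID_DELIVERED_DURING_CLOSE")
                        && message.getBooleanProperty("QPID_DELIVERED_DURING_CLOSE")) {
                    // The session is going away: the application decides whether
                    // this message is still worth processing or can be dropped.
                    return;
                }
                process(message);
            } catch (JMSException e) {
                // handle or log as appropriate
            }
        }

        private void process(Message message) {
            // normal application processing
        }
    }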

--Rafael

Re: Intermittent Test Failures [was: Re: M2 - let us try another "final" build]

Posted by Gordon Sim <gs...@redhat.com>.
Rafael Schloming wrote:
> Gordon Sim wrote:
>> The reliability model in my view sets the expectation that a message 
>> stays on a queue until acked or until explicitly rejected.
> 
> I'm really not suggesting that unacked messages should be arbitrarily 
> dequeued willy-nilly. What I'm suggesting is that brokers should have 
> room to detect that a particular message is causing a client to crash. 
> See my other email for more details on this.

I actually don't see how brokers can know why a client is crashing, and 
feel that dealing with poisoned messages is something for the 
application/system to handle (using reject or admin functions).

If a broker implementation decided to offer something like you are 
describing, that would be fine in my view, but it would be a non-standard 
option (and should therefore be something that can be turned off).

However I think allowing or suggesting that possibility in the spec 
would be a bad thing (unless very clearly qualified). It's important for 
applications to be able to rely on the fact that unacked messages remain 
on their original queue.

This was really a bit of a tangent though, and that issue is probably 
more relevant for the AMQP WG.

[...]
> In my view normal open/close of 
> sessions and consumers should never cause redelivery of messages. C-c, 
> kill -9, network outages are all another matter of course, but IMHO 
> session.close() or consumer.close() is the thing that you try *before* 
> resorting to C-c or kill -9.

For 0-10 I agree with you. It makes sense to release messages explicitly 
during a clean shutdown, and Session.close() and MessageConsumer.close() 
seem to be the places to do that. (You are right of course that these are 
equivalent based on the Javadoc).

[...]
>> However I don't think that retrofitting release is any better than 
>> using reject in a way that may not be portable. Neither case is 
>> guaranteed to work with other brokers, but adding a new method seems 
>> even less likely to be interoperable.
> 
> The idea is to fail fast rather than fail subtly by using reject in a 
> non standard way. For interoperability I think that continuing to 
> process prefetched messages is the way to go.

I think I would find it odd if an implementation kept pumping through 
messages after I called close, particularly calling it on a Session. 
That should at least be an option; after all perhaps the session is 
being closed because processing has to be interrupted due to 
unavailability of some resource, or due to a user action.

Re: Intermittent Test Failures [was: Re: M2 - let us try another "final" build]

Posted by Rafael Schloming <ra...@redhat.com>.
Gordon Sim wrote:
> Rafael Schloming wrote:
>> Gordon Sim wrote:
>>> Rafael Schloming wrote:
>>>> I don't think this is the only difference. A broker can DLQ unacked 
>>>> messages, but not released messages.
>>>
>>> What are the rules around that? I would expect unacked messages to be 
>>> left on the queue, as the delivery hasn't succeeded. DLQing them 
>>> seems quite wrong to me. Certainly neither of the qpid brokers do that.
>>
>> I'm not sure the spec has language explicitly stating this, but I had 
>> always assumed it was an option. If you can't do this then a message 
>> that causes a client to repeatedly crash before it has a chance to 
>> reject will effectively block up a queue forever.
> 
> I think that is an issue that applications should deal with (or else 
> require administrator intervention). To allow unacked messages to be 
> dequeued seems an extremely bad idea to me unless there are precise rules 
> around it.
> 
> The reliability model in my view sets the expectation that a message 
> stays on a queue until acked or until explicitly rejected.

I'm really not suggesting that unacked messages should be arbitrarily 
dequeued willy-nilly. What I'm suggesting is that brokers should have 
room to detect that a particular message is causing a client to crash. 
See my other email for more details on this.

>> There is also another difference. Released messages will be available 
>> to the broker immediately, whereas unacked messages won't be available 
>> until the session closes, so a client impl can't depend on recovery of 
>> unacked messages for cleanup when it closes a consumer since those 
>> unacked messages would be stranded with that client until the whole 
>> session closes.
> 
> Yes, I agree that release is good for early indication that a message is 
> not required, and would be useful for handling MessageConsumer.close().
> 
>>> As the ack is a key reliability mechanism, allowing arbitrary DLQ 
>>> decisions based on unacked deliveries seems to me to undermine the 
>>> ack-based reliability model.
>>
>> It's not arbitrary. An ack informs you that a message has been 
>> processed, but you can't infer one way or another from the absence of 
>> an ack, therefore you *have* to deal with the possibility that these 
>> messages have been processed already regardless of whether you do it 
>> by setting the redelivered flag or by DLQing the message. 
> 
> What seems arbitrary is the decision to either leave it on the original 
> queue with the redelivered flag set or DLQ the message. It's the latter 
> option I'm against; I don't think it's valid behaviour.
> 
>> Either way I don't think it's acceptable for a routine close of a 
>> consumer to cause redelivery of a slew of messages that may already 
>> have been processed. It would, for example, be unacceptable to any 
>> application that requires human intervention to deal with redelivered 
>> messages.
> 
> I agree that minimising the number of messages that the broker marks as 
> redelivered is desirable. As I said in the first mail I also think that 
> release is a valuable addition to cater for the case where there is no 
> ambiguity about processed state. My original point was that I didn't see 
> much benefit in retrofitting it to older versions of the protocol.

I would state this a bit more strongly. In my view normal open/close of 
sessions and consumers should never cause redelivery of messages. C-c, 
kill -9, network outages are all another matter of course, but IMHO 
session.close() or consumer.close() is the thing that you try *before* 
resorting to C-c or kill -9.

> (Btw, we have been talking about session.close here, aren't we? i.e. not 
> MessageConsumer.close(), which I think would be a better place for 
> handling any releasing.)

They are pretty much the same. Session.close() is defined the same way 
as MessageConsumer.close() except it operates on all MessageConsumers, 
not just the one.

>>> [...]
>>>>> In the case of the no-ack mode, the whole aim is to allow 
>>>>> optimisation of the case where redelivery is not required (e.g. 
>>>>> often where a client has its own exclusive queue representing a 
>>>>> subscription).
>>>>
>>>> That's a good point. Releasing prefetched messages in no-ack mode 
>>>> won't actually do anything since they may have already been 
>>>> discarded. Given that I would fall back to processing all prefetched 
>>>> messages in the case of no-ack and letting the user choose to throw 
>>>> them away if that is appropriate for the application.
>>>
>>> I think by closing the session the application is saying it wants to 
>>> quit. Perhaps the close on the MessageConsumer could do something 
>>> like this... i.e. don't return from that close until all the messages 
>>> have been pumped through the listener?
>>
>> I think this would be reasonable if you wanted to avoid back-porting 
>> release to 0-8, but as the code already mis-uses reject to indicate 
>> release, I'm not sure there is much point to avoiding it.
> 
> My point was that MessageConsumer.close would perhaps be a better place 
> to try and handle the closing of consumer state (being under the 
> assumption that the debate thus far had been focused on Session.close()).

Yes, I think the JMS semantics pretty much imply that Session.close() 
calls MessageConsumer.close() for all open consumers on the session.

> However I don't think that retrofitting release is any better than using 
> reject in a way that may not be portable. Neither case is guaranteed to 
> work with other brokers, but adding a new method seems even less likely 
> to be interoperable.

The idea is to fail fast rather than fail subtly by using reject in a 
non standard way. For interoperability I think that continuing to 
process prefetched messages is the way to go.

--Rafael

Re: Intermittent Test Failures [was: Re: M2 - let us try another "final" build]

Posted by Gordon Sim <gs...@redhat.com>.
Rafael Schloming wrote:
> Gordon Sim wrote:
>> Rafael Schloming wrote:
>>> I don't think this is the only difference. A broker can DLQ unacked 
>>> messages, but not released messages.
>>
>> What are the rules around that? I would expect unacked messages to be 
>> left on the queue, as the delivery hasn't succeeded. DLQing them seems 
>> quite wrong to me. Certainly neither of the qpid brokers do that.
> 
> I'm not sure the spec has language explicitly stating this, but I had 
> always assumed it was an option. If you can't do this then a message 
> that causes a client to repeatedly crash before it has a chance to 
> reject will effectively block up a queue forever.

I think that is an issue that applications should deal with (or else 
require administrator intervention). To allow unacked messages to be 
dequeued seems an extremely bad idea to me unless there are precise rules 
around it.

The reliability model in my view sets the expectation that a message 
stays on a queue until acked or until explicitly rejected.

> There is also another difference. Released messages will be available to 
> the broker immediately, whereas unacked messages won't be available 
> until the session closes, so a client impl can't depend on recovery of 
> unacked messages for cleanup when it closes a consumer since those 
> unacked messages would be stranded with that client until the whole 
> session closes.

Yes, I agree that release is good for early indication that a message is 
not required, and would be useful for handling MessageConsumer.close().

>> As the ack is a key reliability mechanism, allowing arbitrary DLQ 
>> decisions based on unacked deliveries seems to me to undermine the 
>> ack-based reliability model.
> 
> It's not arbitrary. An ack informs you that a message has been 
> processed, but you can't infer one way or another from the absence of an 
> ack, therefore you *have* to deal with the possibility that these 
> messages have been processed already regardless of whether you do it by 
> setting the redelivered flag or by DLQing the message. 

What seems arbitrary is the decision to either leave it on the original 
queue with the redelivered flag set or DLQ the message. It's the latter 
option I'm against; I don't think it's valid behaviour.

> Either way I 
> don't think it's acceptable for a routine close of a consumer to cause 
> redelivery of a slew of messages that may already have been processed. 
> It would, for example, be unacceptable to any application that requires 
> human intervention to deal with redelivered messages.

I agree that minimising the number of messages that the broker marks as 
redelivered is desirable. As I said in the first mail I also think that 
release is a valuable addition to cater for the case where there is no 
ambiguity about processed state. My original point was that I didn't see 
much benefit in retrofitting it to older versions of the protocol.

(Btw, we have been talking about session.close here, aren't we? i.e. not 
MessageConsumer.close(), which I think would be a better place for 
handling any releasing.)

>> [...]
>>>> In the case of the no-ack mode, the whole aim is to allow 
>>>> optimisation of the case where redelivery is not required (e.g. 
>>>> often where a client has its own exclusive queue representing a 
>>>> subscription).
>>>
>>> That's a good point. Releasing prefetched messages in no-ack mode 
>>> won't actually do anything since they may have already been 
>>> discarded. Given that I would fall back to processing all prefetched 
>>> messages in the case of no-ack and letting the user choose to throw 
>>> them away if that is appropriate for the application.
>>
>> I think by closing the session the application is saying it wants to 
>> quit. Perhaps the close on the MessageConsumer could do something like 
>> this... i.e. don't return from that close until all the messages have 
>> been pumped through the listener?
> 
> I think this would be reasonable if you wanted to avoid back-porting 
> release to 0-8, but as the code already mis-uses reject to indicate 
> release, I'm not sure there is much point to avoiding it.

My point was that MessageConsumer.close would perhaps be a better place 
to try and handle the closing of consumer state (being under the 
assumption that the debate thus far had been focused on Session.close()).

However I don't think that retrofitting release is any better than using 
reject in a way that may not be portable. Neither case is guaranteed to 
work with other brokers, but adding a new method seems even less likely 
to be interoperable.


Re: Intermittent Test Failures [was: Re: M2 - let us try another "final" build]

Posted by Rafael Schloming <ra...@redhat.com>.
Gordon Sim wrote:
> Rafael Schloming wrote:
>> Gordon Sim wrote:
>>> The only difference between an explicit 'release' and not 
>>> acknowledging a message is that the redelivered flag will be set in 
>>> the latter case, but not the former.
>>
>> I don't think this is the only difference. A broker can DLQ unacked 
>> messages, but not released messages.
> 
> What are the rules around that? I would expect unacked messages to be 
> left on the queue, as the delivery hasn't succeeded. DLQing them seems 
> quite wrong to me. Certainly neither of the qpid brokers do that.

I'm not sure the spec has language explicitly stating this, but I had 
always assumed it was an option. If you can't do this then a message 
that causes a client to repeatedly crash before it has a chance to 
reject will effectively block up a queue forever.

There is also another difference. Released messages will be available to 
the broker immediately, whereas unacked messages won't be available 
until the session closes, so a client impl can't depend on recovery of 
unacked messages for cleanup when it closes a consumer since those 
unacked messages would be stranded with that client until the whole 
session closes.

> As the ack is a key reliability mechanism, allowing arbitrary DLQ 
> decisions based on unacked deliveries seems to me to undermine the 
> ack-based reliability model.

It's not arbitrary. An ack informs you that a message has been 
processed, but you can't infer one way or another from the absence of an 
ack, therefore you *have* to deal with the possibility that these 
messages have been processed already regardless of whether you do it by 
setting the redelivered flag or by DLQing the message. Either way I 
don't think it's acceptable for a routine close of a consumer to cause 
redelivery of a slew of messages that may already have been processed. 
It would, for example, be unacceptable to any application that requires 
human intervention to deal with redelivered messages.

> [...]
>>> In the case of the no-ack mode, the whole aim is to allow 
>>> optimisation of the case where redelivery is not required (e.g. often 
>>> where a client has its own exclusive queue representing a subscription).
>>
>> That's a good point. Releasing prefetched messages in no-ack mode 
>> won't actually do anything since they may have already been discarded. 
>> Given that I would fall back to processing all prefetched messages in 
>> the case of no-ack and letting the user choose to throw them away if 
>> that is appropriate for the application.
> 
> I think by closing the session the application is saying it wants to 
> quit. Perhaps the close on the MessageConsumer could do something like 
> this... i.e. don't return from that close until all the messages have 
> been pumped through the listener?

I think this would be reasonable if you wanted to avoid back-porting 
release to 0-8, but as the code already mis-uses reject to indicate 
release, I'm not sure there is much point to avoiding it.

--Rafael

Re: Intermittent Test Failures [was: Re: M2 - let us try another "final" build]

Posted by Gordon Sim <gs...@redhat.com>.
Rafael Schloming wrote:
> Gordon Sim wrote:
>> The only difference between an explicit 'release' and not 
>> acknowledging a message is that the redelivered flag will be set in 
>> the latter case, but not the former.
> 
> I don't think this is the only difference. A broker can DLQ unacked 
> messages, but not released messages.

What are the rules around that? I would expect unacked messages to be 
left on the queue, as the delivery hasn't succeeded. DLQing them seems 
quite wrong to me. Certainly neither of the qpid brokers do that.

As the ack is a key reliability mechanism, allowing arbitrary DLQ 
decisions based on unacked deliveries seems to me to undermine the 
ack-based reliability model.

[...]
>> In the case of the no-ack mode, the whole aim is to allow optimisation 
>> of the case where redelivery is not required (e.g. often where a 
>> client has its own exclusive queue representing a subscription).
> 
> That's a good point. Releasing prefetched messages in no-ack mode won't 
> actually do anything since they may have already been discarded. Given 
> that I would fall back to processing all prefetched messages in the case 
> of no-ack and letting the user choose to throw them away if that is 
> appropriate for the application.

I think by closing the session the application is saying it wants to 
quit. Perhaps the close on the MessageConsumer could do something like 
this... i.e. don't return from that close until all the messages have 
been pumped through the listener?

Re: Intermittent Test Failures [was: Re: M2 - let us try another "final" build]

Posted by Rafael Schloming <ra...@redhat.com>.
Gordon Sim wrote:
> Rafael Schloming wrote:
>> Robert Greig wrote:
>>> On 24/09/2007, Rafael Schloming <ra...@redhat.com> wrote:
>>>
>>>> That's true, however I would think that the expected JMS behavior would
>>>> be for connection close to do a clean shutdown.
>>>
>>> OK. Note that this is Session.close() so it will close the channel.
>>> Apart from processing all prefetched messages (which I think it's
>>> arguable is not what someone doing a close on a session would want),
>>> what do you think a clean shutdown of a consumer would involve?
>>
>> For 0-10 I would issue a message.cancel for all subscriptions, and do 
>> an execution.sync to confirm that they are all complete. At that point 
>> I know that no more messages will be arriving and then I would issue a 
>> message.release for all the prefetched messages.
>>
>> For 0-8 I would do something similar. Issue a synchronous basic.cancel 
>> for each subscription. When they are all complete, for strict AMQP 
>> mode I would then process all the prefetched messages, and for non 
>> strict AMQP mode I would add the release method from 0-10 and use that 
>> to release prefetched messages.
> 
> The only difference between an explicit 'release' and not acknowledging 
> a message is that the redelivered flag will be set in the latter case, 
> but not the former.

I don't think this is the only difference. A broker can DLQ unacked 
messages, but not released messages.

> In each case message ordering may be lost if there are other active 
> consumers on the same queue. At present the redelivered flag (which is a 
> warning that the message *may* have been delivered once already, not a 
> statement that it has) signals this; there isn't yet an equivalent to 
> indicate potential loss of order due to release (though that will 
> hopefully come).
> 
> While I think the addition of the release method is valuable, I see no 
> real benefit in trying to retrofit it into older implementations.

It may not be worthwhile for this case alone, however the Java client 
implementation currently uses reject in a variety of ways, and the 
intent is almost always to do a release, not an actual reject.

> In the case of the no-ack mode, the whole aim is to allow optimisation 
> of the case where redelivery is not required (e.g. often where a client 
> has its own exclusive queue representing a subscription).

That's a good point. Releasing prefetched messages in no-ack mode won't 
actually do anything since they may have already been discarded. Given 
that I would fall back to processing all prefetched messages in the case 
of no-ack and letting the user choose to throw them away if that is 
appropriate for the application.

--Rafael

Re: Intermittent Test Failures [was: Re: M2 - let us try another "final" build]

Posted by Gordon Sim <gs...@redhat.com>.
Rafael Schloming wrote:
> Robert Greig wrote:
>> On 24/09/2007, Rafael Schloming <ra...@redhat.com> wrote:
>>
>>> That's true, however I would think that the expected JMS behavior would
>>> be for connection close to do a clean shutdown.
>>
>> OK. Note that this is Session.close() so it will close the channel.
>> Apart from processing all prefetched messages (which I think it's
>> arguable is not what someone doing a close on a session would want),
>> what do you think a clean shutdown of a consumer would involve?
> 
> For 0-10 I would issue a message.cancel for all subscriptions, and do an 
> execution.sync to confirm that they are all complete. At that point I 
> know that no more messages will be arriving and then I would issue a 
> message.release for all the prefetched messages.
> 
> For 0-8 I would do something similar. Issue a synchronous basic.cancel 
> for each subscription. When they are all complete, for strict AMQP mode 
> I would then process all the prefetched messages, and for non strict 
> AMQP mode I would add the release method from 0-10 and use that to 
> release prefetched messages.

The only difference between an explicit 'release' and not acknowledging 
a message is that the redelivered flag will be set in the latter case, 
but not the former.

In each case message ordering may be lost if there are other active 
consumers on the same queue. At present the redelivered flag (which is a 
warning that the message *may* have been delivered once already, not a 
statement that it has) signals this; there isn't yet an equivalent to 
indicate potential loss of order due to release (though that will 
hopefully come).
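
At the JMS level that warning surfaces as getJMSRedelivered(), so a
consumer that cares has to treat it as a "possible duplicate" rather than
proof of prior processing. A minimal sketch of what that means for an
application:

    import java.util.HashSet;
    import java.util.Set;
    import javax.jms.JMSException;
    import javax.jms.Message;
    import javax.jms.MessageListener;

    // Illustration only: getJMSRedelivered() means the message *may* have
    // been delivered before, so the consumer still has to do its own
    // duplicate detection.
    public class DuplicateAwareListener implements MessageListener {
        // onMessage is serialised per session, so a plain set suffices here.
        private final Set<String> processedIds = new HashSet<String>();

        public void onMessage(Message message) {
            try {
                String id = message.getJMSMessageID();
                if (message.getJMSRedelivered() && processedIds.contains(id)) {
                    return; // already handled an earlier delivery of this message
                }
                // ... normal application processing ...
                processedIds.add(id);
            } catch (JMSException e) {
                // handle or log as appropriate
            }
        }
    }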

While I think the addition of the release method is valuable, I see no 
real benefit in trying to retrofit it into older implementations.

In the case of the no-ack mode, the whole aim is to allow optimisation 
of the case where redelivery is not required (e.g. often where a client 
has its own exclusive queue representing a subscription).

Re: Intermittent Test Failures [was: Re: M2 - let us try another "final" build]

Posted by Rafael Schloming <ra...@redhat.com>.
Robert Greig wrote:
> On 24/09/2007, Rafael Schloming <ra...@redhat.com> wrote:
> 
>> That's true, however I would think that the expected JMS behavior would
>> be for connection close to do a clean shutdown.
> 
> OK. Note that this is Session.close() so it will close the channel.
> Apart from processing all prefetched messages (which I think it's
> arguable is not what someone doing a close on a session would want),
> what do you think a clean shutdown of a consumer would involve?

For 0-10 I would issue a message.cancel for all subscriptions, and do an 
execution.sync to confirm that they are all complete. At that point I 
know that no more messages will be arriving and then I would issue a 
message.release for all the prefetched messages.

For 0-8 I would do something similar. Issue a synchronous basic.cancel 
for each subscription. When they are all complete, for strict AMQP mode 
I would then process all the prefetched messages, and for non strict 
AMQP mode I would add the release method from 0-10 and use that to 
release prefetched messages.

In all cases the session would be quiesced and it would be safe to issue 
a session.close().
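
In rough code, the 0-10 sequence above would look something like this;
the session interface and method names are placeholders for whatever the
client ends up exposing, not real API:

    // Placeholder for whatever the 0-10 client session API exposes; not real API.
    interface Amqp010Session {
        Iterable<String> getSubscriptions();
        void messageCancel(String destination);
        void executionSync();
        Iterable<Long> getUnprocessedTransfers();
        void messageRelease(long transferId);
        void close();
    }

    // Hedged sketch of the clean-shutdown sequence described above.
    public class ConsumerShutdown {
        public static void quiesce(Amqp010Session session) {
            // 1. message.cancel for every subscription so nothing new arrives.
            for (String destination : session.getSubscriptions()) {
                session.messageCancel(destination);
            }
            // 2. execution.sync to confirm the cancels are complete.
            session.executionSync();
            // 3. message.release for everything prefetched but not processed.
            for (long transferId : session.getUnprocessedTransfers()) {
                session.messageRelease(transferId);
            }
            // The session is now quiesced and it is safe to close it.
            session.close();
        }
    }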

I do agree that processing prefetched messages is not the ideal 
behavior, however it is the only one available if you want to strictly 
adhere to AMQP semantics, and I would expect it to also comply with JMS 
semantics presuming that you block the session.close() call until 
processing of prefetched messages is complete.

--Rafael

Re: Intermittent Test Failures [was: Re: M2 - let us try another "final" build]

Posted by Robert Greig <ro...@gmail.com>.
On 24/09/2007, Rafael Schloming <ra...@redhat.com> wrote:

> That's true, however I would think that the expected JMS behavior would
> be for connection close to do a clean shutdown.

OK. Note that this is Session.close() so it will close the channel.
Apart from processing all prefetched messages (which I think it's
arguable is not what someone doing a close on a session would want),
what do you think a clean shutdown of a consumer would involve?

RG

Re: Intermittent Test Failures [was: Re: M2 - let us try another "final" build]

Posted by Rafael Schloming <ra...@redhat.com>.
Robert Greig wrote:
> On 24/09/2007, Rafael Schloming <ra...@redhat.com> wrote:
> 
>> The java broker may not have a DLQ, but any broker being conservative
>> about exactly once semantics will need to have a DLQ for messages that
>> may have been processed by a client. Messages sent on a connection that
>> was aborted would fall into this category.
> 
> OK I can see your point but this implies also (I think?) that when the
> consumer calls close() that it processes any messages that have been
> prefetched. We certainly do not do that.
> 
> In JMS, close() is the only method that can be called by any thread
> and we simply stop processing. Are you suggesting that when you call
> close() on a session it should deliver all prefetched messages on all
> consumers?

In 0-10 there is a message.release that may be used to inform the broker 
that prefetched messages were not actually processed. I don't think 
there is a way to do this in 0-8 without either extending the spec or, 
as you suggest, processing all prefetched messages.

>> There is a difference between a clean shutdown and an abort. A clean
>> shutdown will always involve some sort of handshake. So while you
>> definitely want to be as graceful as possible in the case of an abort,
>> there will fundamentally be unresolved state without the handshake, and
>> many applications will not be able to tolerate that unresolved state.
> 
> Any application that needs that can call close() explicitly on the consumer.

That's true, however I would think that the expected JMS behavior would 
be for connection close to do a clean shutdown.

--Rafael


Re: Intermittent Test Failures [was: Re: M2 - let us try another "final" build]

Posted by Robert Greig <ro...@gmail.com>.
On 24/09/2007, Rafael Schloming <ra...@redhat.com> wrote:

> The java broker may not have a DLQ, but any broker being conservative
> about exactly once semantics will need to have a DLQ for messages that
> may have been processed by a client. Messages sent on a connection that
> was aborted would fall into this category.

OK I can see your point but this implies also (I think?) that when the
consumer calls close() that it processes any messages that have been
prefetched. We certainly do not do that.

In JMS, close() is the only method that can be called by any thread
and we simply stop processing. Are you suggesting that when you call
close() on a session it should deliver all prefetched messages on all
consumers?

> There is a difference between a clean shutdown and an abort. A clean
> shutdown will always involve some sort of handshake. So while you
> definitely want to be as graceful as possible in the case of an abort,
> there will fundamentally be unresolved state without the handshake, and
> many applications will not be able to tolerate that unresolved state.

Any application that needs that can call close() explicitly on the consumer.

RG

Re: Intermittent Test Failures [was: Re: M2 - let us try another "final" build]

Posted by Rafael Schloming <ra...@redhat.com>.
Robert Greig wrote:
> On 24/09/2007, Rafael Schloming <ra...@redhat.com> wrote:
> 
>> Strictly speaking I think you do need to send a basic.cancel. Without
>> sending a basic.cancel and getting confirmation that the cancel is
>> complete the broker will still be attempting to transmit messages to a
>> client when the close occurs. If this happens when there is active
> message flow then there will be pending messages when the close occurs and
>> depending on how a broker behaves, this could cause messages to be
>> unnecessarily DLQed, or unnecessarily lost in the case of no-ack.
> 
> Hmm. There is no DLQ though? Also if you have no-ack there is risk of
> message loss built into that?

The java broker may not have a DLQ, but any broker being conservative 
about exactly once semantics will need to have a DLQ for messages that 
may have been processed by a client. Messages sent on a connection that 
was aborted would fall into this category.

As for no-ack, there is a big difference between losing messages when 
the network dies, and losing messages whenever you close a connection. 
There are many applications that can tolerate the former, but not the 
latter.

> My logic was that it *must* work when you ctrl-C or kill -9  the client.

There is a difference between a clean shutdown and an abort. A clean 
shutdown will always involve some sort of handshake. So while you 
definitely want to be as graceful as possible in the case of an abort, 
there will fundamentally be unresolved state without the handshake, and 
many applications will not be able to tolerate that unresolved state.

--Rafael

Re: Intermittent Test Failures [was: Re: M2 - let us try another "final" build]

Posted by Robert Greig <ro...@gmail.com>.
On 24/09/2007, Rafael Schloming <ra...@redhat.com> wrote:

> Strictly speaking I think you do need to send a basic.cancel. Without
> sending a basic.cancel and getting confirmation that the cancel is
> complete the broker will still be attempting to transmit messages to a
> client when the close occurs. If this happens when there is active
> message flow then there will be pending messages when the close occurs and
> depending on how a broker behaves, this could cause messages to be
> unnecessarily DLQed, or unnecessarily lost in the case of no-ack.

Hmm. There is no DLQ though? Also if you have no-ack there is risk of
message loss built into that?

My logic was that it *must* work when you ctrl-C or kill -9  the client.

RG

Re: Intermittent Test Failures [was: Re: M2 - let us try another "final" build]

Posted by Rafael Schloming <ra...@redhat.com>.
Robert Greig wrote:
> On 22/09/2007, Robert Greig <ro...@gmail.com> wrote:
>> I managed to get a threaddump and it shows yet another deadlock
>> involving the dispatcher. Details are attached to QPID-589.
> 
> Looking at the deadlock, it occurs because during session close, it
> sends Basic.Cancel for each consumer, and the Basic.Cancel-Ok handler
> (on a separate thread) calls Dispatcher.rejectPending which in turn
> tries to acquire the dispatcher lock. Sadly the dispatcher lock is
> already held by dispatcher.run(). Dispatcher.run is trying to acquire
> the messageDeliveryLock, which is already held by the close method is
> AMQSession.
> 
> I couldn't spot an obvious solution involving reordering of locks.
> However it did occur to me that it was not necessary to send a
> Basic.Cancel where we are about to close the entire session (AMQP
> channel).
> 
> Does anyone disagree and think we have to send Basic.Cancel?
> 
> I have committed a change to the M2 branch so that it does not send
> Basic.Cancel where the session is closing and so far on our continuous
> build there have been no test failures or deadlocks. If it turns out
> that someone knows why we must send Basic.Cancel then I will obviously
> back out that change.

Strictly speaking I think you do need to send a basic.cancel. Without 
sending a basic.cancel and getting confirmation that the cancel is 
complete the broker will still be attempting to transmit messages to a 
client when the close occurs. If this happens when there is active 
message flow then there will be pending messages when the close occurs and 
depending on how a broker behaves, this could cause messages to be 
unnecessarily DLQed, or unnecessarily lost in the case of no-ack.
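
To make that ordering concrete, something along these lines is what a
conservative consumer close looks like. This is only a sketch: the
ConsumerChannel interface and its method names are made up for
illustration and are not the real client API.

public class GracefulConsumerClose
{
    // Hypothetical stand-in for whatever the client uses to speak AMQP.
    public interface ConsumerChannel
    {
        void sendBasicCancel(String consumerTag);
        void awaitBasicCancelOk(String consumerTag, long timeoutMillis)
                throws InterruptedException;
        void close();
    }

    public static void close(ConsumerChannel channel, String consumerTag)
            throws InterruptedException
    {
        // 1. Ask the broker to stop delivering to this consumer.
        channel.sendBasicCancel(consumerTag);

        // 2. Wait for Basic.Cancel-Ok: after this the broker knows the
        //    consumer is gone, so nothing is left in flight to be DLQed
        //    or silently dropped in the no-ack case.
        channel.awaitBasicCancelOk(consumerTag, 5000L);

        // 3. Only then tear the channel down.
        channel.close();
    }
}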

--Rafael

Re: Intermittent Test Failures [was: Re: M2 - let us try another "final" build]

Posted by Robert Greig <ro...@gmail.com>.
On 22/09/2007, Robert Greig <ro...@gmail.com> wrote:
> I managed to get a threaddump and it shows yet another deadlock
> involving the dispatcher. Details are attached to QPID-589.

Looking at the deadlock, it occurs because during session close, it
sends Basic.Cancel for each consumer, and the Basic.Cancel-Ok handler
(on a separate thread) calls Dispatcher.rejectPending which in turn
tries to acquire the dispatcher lock. Sadly the dispatcher lock is
already held by dispatcher.run(). Dispatcher.run is trying to acquire
the messageDeliveryLock, which is already held by the close method in
AMQSession.
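
For anyone following along, here is a minimal structural sketch of that
cycle. The class, lock and method names only mirror the description above;
this is not the actual client code, just an illustration of why the three
threads end up waiting on each other:

import java.util.concurrent.CountDownLatch;

public class CloseDeadlockSketch
{
    private final Object messageDeliveryLock = new Object();
    private final Object dispatcherLock = new Object();

    // Stands in for the blocking wait on the broker's Basic.Cancel-Ok reply.
    private final CountDownLatch cancelOkReceived = new CountDownLatch(1);

    // Thread 1: session close. Holds messageDeliveryLock, sends Basic.Cancel
    // for each consumer and then blocks until Cancel-Ok arrives.
    void sessionClose() throws InterruptedException
    {
        synchronized (messageDeliveryLock)
        {
            // send Basic.Cancel for each consumer here...
            cancelOkReceived.await();          // stuck: the handler never signals
        }
    }

    // Thread 2: the Basic.Cancel-Ok handler. Needs the dispatcher lock
    // (via rejectPending) before it can complete the cancel.
    void cancelOkHandler()
    {
        synchronized (dispatcherLock)          // stuck: held by the dispatcher
        {
            cancelOkReceived.countDown();      // would unblock close(), never runs
        }
    }

    // Thread 3: the dispatcher's run loop. Holds the dispatcher lock and
    // needs the messageDeliveryLock to deliver the next message.
    void dispatcherRun()
    {
        synchronized (dispatcherLock)
        {
            synchronized (messageDeliveryLock) // stuck: held by close()
            {
                // deliver pending messages here...
            }
        }
    }
}

Each of the three threads is waiting on something one of the others holds,
so none of them can make progress.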

I couldn't spot an obvious solution involving reordering of locks.
However it did occur to me that it was not necessary to send a
Basic.Cancel where we are about to close the entire session (AMQP
channel).

Does anyone disagree and think we have to send Basic.Cancel?

I have committed a change to the M2 branch so that it does not send
Basic.Cancel where the session is closing and so far on our continuous
build there have been no test failures or deadlocks. If it turns out
that someone knows why we must send Basic.Cancel then I will obviously
back out that change.

RG

Re: Intermittent Test Failures [was: Re: M2 - let us try another "final" build]

Posted by Robert Greig <ro...@gmail.com>.
On 22/09/2007, Robert Greig <ro...@gmail.com> wrote:
> On 21/09/2007, Nuno Santos <ns...@redhat.com> wrote:
>
> > Here it's actually hanging at that very test,
> > org.apache.qpid.test.unit.client.forwardall.CombinedTest. There is no
> > output in the surefire logs; it just hangs indefinitely. It may be a
> > local issue; I'll try to troubleshoot further.
>
> If you can get a thread dump that would be very useful. If you could
> attach it to QPID-589 that would be ideal.

I managed to get a threaddump and it shows yet another deadlock
involving the dispatcher. Details are attached to QPID-589.

RG

Re: Intermittent Test Failures [was: Re: M2 - let us try another "final" build]

Posted by Robert Greig <ro...@gmail.com>.
On 21/09/2007, Nuno Santos <ns...@redhat.com> wrote:

> Here it's actually hanging at that very test,
> org.apache.qpid.test.unit.client.forwardall.CombinedTest. There is no
> output in the surefire logs; it just hangs indefinitely. It may be a
> local issue; I'll try to troubleshoot further.

If you can get a thread dump that would be very useful. If you could
attach it to QPID-589 that would be ideal.
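
If attaching a debugger or sending the JVM a signal is awkward on that box,
a small helper like the one below, built on the standard ThreadMXBean API,
can also report monitor deadlocks from inside the test JVM. It is only a
suggestion, not something already in the test suite:

import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.lang.management.ThreadMXBean;

public class DeadlockReporter
{
    // Prints any threads the JVM considers deadlocked on object monitors.
    // Call it from a watchdog thread or a test timeout handler.
    public static void report()
    {
        ThreadMXBean mbean = ManagementFactory.getThreadMXBean();
        long[] ids = mbean.findMonitorDeadlockedThreads();   // null if none
        if (ids == null)
        {
            System.out.println("No monitor deadlock detected");
            return;
        }
        ThreadInfo[] infos = mbean.getThreadInfo(ids);
        for (int i = 0; i < infos.length; i++)
        {
            if (infos[i] == null)
            {
                continue;                                     // thread has since died
            }
            System.out.println(infos[i].getThreadName()
                    + " blocked on " + infos[i].getLockName()
                    + " held by " + infos[i].getLockOwnerName());
        }
    }
}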

RG

Re: Intermittent Test Failures [was: Re: M2 - let us try another "final" build]

Posted by Nuno Santos <ns...@redhat.com>.
Robert Greig wrote:
> On 21/09/2007, Robert Greig <ro...@gmail.com> wrote:
> 
>>> Running org.apache.qpid.test.unit.client.forwardall.CombinedTest
>>> Tests run: 1, Failures: 0, Errors: 1, Skipped: 0, Time elapsed: 10.024
>>> sec <<< FAILURE!
> 
> In diagnosing this I found QPID-607. I am not sure if it is directly
> related but it is certainly not good behaviour.
> 
> I have checked in a fix for QPID-607 and for now at least our M2
> continuous build is passing all tests.
> 
> How does the RH continuous build look now?

Here it's actually hanging at that very test,
org.apache.qpid.test.unit.client.forwardall.CombinedTest. There is no
output in the surefire logs; it just hangs indefinitely. It may be a
local issue; I'll try to troubleshoot further.

Nuno

Re: Intermittent Test Failures [was: Re: M2 - let us try another "final" build]

Posted by Robert Greig <ro...@gmail.com>.
On 21/09/2007, Robert Greig <ro...@gmail.com> wrote:

> > Running org.apache.qpid.test.unit.client.forwardall.CombinedTest
> > Tests run: 1, Failures: 0, Errors: 1, Skipped: 0, Time elapsed: 10.024
> > sec <<< FAILURE!

In diagnosing this I found QPID-607. I am not sure if it is directly
related but it is certainly not good behaviour.

I have checked in a fix for QPID-607 and for now at least our M2
continuous build is passing all tests.

How does the RH continuous build look now?

RG

Re: Intermittent Test Failures [was: Re: M2 - let us try another "final" build]

Posted by Robert Greig <ro...@gmail.com>.
On 21/09/2007, Carl Trieloff <cc...@redhat.com> wrote:

> this is our current failure...
>
> Running org.apache.qpid.test.unit.client.forwardall.CombinedTest
> Tests run: 1, Failures: 0, Errors: 1, Skipped: 0, Time elapsed: 10.024
> sec <<< FAILURE!
>
> If it is different for you I will post the log.

No, we see this one too...

RG

Re: Intermittent Test Failures [was: Re: M2 - let us try another "final" build]

Posted by Carl Trieloff <cc...@redhat.com>.
>
> This should be fixed now.
>
> However we are still seeing some other occasional failures on our
> continuous build so this isn't over yet...
>
>   

this is our current failure...

Running org.apache.qpid.test.unit.client.forwardall.CombinedTest
Tests run: 1, Failures: 0, Errors: 1, Skipped: 0, Time elapsed: 10.024 
sec <<< FAILURE!

If it is different for you I will post the log.

Carl.




Re: Intermittent Test Failures [was: Re: M2 - let us try another "final" build]

Posted by Robert Greig <ro...@gmail.com>.
On 20/09/2007, Robert Greig <ro...@gmail.com> wrote:

> > $ cat org.apache.qpid.server.AMQBrokerManagerMBeanTest.txt
> > -------------------------------------------------------------------------------
> > Test set: org.apache.qpid.server.AMQBrokerManagerMBeanTest
> > -------------------------------------------------------------------------------
> > Tests run: 2, Failures: 0, Errors: 1, Skipped: 0, Time elapsed: 0.006
> > sec <<< FAILURE!
> > testExchangeOperations(org.apache.qpid.server.AMQBrokerManagerMBeanTest)
> >  Time elapsed: 0.003 sec  <<< ERROR!
> > java.lang.NullPointerException

This should be fixed now.

However we are still seeing some other occasional failures on our
continuous build so this isn't over yet...

RG

Re: Intermittent Test Failures [was: Re: M2 - let us try another "final" build]

Posted by Robert Greig <ro...@gmail.com>.
On 20/09/2007, Nuno Santos <ns...@redhat.com> wrote:

> Another surefire error from the latest M2 build:
>
> $ cat org.apache.qpid.server.AMQBrokerManagerMBeanTest.txt
> -------------------------------------------------------------------------------
> Test set: org.apache.qpid.server.AMQBrokerManagerMBeanTest
> -------------------------------------------------------------------------------
> Tests run: 2, Failures: 0, Errors: 1, Skipped: 0, Time elapsed: 0.006
> sec <<< FAILURE!
> testExchangeOperations(org.apache.qpid.server.AMQBrokerManagerMBeanTest)
>  Time elapsed: 0.003 sec  <<< ERROR!
> java.lang.NullPointerException

This is not brand new; it has been happening on our continuous build
for several days now. We had another, more pressing issue: builds hanging
due to a race condition, for which I have just applied a fix on both the
M2 and M2.1 branches.

Hopefully we'll get round to fixing the above tomorrow.

RG

Re: Intermittent Test Failures [was: Re: M2 - let us try another "final" build]

Posted by Nuno Santos <ns...@redhat.com>.
Carl Trieloff wrote:
> ------------------------------------------------------------------------------- 
> 
> Test set: org.apache.qpid.test.client.QueueBrowserTest
> ------------------------------------------------------------------------------- 

Another surefire error from the latest M2 build:

$ cat org.apache.qpid.server.AMQBrokerManagerMBeanTest.txt
-------------------------------------------------------------------------------
Test set: org.apache.qpid.server.AMQBrokerManagerMBeanTest
-------------------------------------------------------------------------------
Tests run: 2, Failures: 0, Errors: 1, Skipped: 0, Time elapsed: 0.006 
sec <<< FAILURE!
testExchangeOperations(org.apache.qpid.server.AMQBrokerManagerMBeanTest) 
  Time elapsed: 0.003 sec  <<< ERROR!
java.lang.NullPointerException
         at 
org.apache.qpid.server.AMQBrokerManagerMBeanTest.setUp(AMQBrokerManagerMBeanTest.java:89)
         at junit.framework.TestCase.runBare(TestCase.java:125)
         at junit.framework.TestResult$1.protect(TestResult.java:106)
         at junit.framework.TestResult.runProtected(TestResult.java:124)
         at junit.framework.TestResult.run(TestResult.java:109)
         at junit.framework.TestCase.run(TestCase.java:118)
         at junit.framework.TestSuite.runTest(TestSuite.java:208)
         at junit.framework.TestSuite.run(TestSuite.java:203)
         at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
         at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
         at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
         at java.lang.reflect.Method.invoke(Method.java:585)
         at 
org.apache.maven.surefire.junit.JUnitTestSet.execute(JUnitTestSet.java:210)
         at 
org.apache.maven.surefire.suite.AbstractDirectoryTestSuite.executeTestSet(AbstractDirectoryTestSuite.java:135)
         at 
org.apache.maven.surefire.suite.AbstractDirectoryTestSuite.execute(AbstractDirectoryTestSuite.java:122)
         at org.apache.maven.surefire.Surefire.run(Surefire.java:129)
         at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
         at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
         at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
         at java.lang.reflect.Method.invoke(Method.java:585)
         at 
org.apache.maven.surefire.booter.SurefireBooter.runSuitesInProcess(SurefireBooter.java:225)
         at 
org.apache.maven.surefire.booter.SurefireBooter.run(SurefireBooter.java:139)
         at 
org.apache.maven.plugin.surefire.SurefirePlugin.execute(SurefirePlugin.java:376)
         at 
org.apache.maven.plugin.DefaultPluginManager.executeMojo(DefaultPluginManager.java:412)
         at 
org.apache.maven.lifecycle.DefaultLifecycleExecutor.executeGoals(DefaultLifecycleExecutor.java:534)
         at 
org.apache.maven.lifecycle.DefaultLifecycleExecutor.executeGoalWithLifecycle(DefaultLifecycleExecutor.java:475)
         at 
org.apache.maven.lifecycle.DefaultLifecycleExecutor.executeGoal(DefaultLifecycleExecutor.java:454)
         at 
org.apache.maven.lifecycle.DefaultLifecycleExecutor.executeGoalAndHandleFailures(DefaultLifecycleExecutor.java:306)
         at 
org.apache.maven.lifecycle.DefaultLifecycleExecutor.executeTaskSegments(DefaultLifecycleExecutor.java:273)
         at 
org.apache.maven.lifecycle.DefaultLifecycleExecutor.execute(DefaultLifecycleExecutor.java:140)
         at org.apache.maven.DefaultMaven.doExecute(DefaultMaven.java:322)
         at org.apache.maven.DefaultMaven.execute(DefaultMaven.java:115)
         at org.apache.maven.cli.MavenCli.main(MavenCli.java:256)
         at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
         at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
         at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
         at java.lang.reflect.Method.invoke(Method.java:585)
         at 
org.codehaus.classworlds.Launcher.launchEnhanced(Launcher.java:315)
         at org.codehaus.classworlds.Launcher.launch(Launcher.java:255)
         at 
org.codehaus.classworlds.Launcher.mainWithExitCode(Launcher.java:430)
         at org.codehaus.classworlds.Launcher.main(Launcher.java:375)


Re: Intermittent Test Failures [was: Re: M2 - let us try another "final" build]

Posted by Carl Trieloff <cc...@redhat.com>.
-------------------------------------------------------------------------------
Test set: org.apache.qpid.test.client.QueueBrowserTest
-------------------------------------------------------------------------------
Tests run: 2, Failures: 0, Errors: 1, Skipped: 0, Time elapsed: 60.084 sec <<< FAILURE!
testDummyinVMTestCase(org.apache.qpid.test.client.QueueBrowserTest)  Time elapsed: 60.021 sec  <<< ERROR!
javax.jms.JMSException: Error creating connection: State not achieved within permitted time.  Current state AMQState: id = 6 name: CONNECTION_CLOSED, desired state: AMQState: id = 4 name: CONNECTION_OPEN
	at org.apache.qpid.client.AMQConnectionFactory.createConnection(AMQConnectionFactory.java:270)
	at org.apache.qpid.test.client.QueueBrowserTest.setUp(QueueBrowserTest.java:59)
	at junit.framework.TestCase.runBare(TestCase.java:125)
	at junit.framework.TestResult$1.protect(TestResult.java:106)
	at junit.framework.TestResult.runProtected(TestResult.java:124)
	at junit.framework.TestResult.run(TestResult.java:109)
	at junit.framework.TestCase.run(TestCase.java:118)
	at junit.framework.TestSuite.runTest(TestSuite.java:208)
	at junit.framework.TestSuite.run(TestSuite.java:203)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
	at java.lang.reflect.Method.invoke(Method.java:585)
	at org.apache.maven.surefire.junit.JUnitTestSet.execute(JUnitTestSet.java:210)
	at org.apache.maven.surefire.suite.AbstractDirectoryTestSuite.executeTestSet(AbstractDirectoryTestSuite.java:135)
	at org.apache.maven.surefire.suite.AbstractDirectoryTestSuite.execute(AbstractDirectoryTestSuite.java:122)
	at org.apache.maven.surefire.Surefire.run(Surefire.java:129)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
	at java.lang.reflect.Method.invoke(Method.java:585)
	at org.apache.maven.surefire.booter.SurefireBooter.runSuitesInProcess(SurefireBooter.java:225)
	at org.apache.maven.surefire.booter.SurefireBooter.run(SurefireBooter.java:139)
	at org.apache.maven.plugin.surefire.SurefirePlugin.execute(SurefirePlugin.java:376)
	at org.apache.maven.plugin.DefaultPluginManager.executeMojo(DefaultPluginManager.java:412)
	at org.apache.maven.lifecycle.DefaultLifecycleExecutor.executeGoals(DefaultLifecycleExecutor.java:534)
	at org.apache.maven.lifecycle.DefaultLifecycleExecutor.executeGoalWithLifecycle(DefaultLifecycleExecutor.java:475)
	at org.apache.maven.lifecycle.DefaultLifecycleExecutor.executeGoal(DefaultLifecycleExecutor.java:454)
	at org.apache.maven.lifecycle.DefaultLifecycleExecutor.executeGoalAndHandleFailures(DefaultLifecycleExecutor.java:306)
	at org.apache.maven.lifecycle.DefaultLifecycleExecutor.executeTaskSegments(DefaultLifecycleExecutor.java:273)
	at org.apache.maven.lifecycle.DefaultLifecycleExecutor.execute(DefaultLifecycleExecutor.java:140)
	at org.apache.maven.DefaultMaven.doExecute(DefaultMaven.java:322)
	at org.apache.maven.DefaultMaven.execute(DefaultMaven.java:115)
	at org.apache.maven.cli.MavenCli.main(MavenCli.java:256)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
	at java.lang.reflect.Method.invoke(Method.java:585)
	at org.codehaus.classworlds.Launcher.launchEnhanced(Launcher.java:315)
	at org.codehaus.classworlds.Launcher.launch(Launcher.java:255)
	at org.codehaus.classworlds.Launcher.mainWithExitCode(Launcher.java:430)
	at org.codehaus.classworlds.Launcher.main(Launcher.java:375)


Re: Intermittent Test Failures [was: Re: M2 - let us try another "final" build]

Posted by Gordon Sim <gs...@redhat.com>.
Carl Trieloff wrote:
> 
>> Hi, It would be most helpful in resolving these intermittent failures
>> if people seeing these problems could post the Surefire reports.
> 
> I need to wait for Nuno to get something on my user-id fixed, at which
> point I will post more info.
> 
> This is the current failure.
> 
> Running org.apache.qpid.test.client.QueueBrowserTest
> Tests run: 2, Failures: 0, Errors: 1, Skipped: 0, Time elapsed: 60.084 
> sec <<< FAILURE!
> 
> 
That is from M2, after Robert merged in your patch from M2.1 (I 
believe). We also had CruiseControl hang on
org.apache.qpid.test.unit.basic.SelectorTest yesterday. I don't have any
details, I'm afraid; on restarting CruiseControl it didn't happen again.

Re: Intermittent Test Failures [was: Re: M2 - let us try another "final" build]

Posted by Carl Trieloff <cc...@redhat.com>.
> Hi, It would be most helpful in resolving these intermittent failures
> if people seeing these problems could post the Surefire reports.

I need to wait for Nuno to get something on my user-id fixed, at which
point I will post more info.

This is the current failure.

Running org.apache.qpid.test.client.QueueBrowserTest
Tests run: 2, Failures: 0, Errors: 1, Skipped: 0, Time elapsed: 60.084 
sec <<< FAILURE!