You are viewing a plain text version of this content. The canonical link for it is here.
Posted to users@qpid.apache.org by Pete Fawcett <pe...@fawcett.co.uk> on 2022/01/06 20:48:11 UTC

Qpid C++ Broker Segfault

I have recently encountered a problem with the Qpid C++ broker failing due
to a segfault.

I have done some analysis and believe I have found the cause.  I have a
proposed workaround,  but I also have a question about the desired
behaviour and wanted to see what others thought before submitting a
ticket/pull request.

In essence, the problem occurs when a session has 'pending' responses and
then a link error occurs.  When a link is re-established an attempt is made
to send the pending responses but, by that stage, they are invalid and a
NULL link pointer within a delivery object causes the segfault.

Here is some more detail. (Feel free to skip to the "Questions" below if
you don't want to wade through the detail yet)

The problem was initially found in broker version 1.39.0 running on CentOS
7 though I don't think the problem is platform specific. I have reproduced
it using the latest source from git.

The problem was shown up by a new AMQP 1.0 client program that I have been
developing.  There are two particular features of the client that I believe
contributed to the problem.  Firstly, it uses AMQP 1.0, and secondly it
sends messages "asynchronously" i.e. it sends multiple messages without
necessarily waiting for each one to be individually acknowledged.

The scenario being tested is sending messages to a broker.  The messages
are routed to a destination queue.  The queue has a qpid.max_size specified
(in this case 10,000 bytes) and a qpid.policy_type of 'reject'.

There are no subscribers to the queue so the messages will build up until
the limit is reached and should then be rejected.

The client program sends multiple messages with 1,000 bytes of payload
(1,087 bytes total size when on a queue)

The first nine messages get routed to the destination queue but the tenth
one won't fit.  At this point the broker returns a "Link Error" to the
client (with 'amqp:precondition-failed' / 'resource-limit-exceeded') and,
crucially, closes the link.

When the link error occurs, the client program tries to re-establish the
link and to re-send any unacknowledged messages.  If nothing has changed at
the broker end then the same link error occurs.  The client repeatedly
tries to re-connect (with a BackOff delay).

The next stage of the test is to remove some messages from the destination
queue (using the purge facility on the QMF console).  The hope being that
the client will successfully reconnect and send some more messages.

What actually happened was that, once the link is re-established, and data
starts to flow again, then the broker crashes with a SegFault.

The actual SegFault is in the proton library at pni_add_tpwork(pn_delivery_t
*delivery)
<https://github.com/apache/qpid-proton/blob/main/c/src/core/engine.c#L740> and
is caused by the 'link' pointer, within the passed 'delivery' being NULL.

The reason it is NULL is due, I think, to a logic error in the Session
handling in the AMQP 1.0 plugin Session.cpp
<https://github.com/apache/qpid-cpp/blob/main/src/qpid/broker/amqp/Session.cpp>

The session handling has some complex tasks to deal with, including
ensuring that messages are sent and received on the correct threads.

After tracing / debugging I think the following happens:


   - When an incoming message has been processed, and an 'accepted' state
   needs to be sent to the originator, a pointer to the relevant
   'delivery' object is placed in the 'pending' set. pending_accept()
   <https://github.com/apache/qpid-cpp/blob/main/src/qpid/broker/amqp/Session.cpp#L649>
   - The 'delivery' pointer is later removed from the pending set when
Session::accepted()

   <https://github.com/apache/qpid-cpp/blob/main/src/qpid/broker/amqp/Session.cpp#L677>is
   called with 'sync=True' and the delivery is accepted with a call to
   pn_delivery_update().
   - However, when the link error occurs there are still some delivery
   object pointers in the pending set.
   - After the queue is purged (and possibly after the link has been
   re-established - I'm not sure) then Session::accepted()
   <https://github.com/apache/qpid-cpp/blob/main/src/qpid/broker/amqp/Session.cpp#L677>
is
   called for those deliveries in the pending set, but this time with
   'sync=False' which causes the pointers to be moved to the 'completed'
   deque.  The problem is that the *delivery objects are no longer valid*.
   It looks like they have been re-initialised to 'empty' values. Specifically
   the 'link' pointer is NULL.  I am guessing that they have been cleared by
   some C proton processing related to the link being cleared and
   re-established.  Since the 'pending' set holds bare pointers to the
   delivery objects the proton library doesn't 'know' they are in use - and
   anyway the link they were referring to is no longer there.
   - With the invalid delivery object having being moved from the 'pending'
   set to the 'completed' deque, the stage is set for the final act when,
   during processing of the 'completed' deque within Session::dispatch()
   <https://github.com/apache/qpid-cpp/blob/main/src/qpid/broker/amqp/Session.cpp#L748>
an
   'empty' delivery object is passed to Session::accepted() and results in the
   Segfault at pni_add_tpwork()


*Questions:*

Firstly, a simple work around is to check the link pointer when moving
delivery pointers from 'pending' to completed and discarding them if the
'link' is NULL . Does this suffice?

Secondly, does the link need to be cleared just because a message is too
big for the queue? Can it not be signalled back to the sender in a less
drastic manner? (Dropping the link means it is likely for numerous messages
to be repeatedly delivered when using "at least once" delivery)

Thoughts and help would be very much appreciated.

Thanks

Pete

Re: Qpid C++ Broker Segfault

Posted by Gordon Sim <gs...@redhat.com>.
On Tue, Feb 8, 2022 at 8:11 PM Pete Fawcett <pe...@fawcett.co.uk> wrote:
> Would you prefer that I submit a Pull Request for the current fixes or wait
> until further changes are made?

Either way


---------------------------------------------------------------------
To unsubscribe, e-mail: users-unsubscribe@qpid.apache.org
For additional commands, e-mail: users-help@qpid.apache.org


Re: Qpid C++ Broker Segfault

Posted by Pete Fawcett <pe...@fawcett.co.uk>.
On Fri, 14 Jan 2022 at 13:59, Gordon Sim <gs...@redhat.com> wrote:

> On Fri, Jan 14, 2022 at 10:36 AM Pete Fawcett <pe...@fawcett.co.uk> wrote:
> > I tried this but it didn't, initially, fix the problem.
> > It turns out that the current exception handling is causing the link to
> be
> > closed from within Connection object doDeliveryUpdated
> > <
> https://github.com/apache/qpid-cpp/blob/main/src/qpid/broker/amqp/Connection.cpp#L675
> >
> > Calling Session::abort_pending() from there meant changing
> abort_pending()
> > to be a public function, which makes me wonder if it is the right thing
> to
> > be doing.
> > Making the change, and calling abort_pending() before closing the link in
> > doDeliveryUpdated, did remove the pending items and so prevent the
> Segfault.
>
> You could also put a try/catch in Session::readable[1], and call
> abort_pending before rethrowing. (Or later, send back a nack for the
> message instead of rethrowing for certain errors).
>
> [1]
> https://github.com/apache/qpid-cpp/blob/main/src/qpid/broker/amqp/Session.cpp#L697
>
>
Thank you for that helpful suggestion, Gordon.
Adding the suggested try/catch allowed the abort_pending to be called in
the correct place and so cleared up the pending deliveries before they
could cause a segfault.

I would like to go on now, as originally stated, and change the behaviour
when the 'reject' queue is full, so that the message is 'returned' rather
than dropping the link. (I believe this is what happens when using AMQP
0.10)

Would you prefer that I submit a Pull Request for the current fixes or wait
until further changes are made?

Thanks

Pete

Re: Qpid C++ Broker Segfault

Posted by Gordon Sim <gs...@redhat.com>.
On Fri, Jan 14, 2022 at 10:36 AM Pete Fawcett <pe...@fawcett.co.uk> wrote:
> I tried this but it didn't, initially, fix the problem.
> It turns out that the current exception handling is causing the link to be
> closed from within Connection object doDeliveryUpdated
> <https://github.com/apache/qpid-cpp/blob/main/src/qpid/broker/amqp/Connection.cpp#L675>
> Calling Session::abort_pending() from there meant changing abort_pending()
> to be a public function, which makes me wonder if it is the right thing to
> be doing.
> Making the change, and calling abort_pending() before closing the link in
> doDeliveryUpdated, did remove the pending items and so prevent the Segfault.

You could also put a try/catch in Session::readable[1], and call
abort_pending before rethrowing. (Or later, send back a nack for the
message instead of rethrowing for certain errors).

[1] https://github.com/apache/qpid-cpp/blob/main/src/qpid/broker/amqp/Session.cpp#L697


---------------------------------------------------------------------
To unsubscribe, e-mail: users-unsubscribe@qpid.apache.org
For additional commands, e-mail: users-help@qpid.apache.org


Re: Qpid C++ Broker Segfault

Posted by Pete Fawcett <pe...@fawcett.co.uk>.
On Tue, 11 Jan 2022 at 09:35, Pete Fawcett <pe...@fawcett.co.uk> wrote:

> Hi Gordon
>
> Thank you for the helpful response
>
> On Fri, 7 Jan 2022 at 11:18, Gordon Sim <gs...@redhat.com> wrote:
>
>> On Thu, Jan 6, 2022 at 8:56 PM Pete Fawcett <pe...@fawcett.co.uk> wrote:
>> > *Questions:*
>> >
>> > Firstly, a simple work around is to check the link pointer when moving
>> > delivery pointers from 'pending' to completed and discarding them if the
>> > 'link' is NULL . Does this suffice?
>>
>> I think that would probably work. However it would be nicer to
>> actually remove them from pending before they were invalidated. Does
>> calling abort_pending() for the link before closing it in
>>
>> https://github.com/apache/qpid-cpp/blob/main/src/qpid/broker/amqp/Session.cpp#L764
>> prevent the segfault?
>>
>
> I will try this.  Thank you for pointing out where to remove the pending
> items, I hadn't managed to figure that out.
>

I tried this but it didn't, initially, fix the problem.
It turns out that the current exception handling is causing the link to be
closed from within Connection object doDeliveryUpdated
<https://github.com/apache/qpid-cpp/blob/main/src/qpid/broker/amqp/Connection.cpp#L675>
Calling Session::abort_pending() from there meant changing abort_pending()
to be a public function, which makes me wonder if it is the right thing to
be doing.
Making the change, and calling abort_pending() before closing the link in
doDeliveryUpdated, did remove the pending items and so prevent the Segfault.

As well as being concerned about changing functions to be part of the
public interface, I would still have concerns that other cases may arise.
My instinct would be to keep my initial suggestion of checking for invalid
pending deliveries before they are actioned.  This would be in addition to
these attempts to clear the deliveries when the link closes.

I haven't yet looked at the prospect of not closing the link for the
particular case of a full queue, but I think the handling of the link error
needs to be "safe", regardless of other possible changes.


>
>
>
>> > Secondly, does the link need to be cleared just because a message is too
>> > big for the queue? Can it not be signalled back to the sender in a less
>> > drastic manner? (Dropping the link means it is likely for numerous
>> messages
>> > to be repeatedly delivered when using "at least once" delivery)
>>
>> The message could be rejected. (I think that would involve changing
>> the exception handling at
>>
>> https://github.com/apache/qpid-cpp/blob/main/src/qpid/broker/amqp/Session.cpp#L948
>> to not throw but to simply reject the message).
>>
>> I will also look at this, though I think the message should be "returned"
> rather than "rejected"
>
> It only just occurred to me that the reason there are items still pending
> is probably due to flow control - I had been thinking it was a data race -
> so I'll need to keep that in mind when attempting to return a status to the
> sender.
>
> Thanks again
>
> Pete
>
>
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: users-unsubscribe@qpid.apache.org
>> For additional commands, e-mail: users-help@qpid.apache.org
>>
>>

Re: Qpid C++ Broker Segfault

Posted by Pete Fawcett <pe...@fawcett.co.uk>.
Hi Gordon

Thank you for the helpful response

On Fri, 7 Jan 2022 at 11:18, Gordon Sim <gs...@redhat.com> wrote:

> On Thu, Jan 6, 2022 at 8:56 PM Pete Fawcett <pe...@fawcett.co.uk> wrote:
> > *Questions:*
> >
> > Firstly, a simple work around is to check the link pointer when moving
> > delivery pointers from 'pending' to completed and discarding them if the
> > 'link' is NULL . Does this suffice?
>
> I think that would probably work. However it would be nicer to
> actually remove them from pending before they were invalidated. Does
> calling abort_pending() for the link before closing it in
>
> https://github.com/apache/qpid-cpp/blob/main/src/qpid/broker/amqp/Session.cpp#L764
> prevent the segfault?
>

I will try this.  Thank you for pointing out where to remove the pending
items, I hadn't managed to figure that out.



> > Secondly, does the link need to be cleared just because a message is too
> > big for the queue? Can it not be signalled back to the sender in a less
> > drastic manner? (Dropping the link means it is likely for numerous
> messages
> > to be repeatedly delivered when using "at least once" delivery)
>
> The message could be rejected. (I think that would involve changing
> the exception handling at
>
> https://github.com/apache/qpid-cpp/blob/main/src/qpid/broker/amqp/Session.cpp#L948
> to not throw but to simply reject the message).
>
> I will also look at this, though I think the message should be "returned"
rather than "rejected"

It only just occurred to me that the reason there are items still pending
is probably due to flow control - I had been thinking it was a data race -
so I'll need to keep that in mind when attempting to return a status to the
sender.

Thanks again

Pete


> ---------------------------------------------------------------------
> To unsubscribe, e-mail: users-unsubscribe@qpid.apache.org
> For additional commands, e-mail: users-help@qpid.apache.org
>
>

Re: Qpid C++ Broker Segfault

Posted by Gordon Sim <gs...@redhat.com>.
On Thu, Jan 6, 2022 at 8:56 PM Pete Fawcett <pe...@fawcett.co.uk> wrote:
> *Questions:*
>
> Firstly, a simple work around is to check the link pointer when moving
> delivery pointers from 'pending' to completed and discarding them if the
> 'link' is NULL . Does this suffice?

I think that would probably work. However it would be nicer to
actually remove them from pending before they were invalidated. Does
calling abort_pending() for the link before closing it in
https://github.com/apache/qpid-cpp/blob/main/src/qpid/broker/amqp/Session.cpp#L764
prevent the segfault?

> Secondly, does the link need to be cleared just because a message is too
> big for the queue? Can it not be signalled back to the sender in a less
> drastic manner? (Dropping the link means it is likely for numerous messages
> to be repeatedly delivered when using "at least once" delivery)

The message could be rejected. (I think that would involve changing
the exception handling at
https://github.com/apache/qpid-cpp/blob/main/src/qpid/broker/amqp/Session.cpp#L948
to not throw but to simply reject the message).


---------------------------------------------------------------------
To unsubscribe, e-mail: users-unsubscribe@qpid.apache.org
For additional commands, e-mail: users-help@qpid.apache.org