Posted to dev@qpid.apache.org by "Marcel Meulemans (JIRA)" <ji...@apache.org> on 2018/05/17 08:36:00 UTC

[jira] [Commented] (DISPATCH-966) Qpid dispatch unstable inter-router connections

    [ https://issues.apache.org/jira/browse/DISPATCH-966?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16478735#comment-16478735 ] 

Marcel Meulemans commented on DISPATCH-966:
-------------------------------------------

Sorry for the slow response, but I finally got around to doing some follow-up on this. It turns out the python exceptions are a side effect, not the actual problem (however, I'll try to reproduce the stack trace later, if only to improve the python response to the situation).

The actual problem is caused by this code (as far as I can see): [https://github.com/apache/qpid-dispatch/blob/master/src/message.c#L1168] ... In my situation, with 10000 clients (each with two unique addresses), the MAU messages exchanged between routers can become quite large, so large that the limit on the number of msg->content->buffers (checked by qd_message_Q2_holdoff_should_block) is hit. This holdoff is unblocked when buffers are freed up by sending them out, but as the MAU message is not being sent out, the holdoff is never unblocked. As a consequence, all communication on this link comes to a halt (some messages still arrive on the link until the credit is used up, but they are never processed by the router code) and eventually the network breaks down. It seems to me that this blocking should not occur on messages that are not going to be sent out. I verified my theory by increasing QD_QLIMIT_Q2_UPPER and observing that the problem goes away, but that is of course not a correct solution. I don't know enough about the router internals to propose a solution, other than that the qd_message_Q2_holdoff_should_block implementation ([https://github.com/apache/qpid-dispatch/blob/master/src/message.c#L1950]) should probably also take into account that not all messages are sent out to other destinations.
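
To make the stall described above concrete, here is a toy model in Python. It is only a sketch of the behaviour as I understand it, not the router's actual code: the class, the function names and both threshold values are made up for illustration, and the existence of a separate lower "resume" threshold is an assumption on my part.

```python
# Toy model of the Q2 holdoff stall; NOT the qpid-dispatch implementation.
# All names and numbers here are hypothetical and chosen for illustration.
Q2_UPPER = 8  # illustrative block threshold (not the real QD_QLIMIT_Q2_UPPER value)
Q2_LOWER = 4  # illustrative resume threshold (assumed to exist)


class ToyMessage:
    def __init__(self, forwarded):
        # forwarded=True: buffers are sent on to another destination.
        # forwarded=False: the message is consumed locally (like a MAU).
        self.forwarded = forwarded
        self.buffers = 0
        self.blocked = False

    def receive_buffer(self):
        """Accept one incoming buffer unless the holdoff is active."""
        if self.blocked:
            return False
        self.buffers += 1
        if self.buffers >= Q2_UPPER:
            self.blocked = True  # holdoff kicks in
        return True

    def send_out(self, count):
        """Free buffers by sending them out; a local message never does."""
        if not self.forwarded:
            return
        self.buffers = max(0, self.buffers - count)
        if self.buffers < Q2_LOWER:
            self.blocked = False  # holdoff released


def simulate(forwarded, incoming=20):
    """Feed `incoming` buffers into one message; return (accepted, blocked)."""
    msg = ToyMessage(forwarded)
    accepted = 0
    for _ in range(incoming):
        if msg.receive_buffer():
            accepted += 1
        msg.send_out(2)
    return accepted, msg.blocked


if __name__ == "__main__":
    print("forwarded message:", simulate(True))
    print("locally consumed message:", simulate(False))
```

In this model the forwarded message keeps draining and never blocks, while the locally consumed one stops accepting buffers once it reaches the upper threshold and stays blocked, because nothing ever frees its buffers.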

Btw, I have not been able to figure out how this leads to the initial error "Deliveries to a multicast address must be pre-settled". What I did notice is that proton trace logging shows inconsistent settlement flags for messages that are split over multiple transfer frames (see [^inconsistent-settlement.log]).

> Qpid dispatch unstable inter-router connections
> -----------------------------------------------
>
>                 Key: DISPATCH-966
>                 URL: https://issues.apache.org/jira/browse/DISPATCH-966
>             Project: Qpid Dispatch
>          Issue Type: Bug
>          Components: Routing Engine
>    Affects Versions: 1.0.1
>            Reporter: Marcel Meulemans
>            Assignee: Ted Ross
>            Priority: Blocker
>             Fix For: 1.1.0
>
>         Attachments: inconsistent-settlement.log, qdrouterd-unsettled-true.log, qdrouterd.conf, qdrouterd.log, router-unsettled-true.dump, router.dump
>
>
> I am running a three node fully connected mesh of dispatch routers with 10000 attached clients and I am seeing some unstable inter-router connections (I am sending around 1000 small, less than 1K, messages per second through the network). The inter-router connections fail every so many seconds with the message:
> {{Connection to router-2:55672 failed: amqp:session:invalid-field sequencing error, expected delivery-id 7, got 6}}
> (the numbers 7 and 6 differ per connection loss)
> In Wireshark, using the attached tcpdump capture, I can see that every time before the inter-router connection is dropped, there is a rejected disposition with the message:
> {{Condition: qd:forbidden}}
> {{Description: Deliveries to a multicast address must be pre-settled}}
> The routers are connected as follows:
>  * router-0 -> router-1
>  * router-0 -> router-2
>  * router-1 -> router-2
> The routers are running as a docker container (debian stretch) on google compute engine machines (every router on a separate node).
> Attached are:
>  * my qdrouterd.conf (from one of the routers)
>  * a log snippet from router-0 at debug level from connection drop to connection re-established to connection drop again.
>  * a tcpdump capture of the inter-router connection between router-0 and router-1 during which several of the failures occur
> Versions:
>  * qpid-dispatch@1.0.1-rc1
>  * qpid-proton@0.20.0
>  
> [^qdrouterd.log]
> [^qdrouterd.conf]
> [^router.dump]


