Posted to issues@nifi.apache.org by "Mark Payne (Jira)" <ji...@apache.org> on 2022/03/25 18:52:00 UTC

[jira] [Updated] (NIFI-9433) Load Balancer hangs

     [ https://issues.apache.org/jira/browse/NIFI-9433?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Mark Payne updated NIFI-9433:
-----------------------------
    Labels: connections hang load-balanced load-balanced-connections load-balancing negative-queue-size  (was: connections load-balanced-connections load-balancing)

> Load Balancer hangs
> -------------------
>
>                 Key: NIFI-9433
>                 URL: https://issues.apache.org/jira/browse/NIFI-9433
>             Project: Apache NiFi
>          Issue Type: Bug
>          Components: Core Framework
>    Affects Versions: 1.15.0
>            Reporter: Mark Bean
>            Assignee: Mark Payne
>            Priority: Critical
>              Labels: connections, hang, load-balanced, load-balanced-connections, load-balancing, negative-queue-size
>             Fix For: 1.16.0, 1.15.1
>
>          Time Spent: 20m
>  Remaining Estimate: 0h
>
> Simplified scenario to demonstrate problem:
> A 2-node cluster with a simple flow: GenerateFlowFile -> load-balanced connection -> UpdateAttribute. Separately, and not connected to those two processors: Funnel #1 -> non-load-balanced Connection -> Funnel #2.
> GenerateFlowFile is scheduled to run on Primary Node only. It is started. This causes the connection to be very busy load balancing (round robin). Then, the connection between the two funnels is removed.
> Immediately, an error is thrown, and the flow gets stuck in a state of constantly throwing errors indicating that a connection (the one just deleted) does not exist and cannot be balanced.
> It is unclear why this connection is being considered by the load balancer at all.
> The sequence of errors includes the following:
> Primary Node reports:
> 2021-12-02 12:20:03,812 ERROR [NiFi Web Server-1811] o.a.n.c.queue.SwappablePriorityQueue Updated Size of Queue Unacknowledged from FlowFile Queue Size[ ActiveQueue=[0, 0 Bytes], Swap Queue=[0, 0 Bytes], Swap Files=[0], Unacknowledged=[0, 0 Bytes] ] to FlowFile Queue Size[ ActiveQueue=[0, 0 Bytes], Swap Queue=[0, 0 Bytes], Swap Files=[0], Unacknowledged=[-206, -20600 Bytes] ]
> java.lang.RuntimeException: Cannot create negative queue size
> 2021-12-02 12:20:03,813 ERROR [NiFi Web Server-1811] o.a.n.c.queue.SwappablePriorityQueue Updated Size of Queue active from FlowFile Queue Size[ ActiveQueue=[0, 0 Bytes], Swap Queue=[0, 0 Bytes], Swap Files=[0], Unacknowledged=[-206, -20600 Bytes] ] to FlowFile Queue Size[ ActiveQueue=[206, 20600 Bytes], Swap Queue=[0, 0 Bytes], Swap Files=[0], Unacknowledged=[-206, -20600 Bytes] ]
> java.lang.RuntimeException: Cannot create negative queue size
> The above may be a symptom of subsequent errors in the log:
> Primary Node reports:
> 2021-12-02 12:20:03,814 ERROR [Load-Balanced Client Thread-6] o.a.n.c.q.c.c.a.n.NioAsyncLoadBalanceClient Failed to communicate with Peer <host:port>
> java.io.IOException: Failed to negotiate Protocol Version with Peer <host:port>. Recommended version 1 but instead of an ACCEPT or REJECT response got back a response of 33.
> Non-Primary Node reports:
> 2021-12-02 12:20:03,828 ERROR [Load-Balance Server Thread-4] o.a.n.c.q.c.s.ConnectionLoadBalanceServer Failed to communicate with Peer<fqdn/IP:port>
> java.io.IOException: Expected to receive Transaction Completion Indicator from Peer <fqdn> but instead received a value of 1
> The most concerning part is the following error, which indicates that a Connection that was not configured to load balance was attempting to receive a FlowFile.
> Non-Primary Node reports:
> 2021-12-02 12:29:05,228 ERROR [Load-Balance Server Thread-808] o.a.n.c.q.c.s.StandardLoadBalanceProtocol Attempted to receive FlowFiles from Peer <fqdn> for Connection with ID <uuid> but no connection exists with that ID.
> Note that the <uuid> value in this message corresponds to the Connection whose removal caused the errors to begin. Should the above message ever occur? Does the load balancer ever consider Connections which are configured as "Do not load balance"?
> Users have also reported that FlowFiles have been load balanced from one Connection to another, unrelated Connection on the other Node. (This is still being verified.)
> Finally, in the UI the load-balanced connection indicates that it is actively load balancing some number (206 in this case) of FlowFiles currently in the connection, yet attempts to "list queue" on this connection show no FlowFiles. Presumably they are being held by the load balancer and are inaccessible in the queue.
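
The two "Cannot create negative queue size" entries quoted above look consistent with 206 FlowFiles being moved from the unacknowledged count back to the active count at a moment when the unacknowledged count was already 0. That reading is an inference from the log text, not something the report confirms. The following is a minimal sketch of that kind of bookkeeping, with a guard that rejects the update before any counter changes. The class QueueSizeSketch and all of its methods are hypothetical names made up for illustration; this is not NiFi's SwappablePriorityQueue code.

```java
// Hypothetical illustration only; not taken from the NiFi code base.
final class QueueSizeSketch {
    private int activeCount;
    private int unacknowledgedCount;

    // FlowFiles arrive in the active queue.
    void enqueue(int count) {
        activeCount += count;
    }

    // FlowFiles are handed to a transfer and wait for acknowledgement.
    void markInFlight(int count) {
        if (count > activeCount) {
            throw new IllegalStateException("Cannot create negative queue size");
        }
        activeCount -= count;
        unacknowledgedCount += count;
    }

    // A failed transfer returns its FlowFiles to the active queue.
    // The check runs before either counter is modified, so a mismatched
    // call leaves the recorded sizes unchanged.
    void returnToActive(int count) {
        if (count > unacknowledgedCount) {
            throw new IllegalStateException("Cannot create negative queue size");
        }
        unacknowledgedCount -= count;
        activeCount += count;
    }

    int activeCount() {
        return activeCount;
    }

    int unacknowledgedCount() {
        return unacknowledgedCount;
    }
}
```

In this sketch, a second returnToActive(206) call for the same 206 FlowFiles is rejected and both counters stay as they were. The quoted log instead shows Unacknowledged at -206 as the starting state of the second entry, which suggests the sizes there were updated despite the error.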



--
This message was sent by Atlassian Jira
(v8.20.1#820001)