Posted to users@nifi.apache.org by ddewaele <dd...@gmail.com> on 2017/05/18 11:30:06 UTC

Nifi Cluster fails to disconnect node when node was killed

Hi,

I have a NiFi cluster up and running and I'm testing various failover
scenarios.

I have 2 nodes in the cluster:

- centos-a : Coordinator node / primary
- centos-b : Cluster node

In one of the scenarios, where I killed the Cluster Coordinator node, I
noticed that the following happened:

centos-b could no longer contact the coordinator and became the new
coordinator / primary node, as expected:

Failed to send heartbeat due to:
org.apache.nifi.cluster.protocol.ProtocolException: Failed to send message
to Cluster Coordinator due to: java.net.ConnectException: Connection refused
(Connection refused)
This node has been elected Leader for Role 'Primary Node'
This node has been elected Leader for Role 'Cluster Coordinator'

When attempting to access the UI on centos-b, I got the following error:

2017-05-18 11:18:49,368 WARN [Replicate Request Thread-2]
o.a.n.c.c.h.r.ThreadPoolRequestReplicator Failed to replicate request GET
/nifi-api/flow/current-user to centos-a:8080 due to {}

If my understanding is correct, NiFi will try to replicate to connected
nodes in the cluster. Here, centos-a was killed a while back and should have
been disconnected, but as far as NiFi was concerned it was still connected.

As a result I cannot access the UI anymore (due to the replication error),
but I can look up the cluster info via the REST API. And sure enough, it
still sees centos-a as being CONNECTED.
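
For reference, the lookup is just a GET against the cluster endpoint, along
these lines (host and port from my setup):

curl -s http://centos-b:8080/nifi-api/controller/cluster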

{
    "cluster": {
        "generated": "11:20:13 UTC",
        "nodes": [
            {
                "activeThreadCount": 0,
                "address": "centos-b",
                "apiPort": 8080,
                "events": [
                    {
                        "category": "INFO",
                        "message": "Node Status changed from CONNECTING to
CONNECTED",
                        "timestamp": "05/18/2017 11:17:31 UTC"
                    },
                    {
                        "category": "INFO",
                        "message": "Node Status changed from [Unknown Node]
to CONNECTING",
                        "timestamp": "05/18/2017 11:17:27 UTC"
                    }
                ],
                "heartbeat": "05/18/2017 11:20:09 UTC",
                "nodeId": "a5bce78d-23ea-4435-a0dd-4b731459f1b9",
                "nodeStartTime": "05/18/2017 11:17:25 UTC",
                "queued": "8,492 / 13.22 MB",
                "roles": [
                    "Primary Node",
                    "Cluster Coordinator"
                ],
                "status": "CONNECTED"
            },
            {
                "address": "centos-a",
                "apiPort": 8080,
                "events": [],
                "nodeId": "b89e8418-4b7f-4743-bdf4-4a08a92f3892",
                "roles": [],
                "status": "CONNECTED"
            }
        ]
    }
}

When centos-a was brought back online, I noticed the following status
change:

Status of centos-a:8080 changed from
NodeConnectionStatus[nodeId=centos-a:8080, state=CONNECTED, updateId=15] to
NodeConnectionStatus[nodeId=centos-a:8080, state=CONNECTING, updateId=19]

So it went from connected -> connecting.

It clearly missed the disconnected step here.

When shutting down the centos-a node using nifi.sh stop, it goes into the
DISCONNECTED state:

Status of centos-a:8080 changed from
NodeConnectionStatus[nodeId=centos-a:8080, state=CONNECTED, updateId=12] to
NodeConnectionStatus[nodeId=centos-a:8080, state=DISCONNECTED, Disconnect
Code=Node was Shutdown, Disconnect Reason=Node was Shutdown, updateId=13]

How can I debug this further, and can somebody provide some additional
insights? I have seen nodes getting disconnected due to missing heartbeats:

Status of centos-a:8080 changed from
NodeConnectionStatus[nodeId=centos-a:8080, state=CONNECTED, updateId=10] to
NodeConnectionStatus[nodeId=centos-a:8080, state=DISCONNECTED, Disconnect
Code=Lack of Heartbeat, Disconnect Reason=Have not received a heartbeat from
node in 41 seconds, updateId=11]

But sometimes it doesn't seem to detect this, and NiFi keeps on thinking it
is CONNECTED, despite not having received heartbeats in ages.

Any ideas ?




Re: Nifi Cluster fails to disconnect node when node was killed

Posted by Matt Gilman <ma...@gmail.com>.
Thank you for the confirmation. I've filed this JIRA [1] to track the issue.

Matt

[1] https://issues.apache.org/jira/browse/NIFI-3933


Re: Nifi Cluster fails to disconnect node when node was killed

Posted by ddewaele <dd...@gmail.com>.
Thanks for the response. 

When killing a non-coordinator node, it does take 8 * 5 seconds before I see
this:

nifi-app.log:2017-05-18 12:04:29,644 INFO [Heartbeat Monitor Thread-1]
o.a.n.c.c.node.NodeClusterCoordinator Status of centos-b:8080 changed from
NodeConnectionStatus[nodeId=centos-b:8080, state=CONNECTED, updateId=26] to
NodeConnectionStatus[nodeId=centos-b:8080, state=DISCONNECTED, Disconnect
Code=Lack of Heartbeat, Disconnect Reason=Have not received a heartbeat from
node in 43 seconds, updateId=27]
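
If I read the defaults correctly, the 8 * 5 seconds comes from the heartbeat
interval in nifi.properties (the heartbeat monitor appears to allow 8 missed
intervals before disconnecting a node):

nifi.cluster.protocol.heartbeat.interval=5 sec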

When killing the coordinator node, the newly appointed coordinator doesn't
seem to detect the heartbeat timeout.

I'll see if I can enable the debug logging.
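
I assume adding something like the following logger to conf/logback.xml is
enough, since the heartbeat monitor classes live in that package:

<logger name="org.apache.nifi.cluster.coordination.heartbeat" level="DEBUG"/>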

My NiFi setup runs on KVM, with 3 separate VMs: an external ZooKeeper
(replicated mode) ensemble runs on all 3 VMs, and 2 of the VMs host the NiFi nodes.

I have the same issue in a dockerized environment.







Re: Nifi Cluster fails to disconnect node when node was killed

Posted by Neil Derraugh <ne...@intellifylearning.com>.
Worked like a charm, thanks!


Re: Nifi Cluster fails to disconnect node when node was killed

Posted by ddewaele <dd...@gmail.com>.
Sorry, the payload should also include the nodeId:

curl -v -X PUT -d
"{\"node\":{\"nodeId\":\"b89e8418-4b7f-4743-bdf4-4a08a92f3892\",\"status\":\"DISCONNECTING\"}}"
-H "Content-Type: application/json"
http://192.168.122.141:8080/nifi-api/controller/cluster/nodes/b89e8418-4b7f-4743-bdf4-4a08a92f3892 
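
If you don't have the node id handy, it is in the cluster response shown
earlier; assuming jq is available, something like this pulls it out:

curl -s http://192.168.122.141:8080/nifi-api/controller/cluster \
  | jq -r '.cluster.nodes[] | select(.address == "centos-a") | .nodeId'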




Re: Nifi Cluster fails to disconnect node when node was killed

Posted by ddewaele <dd...@gmail.com>.
You should be able to put it into the DISCONNECTED state with the following
call:

curl -v -X PUT -d "{\"node\":{\"status\":\"DISCONNECTING\"}}" -H
"Content-Type: application/json"
http://192.168.122.141:8080/nifi-api/controller/cluster/nodes/b89e8418-4b7f-4743-bdf4-4a08a92f3892

It should respond with an HTTP 200 and a message saying it went to state
DISCONNECTED.

That way you can access the GUI again and delete the node from the cluster
if you want to.
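
Deleting it should then be a matter of something like (same node id as above;
I haven't re-tested the delete itself):

curl -v -X DELETE http://192.168.122.141:8080/nifi-api/controller/cluster/nodes/b89e8418-4b7f-4743-bdf4-4a08a92f3892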

Tested this workaround with 1.2.0 and it works.




Re: Nifi Cluster fails to disconnect node when node was killed

Posted by Joe Witt <jo...@gmail.com>.
I see. Yeah, that sounds like something the JIRA Gilman mentioned will
resolve. Thanks for clarifying. I'm sure that JIRA will be addressed soon.

On May 19, 2017 1:06 PM, "Neil Derraugh" <
neil.derraugh@intellifylearning.com> wrote:

> That's the whole problem from my perspective: it stays CONNECTED.  It
> never becomes DISCONNECTED.  You can't delete it from the API in 1.2.0.
>
> That's why I said it was a single point of failure.  The exact semantics
> of calling it a single point of failure might be debatable, but the fact
> that the cluster can't be modified and/or gracefully shut down (afaik) is
> what I was referring to.
>
> On Fri, May 19, 2017 at 12:40 PM, Joe Witt <jo...@gmail.com> wrote:
>
>> I believe at the state you describe that down node is now considered
>> disconnected.  The cluster behavior prohibits you from making changes when
>> it knows that not all members of the cluster can honor the change.  If you
>> are sure you want to make the changes anyway and move on without that node
>> you should be able to remove it/delete it from the cluster.  Now you have a
>> cluster of two connected nodes and you can make changes.
>>
>> On May 19, 2017 12:23 PM, "Neil Derraugh" <neil.derraugh@intellifylearni
>> ng.com> wrote:
>>
>>> That's fair.  But for the sake of total clarity on my own part, after
>>> one of these disaster scenarios with a newly quorum-elected primary things
>>> cannot be driven through the UI, and at least not through parts of the REST API.
>>>
>>> I just ran through the following.  We have 3 nodes A, B, C with A
>>> primary, and A becomes unreachable without first disconnecting.  Then B and
>>> C may (I haven't verified) continue operating the flow they had in the
>>> cluster's last "good" state.  But they do elect a new primary, as per the
>>> REST nifi-api/controller/cluster response.  But now the flow can't be
>>> changed, and in some cases it can't be reported on either, i.e. some GETs
>>> fail, like nifi-api/flow/process-groups/root.
>>>
>>> Are we describing the same behavior?
>>>
>>> On Fri, May 19, 2017 at 11:12 AM, Joe Witt <jo...@gmail.com> wrote:
>>>
>>>> If there is no longer a quorum then we cannot drive things from the UI,
>>>> but the remaining cluster is intact from a functioning point of view,
>>>> other than being able to assign a primary to handle the one-off items.
>>>>
>>>> On Fri, May 19, 2017 at 11:04 AM, Neil Derraugh
>>>> <ne...@intellifylearning.com> wrote:
>>>> > Hi Joe,
>>>> >
>>>> > Maybe I'm missing something, but if the primary node suffers a network
>>>> > partition or container/vm/machine loss or becomes otherwise
>>>> unreachable then
>>>> > the cluster is unusable, at least from the UI.
>>>> >
>>>> > If that's not so please correct me.
>>>> >
>>>> > Thanks,
>>>> > Neil
>>>> >
>>>> > On Thu, May 18, 2017 at 9:56 PM, Joe Witt <jo...@gmail.com> wrote:
>>>> >>
>>>> >> Neil,
>>>> >>
>>>> >> Want to make sure I understand what you're saying.  What are you stating
>>>> >> is a single point of failure?
>>>> >>
>>>> >> Thanks
>>>> >> Joe
>>>> >>
>>>> >> On Thu, May 18, 2017 at 5:27 PM, Neil Derraugh
>>>> >> <ne...@intellifylearning.com> wrote:
>>>> >> > Thanks for the insight Matt.
>>>> >> >
>>>> >> > It's a disaster recovery issue.  It's not something I plan on
>>>> doing on
>>>> >> > purpose.  It seems it is a single point of failure unfortunately.
>>>> I can
>>>> >> > see
>>>> >> > no other way to resolve the issue other than to blow everything
>>>> away and
>>>> >> > start a new cluster.
>>>> >> >
>>>> >> > On Thu, May 18, 2017 at 2:49 PM, Matt Gilman <
>>>> matt.c.gilman@gmail.com>
>>>> >> > wrote:
>>>> >> >>
>>>> >> >> Neil,
>>>> >> >>
>>>> >> >> Disconnecting a node prior to removal is the correct process. It
>>>> >> >> appears
>>>> >> >> that the check was lost going from 0.x to 1.x. Folks reported
>>>> this JIRA
>>>> >> >> [1]
>>>> >> >> indicating that deleting a connected node did not work. This
>>>> process
>>>> >> >> does
>>>> >> >> not work because the node needs to be disconnected first. The
>>>> JIRA was
>>>> >> >> addressed by restoring the check that a node is disconnected
>>>> prior to
>>>> >> >> deletion.
>>>> >> >>
>>>> >> >> Hopefully the JIRA I filed earlier today [2] will address the
>>>> phantom
>>>> >> >> node
>>>> >> >> you were seeing. Until then, can you update your workaround to
>>>> >> >> disconnect
>>>> >> >> the node in question prior to deletion?
>>>> >> >>
>>>> >> >> Thanks
>>>> >> >>
>>>> >> >> Matt
>>>> >> >>
>>>> >> >> [1] https://issues.apache.org/jira/browse/NIFI-3295
>>>> >> >> [2] https://issues.apache.org/jira/browse/NIFI-3933
>>>> >> >>
>>>> >> >> On Thu, May 18, 2017 at 12:29 PM, Neil Derraugh
>>>> >> >> <ne...@intellifylearning.com> wrote:
>>>> >> >>>
>>>> >> >>> Pretty sure this is the problem I was describing in the "Phantom
>>>> Node"
>>>> >> >>> thread recently.
>>>> >> >>>
>>>> >> >>> If I kill non-primary nodes the cluster remains healthy despite
>>>> the
>>>> >> >>> lost
>>>> >> >>> nodes.  The terminated nodes end up with a DISCONNECTED status.
>>>> >> >>>
>>>> >> >>> If I kill the primary it winds up with a CONNECTED status, but a
>>>> new
>>>> >> >>> primary/cluster coordinator gets elected too.
>>>> >> >>>
>>>> >> >>> Additionally it seems in 1.2.0 that the REST API no longer supports
>>>> >> >>> deleting a node in a CONNECTED state (Cannot remove Node with ID
>>>> >> >>> 1780fde7-c2f4-469c-9884-fe843eac5b73 because it is not
>>>> disconnected,
>>>> >> >>> current
>>>> >> >>> state = CONNECTED).  So right now I don't have a workaround and
>>>> have
>>>> >> >>> to kill
>>>> >> >>> all the nodes and start over.
>>>> >> >>>
>>>> >> >>> On Thu, May 18, 2017 at 11:20 AM, Mark Payne <
>>>> markap14@hotmail.com>
>>>> >> >>> wrote:
>>>> >> >>>>
>>>> >> >>>> Hello,
>>>> >> >>>>
>>>> >> >>>> Just looking through this thread now. I believe that I
>>>> understand the
>>>> >> >>>> problem. I have updated the JIRA with details about what I
>>>> think is
>>>> >> >>>> the
>>>> >> >>>> problem and a potential remedy for the problem.
>>>> >> >>>>
>>>> >> >>>> Thanks
>>>> >> >>>> -Mark
>>>> >> >>>>
>>>> >> >>>> > On May 18, 2017, at 9:49 AM, Matt Gilman <
>>>> matt.c.gilman@gmail.com>
>>>> >> >>>> > wrote:
>>>> >> >>>> >
>>>> >> >>>> > Thanks for the additional details. They will be helpful when
>>>> >> >>>> > working
>>>> >> >>>> > the JIRA. All nodes, including the coordinator, heartbeat to
>>>> the
>>>> >> >>>> > active
>>>> >> >>>> > coordinator. This means that the coordinator effectively
>>>> heartbeats
>>>> >> >>>> > to
>>>> >> >>>> > itself. It appears, based on your log messages, that this is
>>>> not
>>>> >> >>>> > happening.
>>>> >> >>>> > Because no heartbeats were received from any node, the lack of
>>>> >> >>>> > heartbeats
>>>> >> >>>> > from the terminated node is not considered.
>>>> >> >>>> >
>>>> >> >>>> > Matt
>>>> >> >>>> >
>>>> >> >>>> > Sent from my iPhone
>>>> >> >>>> >
>>>> >> >>>> >> On May 18, 2017, at 8:30 AM, ddewaele <dd...@gmail.com>
>>>> wrote:
>>>> >> >>>> >>
>>>> >> >>>> >> Found something interesting in the centos-b debug logging....
>>>> >> >>>> >>
>>>> >> >>>> >> After centos-a (the coordinator) is killed, centos-b takes over.
>>>> >> >>>> >> Notice how it "Will not disconnect any nodes due to lack of
>>>> >> >>>> >> heartbeat" and how it still sees centos-a as connected despite
>>>> >> >>>> >> the fact that there are no heartbeats anymore.
>>>> >> >>>> >>
>>>> >> >>>> >> 2017-05-18 12:41:38,010 INFO [Leader Election Notification Thread-2]
>>>> >> >>>> >> o.apache.nifi.controller.FlowController This node elected Active Cluster Coordinator
>>>> >> >>>> >> 2017-05-18 12:41:38,010 DEBUG [Leader Election Notification Thread-2]
>>>> >> >>>> >> o.a.n.c.c.h.ClusterProtocolHeartbeatMonitor Purging old heartbeats
>>>> >> >>>> >> 2017-05-18 12:41:38,014 INFO [Leader Election Notification Thread-1]
>>>> >> >>>> >> o.apache.nifi.controller.FlowController This node has been elected Primary Node
>>>> >> >>>> >> 2017-05-18 12:41:38,353 DEBUG [Heartbeat Monitor Thread-1]
>>>> >> >>>> >> o.a.n.c.c.h.AbstractHeartbeatMonitor Received no new heartbeats. Will not
>>>> >> >>>> >> disconnect any nodes due to lack of heartbeat
>>>> >> >>>> >> 2017-05-18 12:41:41,336 DEBUG [Process Cluster Protocol Request-3]
>>>> >> >>>> >> o.a.n.c.c.h.ClusterProtocolHeartbeatMonitor Received new heartbeat from centos-b:8080
>>>> >> >>>> >> 2017-05-18 12:41:41,337 DEBUG [Process Cluster Protocol Request-3]
>>>> >> >>>> >> o.a.n.c.c.h.ClusterProtocolHeartbeatMonitor
>>>> >> >>>> >>
>>>> >> >>>> >> Calculated diff between current cluster status and node cluster status as follows:
>>>> >> >>>> >> Node: [NodeConnectionStatus[nodeId=centos-b:8080, state=CONNECTED, updateId=45],
>>>> >> >>>> >> NodeConnectionStatus[nodeId=centos-a:8080, state=CONNECTED, updateId=42]]
>>>> >> >>>> >> Self: [NodeConnectionStatus[nodeId=centos-b:8080, state=CONNECTED, updateId=45],
>>>> >> >>>> >> NodeConnectionStatus[nodeId=centos-a:8080, state=CONNECTED, updateId=42]]
>>>> >> >>>> >> Difference: []
>>>> >> >>>> >>
>>>> >> >>>> >> 2017-05-18 12:41:41,337 INFO [Process Cluster Protocol Request-3]
>>>> >> >>>> >> o.a.n.c.p.impl.SocketProtocolListener Finished processing request
>>>> >> >>>> >> 410e7db5-8bb0-4f97-8ee8-fc8647c54959 (type=HEARTBEAT, length=2341 bytes)
>>>> >> >>>> >> from centos-b:8080 in 3 millis
>>>> >> >>>> >> 2017-05-18 12:41:41,339 INFO [Clustering Tasks Thread-2]
>>>> >> >>>> >> o.a.n.c.c.ClusterProtocolHeartbeater Heartbeat created at 2017-05-18
>>>> >> >>>> >> 12:41:41,330 and sent to centos-b:10001 at 2017-05-18 12:41:41,339;
>>>> >> >>>> >> send took 8 millis
>>>> >> >>>> >> 2017-05-18 12:41:43,354 INFO [Heartbeat Monitor Thread-1]
>>>> >> >>>> >> o.a.n.c.c.h.AbstractHeartbeatMonitor Finished processing 1 heartbeats
>>>> >> >>>> >> in 93276 nanos
>>>> >> >>>> >> 2017-05-18 12:41:46,346 DEBUG [Process Cluster Protocol Request-4]
>>>> >> >>>> >> o.a.n.c.c.h.ClusterProtocolHeartbeatMonitor Received new heartbeat from centos-b:8080
>>>> >> >>>> >> 2017-05-18 12:41:46,346 DEBUG [Process Cluster Protocol Request-4]
>>>> >> >>>> >> o.a.n.c.c.h.ClusterProtocolHeartbeatMonitor
>>>> >> >>>> >>
>>>> >> >>>> >> Calculated diff between current cluster status and node cluster status as follows:
>>>> >> >>>> >> Node: [NodeConnectionStatus[nodeId=centos-b:8080, state=CONNECTED, updateId=45],
>>>> >> >>>> >> NodeConnectionStatus[nodeId=centos-a:8080, state=CONNECTED, updateId=42]]
>>>> >> >>>> >> Self: [NodeConnectionStatus[nodeId=centos-b:8080, state=CONNECTED, updateId=45],
>>>> >> >>>> >> NodeConnectionStatus[nodeId=centos-a:8080, state=CONNECTED, updateId=42]]
>>>> >> >>>> >> Difference: []
>>>> >> >>>> >>
>>>> >> >>>> >>
>>>> >> >>>> >>
>>>> >> >>>> >>
>>>> >> >>>>
>>>> >> >>>
>>>> >> >>
>>>> >> >
>>>> >
>>>> >
>>>>
>>>
>>>
>

Re: Nifi Cluster fails to disconnect node when node was killed

Posted by Neil Derraugh <ne...@intellifylearning.com>.
That's the whole problem from my perspective: it stays CONNECTED.  It never
becomes DISCONNECTED.  You can't delete it from the API in 1.2.0.

That's why I said it was a single point of failure.  The exact semantics of
calling it a single point of failure might be debatable, but the fact that
the cluster can't be modified and/or gracefully shut down (afaik) is what I
was referring to.


Re: Nifi Cluster fails to disconnect node when node was killed

Posted by Joe Witt <jo...@gmail.com>.
I believe at the state you describe that down node is now considered
disconnected.  The cluster behavior prohibits you from making changes when
it knows that not all members of the cluster can honor the change.  If you
are sure you want to make the changes anyway and move on without that node
you should be able to remove it/delete it from the cluster.  Now you have a
cluster of two connected nodes and you can make changes.


Re: Nifi Cluster fails to disconnect node when node was killed

Posted by Neil Derraugh <ne...@intellifylearning.com>.
That's fair.  But for the sake of total clarity on my own part, after one
of these disaster scenarios with a newly quorum-elected primary things
cannot be driven through the UI, and at least not through parts of the REST API.

I just ran through the following.  We have 3 nodes A, B, C with A primary,
and A becomes unreachable without first disconnecting.  Then B and C may (I
haven't verified) continue operating the flow they had in the cluster's
last "good" state.  But they do elect a new primary, as per the REST
nifi-api/controller/cluster response.  But now the flow can't be changed,
and in some cases it can't be reported on either, i.e. some GETs fail, like
nifi-api/flow/process-groups/root.

Are we describing the same behavior?


Re: Nifi Cluster fails to disconnect node when node was killed

Posted by Joe Witt <jo...@gmail.com>.
If there is no longer a quorum then we cannot drive things from the UI,
but the remaining cluster is intact from a functioning point of view,
other than being able to assign a primary to handle the one-off items.

On Fri, May 19, 2017 at 11:04 AM, Neil Derraugh
<ne...@intellifylearning.com> wrote:
> Hi Joe,
>
> Maybe I'm missing something, but if the primary node suffers a network
> partition or container/vm/machine loss or becomes otherwise unreachable then
> the cluster is unusable, at least from the UI.
>
> If that's not so please correct me.
>
> Thanks,
> Neil
>
> On Thu, May 18, 2017 at 9:56 PM, Joe Witt <jo...@gmail.com> wrote:
>>
>> Neil,
>>
>> Want to make sure I understand what you're saying.  What are stating
>> is a single point of failure?
>>
>> Thanks
>> Joe
>>
>> On Thu, May 18, 2017 at 5:27 PM, Neil Derraugh
>> <ne...@intellifylearning.com> wrote:
>> > Thanks for the insight Matt.
>> >
>> > It's a disaster recovery issue.  It's not something I plan on doing on
>> > purpose.  It seems it is a single point of failure unfortunately.  I can
>> > see
>> > no other way to resolve the issue other than to blow everything away and
>> > start a new cluster.
>> >
>> > On Thu, May 18, 2017 at 2:49 PM, Matt Gilman <ma...@gmail.com>
>> > wrote:
>> >>
>> >> Neil,
>> >>
>> >> Disconnecting a node prior to removal is the correct process. It
>> >> appears
>> >> that the check was lost going from 0.x to 1.x. Folks reported this JIRA
>> >> [1]
>> >> indicating that deleting a connected node did not work. This process
>> >> does
>> >> not work because the node needs to be disconnected first. The JIRA was
>> >> addressed by restoring the check that a node is disconnected prior to
>> >> deletion.
>> >>
>> >> Hopefully the JIRA I filed earlier today [2] will address the phantom
>> >> node
>> >> you were seeing. Until then, can you update your workaround to
>> >> disconnect
>> >> the node in question prior to deletion?
>> >>
>> >> Thanks
>> >>
>> >> Matt
>> >>
>> >> [1] https://issues.apache.org/jira/browse/NIFI-3295
>> >> [2] https://issues.apache.org/jira/browse/NIFI-3933
>> >>
>> >> On Thu, May 18, 2017 at 12:29 PM, Neil Derraugh
>> >> <ne...@intellifylearning.com> wrote:
>> >>>
>> >>> Pretty sure this is the problem I was describing in the "Phantom Node"
>> >>> thread recently.
>> >>>
>> >>> If I kill non-primary nodes the cluster remains healthy despite the
>> >>> lost
>> >>> nodes.  The terminated nodes end up with a DISCONNECTED status.
>> >>>
>> >>> If I kill the primary it winds up with a CONNECTED status, but a new
>> >>> primary/cluster coordinator gets elected too.
>> >>>
>> >>> Additionally it seems in 1.2.0 that the REST API no longer support
>> >>> deleting a node in a CONNECTED state (Cannot remove Node with ID
>> >>> 1780fde7-c2f4-469c-9884-fe843eac5b73 because it is not disconnected,
>> >>> current
>> >>> state = CONNECTED).  So right now I don't have a workaround and have
>> >>> to kill
>> >>> all the nodes and start over.
>> >>>
>> >>> On Thu, May 18, 2017 at 11:20 AM, Mark Payne <ma...@hotmail.com>
>> >>> wrote:
>> >>>>
>> >>>> Hello,
>> >>>>
>> >>>> Just looking through this thread now. I believe that I understand the
>> >>>> problem. I have updated the JIRA with details about what I think is
>> >>>> the
>> >>>> problem and a potential remedy for the problem.
>> >>>>
>> >>>> Thanks
>> >>>> -Mark
>> >>>>
>> >>>> > On May 18, 2017, at 9:49 AM, Matt Gilman <ma...@gmail.com>
>> >>>> > wrote:
>> >>>> >
>> >>>> > Thanks for the additional details. They will be helpful when
>> >>>> > working
>> >>>> > the JIRA. All nodes, including the coordinator, heartbeat to the
>> >>>> > active
>> >>>> > coordinator. This means that the coordinator effectively heartbeats
>> >>>> > to
>> >>>> > itself. It appears, based on your log messages, that this is not
>> >>>> > happening.
>> >>>> > Because no heartbeats were received from any node, the lack of
>> >>>> > heartbeats
>> >>>> > from the terminated node is not considered.
>> >>>> >
>> >>>> > Matt
>> >>>> >
>> >>>> > Sent from my iPhone
>> >>>> >
>> >>>> >> On May 18, 2017, at 8:30 AM, ddewaele <dd...@gmail.com> wrote:
>> >>>> >>
>> >>>> >> Found something interesting in the centos-b debug logging....
>> >>>> >>
>> >>>> >> after centos-a (the coordinator) is killed centos-b takes over.
>> >>>> >> Notice how
>> >>>> >> it "Will not disconnect any nodes due to lack of heartbeat" and
>> >>>> >> how
>> >>>> >> it still
>> >>>> >> sees centos-a as connected despite the fact that there are no
>> >>>> >> heartbeats
>> >>>> >> anymore.
>> >>>> >>
>> >>>> >> 2017-05-18 12:41:38,010 INFO [Leader Election Notification
>> >>>> >> Thread-2]
>> >>>> >> o.apache.nifi.controller.FlowController This node elected Active
>> >>>> >> Cluster
>> >>>> >> Coordinator
>> >>>> >> 2017-05-18 12:41:38,010 DEBUG [Leader Election Notification
>> >>>> >> Thread-2]
>> >>>> >> o.a.n.c.c.h.ClusterProtocolHeartbeatMonitor Purging old heartbeats
>> >>>> >> 2017-05-18 12:41:38,014 INFO [Leader Election Notification
>> >>>> >> Thread-1]
>> >>>> >> o.apache.nifi.controller.FlowController This node has been elected
>> >>>> >> Primary
>> >>>> >> Node
>> >>>> >> 2017-05-18 12:41:38,353 DEBUG [Heartbeat Monitor Thread-1]
>> >>>> >> o.a.n.c.c.h.AbstractHeartbeatMonitor Received no new heartbeats.
>> >>>> >> Will
>> >>>> >> not
>> >>>> >> disconnect any nodes due to lack of heartbeat
>> >>>> >> 2017-05-18 12:41:41,336 DEBUG [Process Cluster Protocol Request-3]
>> >>>> >> o.a.n.c.c.h.ClusterProtocolHeartbeatMonitor Received new heartbeat
>> >>>> >> from
>> >>>> >> centos-b:8080
>> >>>> >> 2017-05-18 12:41:41,337 DEBUG [Process Cluster Protocol Request-3]
>> >>>> >> o.a.n.c.c.h.ClusterProtocolHeartbeatMonitor
>> >>>> >>
>> >>>> >> Calculated diff between current cluster status and node cluster
>> >>>> >> status as
>> >>>> >> follows:
>> >>>> >> Node: [NodeConnectionStatus[nodeId=centos-b:8080, state=CONNECTED,
>> >>>> >> updateId=45], NodeConnectionStatus[nodeId=centos-a:8080,
>> >>>> >> state=CONNECTED,
>> >>>> >> updateId=42]]
>> >>>> >> Self: [NodeConnectionStatus[nodeId=centos-b:8080, state=CONNECTED,
>> >>>> >> updateId=45], NodeConnectionStatus[nodeId=centos-a:8080,
>> >>>> >> state=CONNECTED,
>> >>>> >> updateId=42]]
>> >>>> >> Difference: []
>> >>>> >>
>> >>>> >>
>> >>>> >> 2017-05-18 12:41:41,337 INFO [Process Cluster Protocol Request-3]
>> >>>> >> o.a.n.c.p.impl.SocketProtocolListener Finished processing request
>> >>>> >> 410e7db5-8bb0-4f97-8ee8-fc8647c54959 (type=HEARTBEAT, length=2341
>> >>>> >> bytes)
>> >>>> >> from centos-b:8080 in 3 millis
>> >>>> >> 2017-05-18 12:41:41,339 INFO [Clustering Tasks Thread-2]
>> >>>> >> o.a.n.c.c.ClusterProtocolHeartbeater Heartbeat created at
>> >>>> >> 2017-05-18
>> >>>> >> 12:41:41,330 and sent to centos-b:10001 at 2017-05-18
>> >>>> >> 12:41:41,339;
>> >>>> >> send
>> >>>> >> took 8 millis
>> >>>> >> 2017-05-18 12:41:43,354 INFO [Heartbeat Monitor Thread-1]
>> >>>> >> o.a.n.c.c.h.AbstractHeartbeatMonitor Finished processing 1
>> >>>> >> heartbeats
>> >>>> >> in
>> >>>> >> 93276 nanos
>> >>>> >> 2017-05-18 12:41:46,346 DEBUG [Process Cluster Protocol Request-4]
>> >>>> >> o.a.n.c.c.h.ClusterProtocolHeartbeatMonitor Received new heartbeat
>> >>>> >> from
>> >>>> >> centos-b:8080
>> >>>> >> 2017-05-18 12:41:46,346 DEBUG [Process Cluster Protocol Request-4]
>> >>>> >> o.a.n.c.c.h.ClusterProtocolHeartbeatMonitor
>> >>>> >>
>> >>>> >> Calculated diff between current cluster status and node cluster
>> >>>> >> status as
>> >>>> >> follows:
>> >>>> >> Node: [NodeConnectionStatus[nodeId=centos-b:8080, state=CONNECTED,
>> >>>> >> updateId=45], NodeConnectionStatus[nodeId=centos-a:8080,
>> >>>> >> state=CONNECTED,
>> >>>> >> updateId=42]]
>> >>>> >> Self: [NodeConnectionStatus[nodeId=centos-b:8080, state=CONNECTED,
>> >>>> >> updateId=45], NodeConnectionStatus[nodeId=centos-a:8080,
>> >>>> >> state=CONNECTED,
>> >>>> >> updateId=42]]
>> >>>> >> Difference: []
>> >>>> >>
>> >>>> >>
>> >>>> >>
>> >>>> >>
>> >>>> >> --
>> >>>> >> View this message in context:
>> >>>> >>
>> >>>> >> http://apache-nifi-users-list.2361937.n4.nabble.com/Nifi-Cluster-fails-to-disconnect-node-when-node-was-killed-tp1942p1950.html
>> >>>> >> Sent from the Apache NiFi Users List mailing list archive at
>> >>>> >> Nabble.com.
>> >>>>
>> >>>
>> >>
>> >
>
>

Re: Nifi Cluster fails to disconnect node when node was killed

Posted by Neil Derraugh <ne...@intellifylearning.com>.
Hi Joe,

Maybe I'm missing something, but if the primary node suffers a network
partition or container/vm/machine loss or becomes otherwise unreachable
then the cluster is unusable, at least from the UI.

If that's not so please correct me.

Thanks,
Neil

On Thu, May 18, 2017 at 9:56 PM, Joe Witt <jo...@gmail.com> wrote:

> Neil,
>
> Want to make sure I understand what you're saying.  What are you stating
> is a single point of failure?
>
> Thanks
> Joe
>
> On Thu, May 18, 2017 at 5:27 PM, Neil Derraugh
> <ne...@intellifylearning.com> wrote:
> > Thanks for the insight Matt.
> >
> > It's a disaster recovery issue.  It's not something I plan on doing on
> > purpose.  It seems it is a single point of failure unfortunately.  I can
> see
> > no other way to resolve the issue other than to blow everything away and
> > start a new cluster.
> >
> > On Thu, May 18, 2017 at 2:49 PM, Matt Gilman <ma...@gmail.com>
> > wrote:
> >>
> >> Neil,
> >>
> >> Disconnecting a node prior to removal is the correct process. It appears
> >> that the check was lost going from 0.x to 1.x. Folks reported this JIRA
> [1]
> >> indicating that deleting a connected node did not work. This process
> does
> >> not work because the node needs to be disconnected first. The JIRA was
> >> addressed by restoring the check that a node is disconnected prior to
> >> deletion.
> >>
> >> Hopefully the JIRA I filed earlier today [2] will address the phantom
> node
> >> you were seeing. Until then, can you update your workaround to
> disconnect
> >> the node in question prior to deletion?
> >>
> >> Thanks
> >>
> >> Matt
> >>
> >> [1] https://issues.apache.org/jira/browse/NIFI-3295
> >> [2] https://issues.apache.org/jira/browse/NIFI-3933
> >>
> >> On Thu, May 18, 2017 at 12:29 PM, Neil Derraugh
> >> <ne...@intellifylearning.com> wrote:
> >>>
> >>> Pretty sure this is the problem I was describing in the "Phantom Node"
> >>> thread recently.
> >>>
> >>> If I kill non-primary nodes the cluster remains healthy despite the
> lost
> >>> nodes.  The terminated nodes end up with a DISCONNECTED status.
> >>>
> >>> If I kill the primary it winds up with a CONNECTED status, but a new
> >>> primary/cluster coordinator gets elected too.
> >>>
> >>> Additionally it seems in 1.2.0 that the REST API no longer supports
> >>> deleting a node in a CONNECTED state (Cannot remove Node with ID
> >>> 1780fde7-c2f4-469c-9884-fe843eac5b73 because it is not disconnected,
> current
> >>> state = CONNECTED).  So right now I don't have a workaround and have
> to kill
> >>> all the nodes and start over.
> >>>
> >>> On Thu, May 18, 2017 at 11:20 AM, Mark Payne <ma...@hotmail.com>
> >>> wrote:
> >>>>
> >>>> Hello,
> >>>>
> >>>> Just looking through this thread now. I believe that I understand the
> >>>> problem. I have updated the JIRA with details about what I think is
> the
> >>>> problem and a potential remedy for the problem.
> >>>>
> >>>> Thanks
> >>>> -Mark
> >>>>
> >>>> > On May 18, 2017, at 9:49 AM, Matt Gilman <ma...@gmail.com>
> >>>> > wrote:
> >>>> >
> >>>> > Thanks for the additional details. They will be helpful when working
> >>>> > the JIRA. All nodes, including the coordinator, heartbeat to the
> active
> >>>> > coordinator. This means that the coordinator effectively heartbeats
> to
> >>>> > itself. It appears, based on your log messages, that this is not
> happening.
> >>>> > Because no heartbeats were received from any node, the lack of
> heartbeats
> >>>> > from the terminated node is not considered.
> >>>> >
> >>>> > Matt
> >>>> >
> >>>> > Sent from my iPhone
> >>>> >
> >>>> >> On May 18, 2017, at 8:30 AM, ddewaele <dd...@gmail.com> wrote:
> >>>> >>
> >>>> >> Found something interesting in the centos-b debug logging....
> >>>> >>
> >>>> >> after centos-a (the coordinator) is killed centos-b takes over.
> >>>> >> Notice how
> >>>> >> it "Will not disconnect any nodes due to lack of heartbeat" and how
> >>>> >> it still
> >>>> >> sees centos-a as connected despite the fact that there are no
> >>>> >> heartbeats
> >>>> >> anymore.
> >>>> >>
> >>>> >> 2017-05-18 12:41:38,010 INFO [Leader Election Notification
> Thread-2]
> >>>> >> o.apache.nifi.controller.FlowController This node elected Active
> >>>> >> Cluster
> >>>> >> Coordinator
> >>>> >> 2017-05-18 12:41:38,010 DEBUG [Leader Election Notification
> Thread-2]
> >>>> >> o.a.n.c.c.h.ClusterProtocolHeartbeatMonitor Purging old heartbeats
> >>>> >> 2017-05-18 12:41:38,014 INFO [Leader Election Notification
> Thread-1]
> >>>> >> o.apache.nifi.controller.FlowController This node has been elected
> >>>> >> Primary
> >>>> >> Node
> >>>> >> 2017-05-18 12:41:38,353 DEBUG [Heartbeat Monitor Thread-1]
> >>>> >> o.a.n.c.c.h.AbstractHeartbeatMonitor Received no new heartbeats.
> Will
> >>>> >> not
> >>>> >> disconnect any nodes due to lack of heartbeat
> >>>> >> 2017-05-18 12:41:41,336 DEBUG [Process Cluster Protocol Request-3]
> >>>> >> o.a.n.c.c.h.ClusterProtocolHeartbeatMonitor Received new heartbeat
> >>>> >> from
> >>>> >> centos-b:8080
> >>>> >> 2017-05-18 12:41:41,337 DEBUG [Process Cluster Protocol Request-3]
> >>>> >> o.a.n.c.c.h.ClusterProtocolHeartbeatMonitor
> >>>> >>
> >>>> >> Calculated diff between current cluster status and node cluster
> >>>> >> status as
> >>>> >> follows:
> >>>> >> Node: [NodeConnectionStatus[nodeId=centos-b:8080, state=CONNECTED,
> >>>> >> updateId=45], NodeConnectionStatus[nodeId=centos-a:8080,
> >>>> >> state=CONNECTED,
> >>>> >> updateId=42]]
> >>>> >> Self: [NodeConnectionStatus[nodeId=centos-b:8080, state=CONNECTED,
> >>>> >> updateId=45], NodeConnectionStatus[nodeId=centos-a:8080,
> >>>> >> state=CONNECTED,
> >>>> >> updateId=42]]
> >>>> >> Difference: []
> >>>> >>
> >>>> >>
> >>>> >> 2017-05-18 12:41:41,337 INFO [Process Cluster Protocol Request-3]
> >>>> >> o.a.n.c.p.impl.SocketProtocolListener Finished processing request
> >>>> >> 410e7db5-8bb0-4f97-8ee8-fc8647c54959 (type=HEARTBEAT, length=2341
> >>>> >> bytes)
> >>>> >> from centos-b:8080 in 3 millis
> >>>> >> 2017-05-18 12:41:41,339 INFO [Clustering Tasks Thread-2]
> >>>> >> o.a.n.c.c.ClusterProtocolHeartbeater Heartbeat created at
> 2017-05-18
> >>>> >> 12:41:41,330 and sent to centos-b:10001 at 2017-05-18 12:41:41,339;
> >>>> >> send
> >>>> >> took 8 millis
> >>>> >> 2017-05-18 12:41:43,354 INFO [Heartbeat Monitor Thread-1]
> >>>> >> o.a.n.c.c.h.AbstractHeartbeatMonitor Finished processing 1
> heartbeats
> >>>> >> in
> >>>> >> 93276 nanos
> >>>> >> 2017-05-18 12:41:46,346 DEBUG [Process Cluster Protocol Request-4]
> >>>> >> o.a.n.c.c.h.ClusterProtocolHeartbeatMonitor Received new heartbeat
> >>>> >> from
> >>>> >> centos-b:8080
> >>>> >> 2017-05-18 12:41:46,346 DEBUG [Process Cluster Protocol Request-4]
> >>>> >> o.a.n.c.c.h.ClusterProtocolHeartbeatMonitor
> >>>> >>
> >>>> >> Calculated diff between current cluster status and node cluster
> >>>> >> status as
> >>>> >> follows:
> >>>> >> Node: [NodeConnectionStatus[nodeId=centos-b:8080, state=CONNECTED,
> >>>> >> updateId=45], NodeConnectionStatus[nodeId=centos-a:8080,
> >>>> >> state=CONNECTED,
> >>>> >> updateId=42]]
> >>>> >> Self: [NodeConnectionStatus[nodeId=centos-b:8080, state=CONNECTED,
> >>>> >> updateId=45], NodeConnectionStatus[nodeId=centos-a:8080,
> >>>> >> state=CONNECTED,
> >>>> >> updateId=42]]
> >>>> >> Difference: []
> >>>> >>
> >>>> >>
> >>>> >>
> >>>> >>
> >>>> >> --
> >>>> >> View this message in context:
> >>>> >> http://apache-nifi-users-list.2361937.n4.nabble.com/Nifi-Cluster-fails-to-disconnect-node-when-node-was-killed-tp1942p1950.html
> >>>> >> Sent from the Apache NiFi Users List mailing list archive at
> >>>> >> Nabble.com.
> >>>>
> >>>
> >>
> >
>

Re: Nifi Cluster fails to disconnect node when node was killed

Posted by Joe Witt <jo...@gmail.com>.
Neil,

Want to make sure I understand what you're saying.  What are you stating
is a single point of failure?

Thanks
Joe

On Thu, May 18, 2017 at 5:27 PM, Neil Derraugh
<ne...@intellifylearning.com> wrote:
> Thanks for the insight Matt.
>
> It's a disaster recovery issue.  It's not something I plan on doing on
> purpose.  It seems it is a single point of failure unfortunately.  I can see
> no other way to resolve the issue other than to blow everything away and
> start a new cluster.
>
> On Thu, May 18, 2017 at 2:49 PM, Matt Gilman <ma...@gmail.com>
> wrote:
>>
>> Neil,
>>
>> Disconnecting a node prior to removal is the correct process. It appears
>> that the check was lost going from 0.x to 1.x. Folks reported this JIRA [1]
>> indicating that deleting a connected node did not work. This process does
>> not work because the node needs to be disconnected first. The JIRA was
>> addressed by restoring the check that a node is disconnected prior to
>> deletion.
>>
>> Hopefully the JIRA I filed earlier today [2] will address the phantom node
>> you were seeing. Until then, can you update your workaround to disconnect
>> the node in question prior to deletion?
>>
>> Thanks
>>
>> Matt
>>
>> [1] https://issues.apache.org/jira/browse/NIFI-3295
>> [2] https://issues.apache.org/jira/browse/NIFI-3933
>>
>> On Thu, May 18, 2017 at 12:29 PM, Neil Derraugh
>> <ne...@intellifylearning.com> wrote:
>>>
>>> Pretty sure this is the problem I was describing in the "Phantom Node"
>>> thread recently.
>>>
>>> If I kill non-primary nodes the cluster remains healthy despite the lost
>>> nodes.  The terminated nodes end up with a DISCONNECTED status.
>>>
>>> If I kill the primary it winds up with a CONNECTED status, but a new
>>> primary/cluster coordinator gets elected too.
>>>
>>> Additionally it seems in 1.2.0 that the REST API no longer supports
>>> deleting a node in a CONNECTED state (Cannot remove Node with ID
>>> 1780fde7-c2f4-469c-9884-fe843eac5b73 because it is not disconnected, current
>>> state = CONNECTED).  So right now I don't have a workaround and have to kill
>>> all the nodes and start over.
>>>
>>> On Thu, May 18, 2017 at 11:20 AM, Mark Payne <ma...@hotmail.com>
>>> wrote:
>>>>
>>>> Hello,
>>>>
>>>> Just looking through this thread now. I believe that I understand the
>>>> problem. I have updated the JIRA with details about what I think is the
>>>> problem and a potential remedy for the problem.
>>>>
>>>> Thanks
>>>> -Mark
>>>>
>>>> > On May 18, 2017, at 9:49 AM, Matt Gilman <ma...@gmail.com>
>>>> > wrote:
>>>> >
>>>> > Thanks for the additional details. They will be helpful when working
>>>> > the JIRA. All nodes, including the coordinator, heartbeat to the active
>>>> > coordinator. This means that the coordinator effectively heartbeats to
>>>> > itself. It appears, based on your log messages, that this is not happening.
>>>> > Because no heartbeats were received from any node, the lack of heartbeats
>>>> > from the terminated node is not considered.
>>>> >
>>>> > Matt
>>>> >
>>>> > Sent from my iPhone
>>>> >
>>>> >> On May 18, 2017, at 8:30 AM, ddewaele <dd...@gmail.com> wrote:
>>>> >>
>>>> >> Found something interesting in the centos-b debug logging....
>>>> >>
>>>> >> after centos-a (the coordinator) is killed centos-b takes over.
>>>> >> Notice how
>>>> >> it "Will not disconnect any nodes due to lack of heartbeat" and how
>>>> >> it still
>>>> >> sees centos-a as connected despite the fact that there are no
>>>> >> heartbeats
>>>> >> anymore.
>>>> >>
>>>> >> 2017-05-18 12:41:38,010 INFO [Leader Election Notification Thread-2]
>>>> >> o.apache.nifi.controller.FlowController This node elected Active
>>>> >> Cluster
>>>> >> Coordinator
>>>> >> 2017-05-18 12:41:38,010 DEBUG [Leader Election Notification Thread-2]
>>>> >> o.a.n.c.c.h.ClusterProtocolHeartbeatMonitor Purging old heartbeats
>>>> >> 2017-05-18 12:41:38,014 INFO [Leader Election Notification Thread-1]
>>>> >> o.apache.nifi.controller.FlowController This node has been elected
>>>> >> Primary
>>>> >> Node
>>>> >> 2017-05-18 12:41:38,353 DEBUG [Heartbeat Monitor Thread-1]
>>>> >> o.a.n.c.c.h.AbstractHeartbeatMonitor Received no new heartbeats. Will
>>>> >> not
>>>> >> disconnect any nodes due to lack of heartbeat
>>>> >> 2017-05-18 12:41:41,336 DEBUG [Process Cluster Protocol Request-3]
>>>> >> o.a.n.c.c.h.ClusterProtocolHeartbeatMonitor Received new heartbeat
>>>> >> from
>>>> >> centos-b:8080
>>>> >> 2017-05-18 12:41:41,337 DEBUG [Process Cluster Protocol Request-3]
>>>> >> o.a.n.c.c.h.ClusterProtocolHeartbeatMonitor
>>>> >>
>>>> >> Calculated diff between current cluster status and node cluster
>>>> >> status as
>>>> >> follows:
>>>> >> Node: [NodeConnectionStatus[nodeId=centos-b:8080, state=CONNECTED,
>>>> >> updateId=45], NodeConnectionStatus[nodeId=centos-a:8080,
>>>> >> state=CONNECTED,
>>>> >> updateId=42]]
>>>> >> Self: [NodeConnectionStatus[nodeId=centos-b:8080, state=CONNECTED,
>>>> >> updateId=45], NodeConnectionStatus[nodeId=centos-a:8080,
>>>> >> state=CONNECTED,
>>>> >> updateId=42]]
>>>> >> Difference: []
>>>> >>
>>>> >>
>>>> >> 2017-05-18 12:41:41,337 INFO [Process Cluster Protocol Request-3]
>>>> >> o.a.n.c.p.impl.SocketProtocolListener Finished processing request
>>>> >> 410e7db5-8bb0-4f97-8ee8-fc8647c54959 (type=HEARTBEAT, length=2341
>>>> >> bytes)
>>>> >> from centos-b:8080 in 3 millis
>>>> >> 2017-05-18 12:41:41,339 INFO [Clustering Tasks Thread-2]
>>>> >> o.a.n.c.c.ClusterProtocolHeartbeater Heartbeat created at 2017-05-18
>>>> >> 12:41:41,330 and sent to centos-b:10001 at 2017-05-18 12:41:41,339;
>>>> >> send
>>>> >> took 8 millis
>>>> >> 2017-05-18 12:41:43,354 INFO [Heartbeat Monitor Thread-1]
>>>> >> o.a.n.c.c.h.AbstractHeartbeatMonitor Finished processing 1 heartbeats
>>>> >> in
>>>> >> 93276 nanos
>>>> >> 2017-05-18 12:41:46,346 DEBUG [Process Cluster Protocol Request-4]
>>>> >> o.a.n.c.c.h.ClusterProtocolHeartbeatMonitor Received new heartbeat
>>>> >> from
>>>> >> centos-b:8080
>>>> >> 2017-05-18 12:41:46,346 DEBUG [Process Cluster Protocol Request-4]
>>>> >> o.a.n.c.c.h.ClusterProtocolHeartbeatMonitor
>>>> >>
>>>> >> Calculated diff between current cluster status and node cluster
>>>> >> status as
>>>> >> follows:
>>>> >> Node: [NodeConnectionStatus[nodeId=centos-b:8080, state=CONNECTED,
>>>> >> updateId=45], NodeConnectionStatus[nodeId=centos-a:8080,
>>>> >> state=CONNECTED,
>>>> >> updateId=42]]
>>>> >> Self: [NodeConnectionStatus[nodeId=centos-b:8080, state=CONNECTED,
>>>> >> updateId=45], NodeConnectionStatus[nodeId=centos-a:8080,
>>>> >> state=CONNECTED,
>>>> >> updateId=42]]
>>>> >> Difference: []
>>>> >>
>>>> >>
>>>> >>
>>>> >>
>>>> >> --
>>>> >> View this message in context:
>>>> >> http://apache-nifi-users-list.2361937.n4.nabble.com/Nifi-Cluster-fails-to-disconnect-node-when-node-was-killed-tp1942p1950.html
>>>> >> Sent from the Apache NiFi Users List mailing list archive at
>>>> >> Nabble.com.
>>>>
>>>
>>
>

Re: Nifi Cluster fails to disconnect node when node was killed

Posted by Neil Derraugh <ne...@intellifylearning.com>.
Thanks for the insight Matt.

It's a disaster recovery issue.  It's not something I plan on doing on
purpose.  It seems it is a single point of failure unfortunately.  I can
see no other way to resolve the issue other than to blow everything away
and start a new cluster.

On Thu, May 18, 2017 at 2:49 PM, Matt Gilman <ma...@gmail.com>
wrote:

> Neil,
>
> Disconnecting a node prior to removal is the correct process. It appears
> that the check was lost going from 0.x to 1.x. Folks reported this JIRA [1]
> indicating that deleting a connected node did not work. This process does
> not work because the node needs to be disconnected first. The JIRA was
> addressed by restoring the check that a node is disconnected prior to
> deletion.
>
> Hopefully the JIRA I filed earlier today [2] will address the phantom node
> you were seeing. Until then, can you update your workaround to disconnect
> the node in question prior to deletion?
>
> Thanks
>
> Matt
>
> [1] https://issues.apache.org/jira/browse/NIFI-3295
> [2] https://issues.apache.org/jira/browse/NIFI-3933
>
> On Thu, May 18, 2017 at 12:29 PM, Neil Derraugh <neil.derraugh@intellifylearning.com> wrote:
>
>> Pretty sure this is the problem I was describing in the "Phantom Node"
>> thread recently.
>>
>> If I kill non-primary nodes the cluster remains healthy despite the lost
>> nodes.  The terminated nodes end up with a DISCONNECTED status.
>>
>> If I kill the primary it winds up with a CONNECTED status, but a new
>> primary/cluster coordinator gets elected too.
>>
>> Additionally it seems in 1.2.0 that the REST API no longer supports
>> deleting a node in a CONNECTED state (Cannot remove Node with ID
>> 1780fde7-c2f4-469c-9884-fe843eac5b73 because it is not disconnected,
>> current state = CONNECTED).  So right now I don't have a workaround and
>> have to kill all the nodes and start over.
>>
>> On Thu, May 18, 2017 at 11:20 AM, Mark Payne <ma...@hotmail.com>
>> wrote:
>>
>>> Hello,
>>>
>>> Just looking through this thread now. I believe that I understand the
>>> problem. I have updated the JIRA with details about what I think is the
>>> problem and a potential remedy for the problem.
>>>
>>> Thanks
>>> -Mark
>>>
>>> > On May 18, 2017, at 9:49 AM, Matt Gilman <ma...@gmail.com>
>>> wrote:
>>> >
>>> > Thanks for the additional details. They will be helpful when working
>>> the JIRA. All nodes, including the coordinator, heartbeat to the active
>>> coordinator. This means that the coordinator effectively heartbeats to
>>> itself. It appears, based on your log messages, that this is not happening.
>>> Because no heartbeats were received from any node, the lack of heartbeats
>>> from the terminated node is not considered.
>>> >
>>> > Matt
>>> >
>>> > Sent from my iPhone
>>> >
>>> >> On May 18, 2017, at 8:30 AM, ddewaele <dd...@gmail.com> wrote:
>>> >>
>>> >> Found something interesting in the centos-b debug logging....
>>> >>
>>> >> after centos-a (the coordinator) is killed centos-b takes over.
>>> Notice how
>>> >> it "Will not disconnect any nodes due to lack of heartbeat" and how
>>> it still
>>> >> sees centos-a as connected despite the fact that there are no
>>> heartbeats
>>> >> anymore.
>>> >>
>>> >> 2017-05-18 12:41:38,010 INFO [Leader Election Notification Thread-2]
>>> >> o.apache.nifi.controller.FlowController This node elected Active
>>> Cluster
>>> >> Coordinator
>>> >> 2017-05-18 12:41:38,010 DEBUG [Leader Election Notification Thread-2]
>>> >> o.a.n.c.c.h.ClusterProtocolHeartbeatMonitor Purging old heartbeats
>>> >> 2017-05-18 12:41:38,014 INFO [Leader Election Notification Thread-1]
>>> >> o.apache.nifi.controller.FlowController This node has been elected
>>> Primary
>>> >> Node
>>> >> 2017-05-18 12:41:38,353 DEBUG [Heartbeat Monitor Thread-1]
>>> >> o.a.n.c.c.h.AbstractHeartbeatMonitor Received no new heartbeats.
>>> Will not
>>> >> disconnect any nodes due to lack of heartbeat
>>> >> 2017-05-18 12:41:41,336 DEBUG [Process Cluster Protocol Request-3]
>>> >> o.a.n.c.c.h.ClusterProtocolHeartbeatMonitor Received new heartbeat
>>> from
>>> >> centos-b:8080
>>> >> 2017-05-18 12:41:41,337 DEBUG [Process Cluster Protocol Request-3]
>>> >> o.a.n.c.c.h.ClusterProtocolHeartbeatMonitor
>>> >>
>>> >> Calculated diff between current cluster status and node cluster
>>> status as
>>> >> follows:
>>> >> Node: [NodeConnectionStatus[nodeId=centos-b:8080, state=CONNECTED,
>>> >> updateId=45], NodeConnectionStatus[nodeId=centos-a:8080,
>>> state=CONNECTED,
>>> >> updateId=42]]
>>> >> Self: [NodeConnectionStatus[nodeId=centos-b:8080, state=CONNECTED,
>>> >> updateId=45], NodeConnectionStatus[nodeId=centos-a:8080,
>>> state=CONNECTED,
>>> >> updateId=42]]
>>> >> Difference: []
>>> >>
>>> >>
>>> >> 2017-05-18 12:41:41,337 INFO [Process Cluster Protocol Request-3]
>>> >> o.a.n.c.p.impl.SocketProtocolListener Finished processing request
>>> >> 410e7db5-8bb0-4f97-8ee8-fc8647c54959 (type=HEARTBEAT, length=2341
>>> bytes)
>>> >> from centos-b:8080 in 3 millis
>>> >> 2017-05-18 12:41:41,339 INFO [Clustering Tasks Thread-2]
>>> >> o.a.n.c.c.ClusterProtocolHeartbeater Heartbeat created at 2017-05-18
>>> >> 12:41:41,330 and sent to centos-b:10001 at 2017-05-18 12:41:41,339;
>>> send
>>> >> took 8 millis
>>> >> 2017-05-18 12:41:43,354 INFO [Heartbeat Monitor Thread-1]
>>> >> o.a.n.c.c.h.AbstractHeartbeatMonitor Finished processing 1
>>> heartbeats in
>>> >> 93276 nanos
>>> >> 2017-05-18 12:41:46,346 DEBUG [Process Cluster Protocol Request-4]
>>> >> o.a.n.c.c.h.ClusterProtocolHeartbeatMonitor Received new heartbeat
>>> from
>>> >> centos-b:8080
>>> >> 2017-05-18 12:41:46,346 DEBUG [Process Cluster Protocol Request-4]
>>> >> o.a.n.c.c.h.ClusterProtocolHeartbeatMonitor
>>> >>
>>> >> Calculated diff between current cluster status and node cluster
>>> status as
>>> >> follows:
>>> >> Node: [NodeConnectionStatus[nodeId=centos-b:8080, state=CONNECTED,
>>> >> updateId=45], NodeConnectionStatus[nodeId=centos-a:8080,
>>> state=CONNECTED,
>>> >> updateId=42]]
>>> >> Self: [NodeConnectionStatus[nodeId=centos-b:8080, state=CONNECTED,
>>> >> updateId=45], NodeConnectionStatus[nodeId=centos-a:8080,
>>> state=CONNECTED,
>>> >> updateId=42]]
>>> >> Difference: []
>>> >>
>>> >>
>>> >>
>>> >>
>>> >> --
>>> >> View this message in context: http://apache-nifi-users-list.2361937.n4.nabble.com/Nifi-Cluster-fails-to-disconnect-node-when-node-was-killed-tp1942p1950.html
>>> >> Sent from the Apache NiFi Users List mailing list archive at
>>> Nabble.com.
>>>
>>>
>>
>

Re: Nifi Cluster fails to disconnect node when node was killed

Posted by Matt Gilman <ma...@gmail.com>.
Neil,

Disconnecting a node prior to removal is the correct process. It appears
that the check was lost going from 0.x to 1.x. Folks reported this JIRA [1]
indicating that deleting a connected node did not work. This process does
not work because the node needs to be disconnected first. The JIRA was
addressed by restoring the check that a node is disconnected prior to
deletion.

Hopefully the JIRA I filed earlier today [2] will address the phantom node
you were seeing. Until then, can you update your workaround to disconnect
the node in question prior to deletion?
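
For anyone scripting that workaround, the two-step sequence could look
roughly like the sketch below. Treat it as an illustration only: the
endpoint path and payload shape are from memory and should be verified
against the /nifi-api documentation for your version, and the node UUID
is simply the one reported earlier in this thread.

    import java.io.OutputStream;
    import java.net.HttpURLConnection;
    import java.net.URL;
    import java.nio.charset.StandardCharsets;

    public class RemoveDeadNode {
        public static void main(String[] args) throws Exception {
            // Node UUID as reported by GET /nifi-api/controller/cluster
            String nodeId = "b89e8418-4b7f-4743-bdf4-4a08a92f3892";
            String url = "http://centos-b:8080/nifi-api/controller/cluster/nodes/" + nodeId;

            // Step 1: ask the coordinator to disconnect the node.
            HttpURLConnection put = (HttpURLConnection) new URL(url).openConnection();
            put.setRequestMethod("PUT");
            put.setRequestProperty("Content-Type", "application/json");
            put.setDoOutput(true);
            String body = "{\"node\":{\"nodeId\":\"" + nodeId
                    + "\",\"status\":\"DISCONNECTING\"}}";
            try (OutputStream os = put.getOutputStream()) {
                os.write(body.getBytes(StandardCharsets.UTF_8));
            }
            System.out.println("PUT returned " + put.getResponseCode());

            // Step 2: once the node reports DISCONNECTED, delete it.
            HttpURLConnection del = (HttpURLConnection) new URL(url).openConnection();
            del.setRequestMethod("DELETE");
            System.out.println("DELETE returned " + del.getResponseCode());
        }
    }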

Thanks

Matt

[1] https://issues.apache.org/jira/browse/NIFI-3295
[2] https://issues.apache.org/jira/browse/NIFI-3933

On Thu, May 18, 2017 at 12:29 PM, Neil Derraugh <neil.derraugh@intellifylearning.com> wrote:

> Pretty sure this is the problem I was describing in the "Phantom Node"
> thread recently.
>
> If I kill non-primary nodes the cluster remains healthy despite the lost
> nodes.  The terminated nodes end up with a DISCONNECTED status.
>
> If I kill the primary it winds up with a CONNECTED status, but a new
> primary/cluster coordinator gets elected too.
>
> Additionally it seems in 1.2.0 that the REST API no longer supports
> deleting a node in a CONNECTED state (Cannot remove Node with ID
> 1780fde7-c2f4-469c-9884-fe843eac5b73 because it is not disconnected,
> current state = CONNECTED).  So right now I don't have a workaround and
> have to kill all the nodes and start over.
>
> On Thu, May 18, 2017 at 11:20 AM, Mark Payne <ma...@hotmail.com> wrote:
>
>> Hello,
>>
>> Just looking through this thread now. I believe that I understand the
>> problem. I have updated the JIRA with details about what I think is the
>> problem and a potential remedy for the problem.
>>
>> Thanks
>> -Mark
>>
>> > On May 18, 2017, at 9:49 AM, Matt Gilman <ma...@gmail.com>
>> wrote:
>> >
>> > Thanks for the additional details. They will be helpful when working
>> the JIRA. All nodes, including the coordinator, heartbeat to the active
>> coordinator. This means that the coordinator effectively heartbeats to
>> itself. It appears, based on your log messages, that this is not happening.
>> Because no heartbeats were received from any node, the lack of heartbeats
>> from the terminated node is not considered.
>> >
>> > Matt
>> >
>> > Sent from my iPhone
>> >
>> >> On May 18, 2017, at 8:30 AM, ddewaele <dd...@gmail.com> wrote:
>> >>
>> >> Found something interesting in the centos-b debug logging....
>> >>
>> >> after centos-a (the coordinator) is killed centos-b takes over. Notice
>> how
>> >> it "Will not disconnect any nodes due to lack of heartbeat" and how it
>> still
>> >> sees centos-a as connected despite the fact that there are no
>> heartbeats
>> >> anymore.
>> >>
>> >> 2017-05-18 12:41:38,010 INFO [Leader Election Notification Thread-2]
>> >> o.apache.nifi.controller.FlowController This node elected Active
>> Cluster
>> >> Coordinator
>> >> 2017-05-18 12:41:38,010 DEBUG [Leader Election Notification Thread-2]
>> >> o.a.n.c.c.h.ClusterProtocolHeartbeatMonitor Purging old heartbeats
>> >> 2017-05-18 12:41:38,014 INFO [Leader Election Notification Thread-1]
>> >> o.apache.nifi.controller.FlowController This node has been elected
>> Primary
>> >> Node
>> >> 2017-05-18 12:41:38,353 DEBUG [Heartbeat Monitor Thread-1]
>> >> o.a.n.c.c.h.AbstractHeartbeatMonitor Received no new heartbeats. Will
>> not
>> >> disconnect any nodes due to lack of heartbeat
>> >> 2017-05-18 12:41:41,336 DEBUG [Process Cluster Protocol Request-3]
>> >> o.a.n.c.c.h.ClusterProtocolHeartbeatMonitor Received new heartbeat
>> from
>> >> centos-b:8080
>> >> 2017-05-18 12:41:41,337 DEBUG [Process Cluster Protocol Request-3]
>> >> o.a.n.c.c.h.ClusterProtocolHeartbeatMonitor
>> >>
>> >> Calculated diff between current cluster status and node cluster status
>> as
>> >> follows:
>> >> Node: [NodeConnectionStatus[nodeId=centos-b:8080, state=CONNECTED,
>> >> updateId=45], NodeConnectionStatus[nodeId=centos-a:8080,
>> state=CONNECTED,
>> >> updateId=42]]
>> >> Self: [NodeConnectionStatus[nodeId=centos-b:8080, state=CONNECTED,
>> >> updateId=45], NodeConnectionStatus[nodeId=centos-a:8080,
>> state=CONNECTED,
>> >> updateId=42]]
>> >> Difference: []
>> >>
>> >>
>> >> 2017-05-18 12:41:41,337 INFO [Process Cluster Protocol Request-3]
>> >> o.a.n.c.p.impl.SocketProtocolListener Finished processing request
>> >> 410e7db5-8bb0-4f97-8ee8-fc8647c54959 (type=HEARTBEAT, length=2341
>> bytes)
>> >> from centos-b:8080 in 3 millis
>> >> 2017-05-18 12:41:41,339 INFO [Clustering Tasks Thread-2]
>> >> o.a.n.c.c.ClusterProtocolHeartbeater Heartbeat created at 2017-05-18
>> >> 12:41:41,330 and sent to centos-b:10001 at 2017-05-18 12:41:41,339;
>> send
>> >> took 8 millis
>> >> 2017-05-18 12:41:43,354 INFO [Heartbeat Monitor Thread-1]
>> >> o.a.n.c.c.h.AbstractHeartbeatMonitor Finished processing 1 heartbeats
>> in
>> >> 93276 nanos
>> >> 2017-05-18 12:41:46,346 DEBUG [Process Cluster Protocol Request-4]
>> >> o.a.n.c.c.h.ClusterProtocolHeartbeatMonitor Received new heartbeat
>> from
>> >> centos-b:8080
>> >> 2017-05-18 12:41:46,346 DEBUG [Process Cluster Protocol Request-4]
>> >> o.a.n.c.c.h.ClusterProtocolHeartbeatMonitor
>> >>
>> >> Calculated diff between current cluster status and node cluster status
>> as
>> >> follows:
>> >> Node: [NodeConnectionStatus[nodeId=centos-b:8080, state=CONNECTED,
>> >> updateId=45], NodeConnectionStatus[nodeId=centos-a:8080,
>> state=CONNECTED,
>> >> updateId=42]]
>> >> Self: [NodeConnectionStatus[nodeId=centos-b:8080, state=CONNECTED,
>> >> updateId=45], NodeConnectionStatus[nodeId=centos-a:8080,
>> state=CONNECTED,
>> >> updateId=42]]
>> >> Difference: []
>> >>
>> >>
>> >>
>> >>
>> >> --
>> >> View this message in context: http://apache-nifi-users-list.2361937.n4.nabble.com/Nifi-Cluster-fails-to-disconnect-node-when-node-was-killed-tp1942p1950.html
>> >> Sent from the Apache NiFi Users List mailing list archive at
>> Nabble.com.
>>
>>
>

Re: Nifi Cluster fails to disconnect node when node was killed

Posted by Neil Derraugh <ne...@intellifylearning.com>.
Pretty sure this is the problem I was describing in the "Phantom Node"
thread recently.

If I kill non-primary nodes the cluster remains healthy despite the lost
nodes.  The terminated nodes end up with a DISCONNECTED status.

If I kill the primary it winds up with a CONNECTED status, but a new
primary/cluster coordinator gets elected too.

Additionally it seems in 1.2.0 that the REST API no longer supports deleting
a node in a CONNECTED state (Cannot remove Node with ID
1780fde7-c2f4-469c-9884-fe843eac5b73 because it is not disconnected,
current state = CONNECTED).  So right now I don't have a workaround and
have to kill all the nodes and start over.

On Thu, May 18, 2017 at 11:20 AM, Mark Payne <ma...@hotmail.com> wrote:

> Hello,
>
> Just looking through this thread now. I believe that I understand the
> problem. I have updated the JIRA with details about what I think is the
> problem and a potential remedy for the problem.
>
> Thanks
> -Mark
>
> > On May 18, 2017, at 9:49 AM, Matt Gilman <ma...@gmail.com>
> wrote:
> >
> > Thanks for the additional details. They will be helpful when working the
> JIRA. All nodes, including the coordinator, heartbeat to the active
> coordinator. This means that the coordinator effectively heartbeats to
> itself. It appears, based on your log messages, that this is not happening.
> Because no heartbeats were received from any node, the lack of heartbeats
> from the terminated node is not considered.
> >
> > Matt
> >
> > Sent from my iPhone
> >
> >> On May 18, 2017, at 8:30 AM, ddewaele <dd...@gmail.com> wrote:
> >>
> >> Found something interesting in the centos-b debug logging....
> >>
> >> after centos-a (the coordinator) is killed centos-b takes over. Notice
> how
> >> it "Will not disconnect any nodes due to lack of heartbeat" and how it
> still
> >> sees centos-a as connected despite the fact that there are no heartbeats
> >> anymore.
> >>
> >> 2017-05-18 12:41:38,010 INFO [Leader Election Notification Thread-2]
> >> o.apache.nifi.controller.FlowController This node elected Active
> Cluster
> >> Coordinator
> >> 2017-05-18 12:41:38,010 DEBUG [Leader Election Notification Thread-2]
> >> o.a.n.c.c.h.ClusterProtocolHeartbeatMonitor Purging old heartbeats
> >> 2017-05-18 12:41:38,014 INFO [Leader Election Notification Thread-1]
> >> o.apache.nifi.controller.FlowController This node has been elected
> Primary
> >> Node
> >> 2017-05-18 12:41:38,353 DEBUG [Heartbeat Monitor Thread-1]
> >> o.a.n.c.c.h.AbstractHeartbeatMonitor Received no new heartbeats. Will
> not
> >> disconnect any nodes due to lack of heartbeat
> >> 2017-05-18 12:41:41,336 DEBUG [Process Cluster Protocol Request-3]
> >> o.a.n.c.c.h.ClusterProtocolHeartbeatMonitor Received new heartbeat from
> >> centos-b:8080
> >> 2017-05-18 12:41:41,337 DEBUG [Process Cluster Protocol Request-3]
> >> o.a.n.c.c.h.ClusterProtocolHeartbeatMonitor
> >>
> >> Calculated diff between current cluster status and node cluster status
> as
> >> follows:
> >> Node: [NodeConnectionStatus[nodeId=centos-b:8080, state=CONNECTED,
> >> updateId=45], NodeConnectionStatus[nodeId=centos-a:8080,
> state=CONNECTED,
> >> updateId=42]]
> >> Self: [NodeConnectionStatus[nodeId=centos-b:8080, state=CONNECTED,
> >> updateId=45], NodeConnectionStatus[nodeId=centos-a:8080,
> state=CONNECTED,
> >> updateId=42]]
> >> Difference: []
> >>
> >>
> >> 2017-05-18 12:41:41,337 INFO [Process Cluster Protocol Request-3]
> >> o.a.n.c.p.impl.SocketProtocolListener Finished processing request
> >> 410e7db5-8bb0-4f97-8ee8-fc8647c54959 (type=HEARTBEAT, length=2341
> bytes)
> >> from centos-b:8080 in 3 millis
> >> 2017-05-18 12:41:41,339 INFO [Clustering Tasks Thread-2]
> >> o.a.n.c.c.ClusterProtocolHeartbeater Heartbeat created at 2017-05-18
> >> 12:41:41,330 and sent to centos-b:10001 at 2017-05-18 12:41:41,339; send
> >> took 8 millis
> >> 2017-05-18 12:41:43,354 INFO [Heartbeat Monitor Thread-1]
> >> o.a.n.c.c.h.AbstractHeartbeatMonitor Finished processing 1 heartbeats
> in
> >> 93276 nanos
> >> 2017-05-18 12:41:46,346 DEBUG [Process Cluster Protocol Request-4]
> >> o.a.n.c.c.h.ClusterProtocolHeartbeatMonitor Received new heartbeat from
> >> centos-b:8080
> >> 2017-05-18 12:41:46,346 DEBUG [Process Cluster Protocol Request-4]
> >> o.a.n.c.c.h.ClusterProtocolHeartbeatMonitor
> >>
> >> Calculated diff between current cluster status and node cluster status
> as
> >> follows:
> >> Node: [NodeConnectionStatus[nodeId=centos-b:8080, state=CONNECTED,
> >> updateId=45], NodeConnectionStatus[nodeId=centos-a:8080,
> state=CONNECTED,
> >> updateId=42]]
> >> Self: [NodeConnectionStatus[nodeId=centos-b:8080, state=CONNECTED,
> >> updateId=45], NodeConnectionStatus[nodeId=centos-a:8080,
> state=CONNECTED,
> >> updateId=42]]
> >> Difference: []
> >>
> >>
> >>
> >>
> >> --
> >> View this message in context: http://apache-nifi-users-list.2361937.n4.nabble.com/Nifi-Cluster-fails-to-disconnect-node-when-node-was-killed-tp1942p1950.html
> >> Sent from the Apache NiFi Users List mailing list archive at Nabble.com.
>
>

Re: Nifi Cluster fails to disconnect node when node was killed

Posted by Mark Payne <ma...@hotmail.com>.
Hello,

Just looking through this thread now. I believe that I understand the
problem. I have updated the JIRA with details about what I think the
problem is, and a potential remedy.

Thanks
-Mark

> On May 18, 2017, at 9:49 AM, Matt Gilman <ma...@gmail.com> wrote:
> 
> Thanks for the additional details. They will be helpful when working
> the JIRA. All nodes, including the coordinator, heartbeat to the
> active coordinator. This means that the coordinator effectively
> heartbeats to itself. It appears, based on your log messages, that
> this is not happening. Because no heartbeats were received from any
> node, the lack of heartbeats from the terminated node is not
> considered.
> 
> Matt
> 
> Sent from my iPhone
> 
>> On May 18, 2017, at 8:30 AM, ddewaele <dd...@gmail.com> wrote:
>> 
>> Found something interesting in the centos-b debug logging.... 
>> 
>> after centos-a (the coordinator) is killed centos-b takes over. Notice how
>> it "Will not disconnect any nodes due to lack of heartbeat" and how it still
>> sees centos-a as connected despite the fact that there are no heartbeats
>> anymore.
>> 
>> 2017-05-18 12:41:38,010 INFO [Leader Election Notification Thread-2]
>> o.apache.nifi.controller.FlowController This node elected Active Cluster
>> Coordinator
>> 2017-05-18 12:41:38,010 DEBUG [Leader Election Notification Thread-2]
>> o.a.n.c.c.h.ClusterProtocolHeartbeatMonitor Purging old heartbeats
>> 2017-05-18 12:41:38,014 INFO [Leader Election Notification Thread-1]
>> o.apache.nifi.controller.FlowController This node has been elected Primary
>> Node
>> 2017-05-18 12:41:38,353 DEBUG [Heartbeat Monitor Thread-1]
>> o.a.n.c.c.h.AbstractHeartbeatMonitor Received no new heartbeats. Will not
>> disconnect any nodes due to lack of heartbeat
>> 2017-05-18 12:41:41,336 DEBUG [Process Cluster Protocol Request-3]
>> o.a.n.c.c.h.ClusterProtocolHeartbeatMonitor Received new heartbeat from
>> centos-b:8080
>> 2017-05-18 12:41:41,337 DEBUG [Process Cluster Protocol Request-3]
>> o.a.n.c.c.h.ClusterProtocolHeartbeatMonitor 
>> 
>> Calculated diff between current cluster status and node cluster status as
>> follows:
>> Node: [NodeConnectionStatus[nodeId=centos-b:8080, state=CONNECTED,
>> updateId=45], NodeConnectionStatus[nodeId=centos-a:8080, state=CONNECTED,
>> updateId=42]]
>> Self: [NodeConnectionStatus[nodeId=centos-b:8080, state=CONNECTED,
>> updateId=45], NodeConnectionStatus[nodeId=centos-a:8080, state=CONNECTED,
>> updateId=42]]
>> Difference: []
>> 
>> 
>> 2017-05-18 12:41:41,337 INFO [Process Cluster Protocol Request-3]
>> o.a.n.c.p.impl.SocketProtocolListener Finished processing request
>> 410e7db5-8bb0-4f97-8ee8-fc8647c54959 (type=HEARTBEAT, length=2341 bytes)
>> from centos-b:8080 in 3 millis
>> 2017-05-18 12:41:41,339 INFO [Clustering Tasks Thread-2]
>> o.a.n.c.c.ClusterProtocolHeartbeater Heartbeat created at 2017-05-18
>> 12:41:41,330 and sent to centos-b:10001 at 2017-05-18 12:41:41,339; send
>> took 8 millis
>> 2017-05-18 12:41:43,354 INFO [Heartbeat Monitor Thread-1]
>> o.a.n.c.c.h.AbstractHeartbeatMonitor Finished processing 1 heartbeats in
>> 93276 nanos
>> 2017-05-18 12:41:46,346 DEBUG [Process Cluster Protocol Request-4]
>> o.a.n.c.c.h.ClusterProtocolHeartbeatMonitor Received new heartbeat from
>> centos-b:8080
>> 2017-05-18 12:41:46,346 DEBUG [Process Cluster Protocol Request-4]
>> o.a.n.c.c.h.ClusterProtocolHeartbeatMonitor 
>> 
>> Calculated diff between current cluster status and node cluster status as
>> follows:
>> Node: [NodeConnectionStatus[nodeId=centos-b:8080, state=CONNECTED,
>> updateId=45], NodeConnectionStatus[nodeId=centos-a:8080, state=CONNECTED,
>> updateId=42]]
>> Self: [NodeConnectionStatus[nodeId=centos-b:8080, state=CONNECTED,
>> updateId=45], NodeConnectionStatus[nodeId=centos-a:8080, state=CONNECTED,
>> updateId=42]]
>> Difference: []
>> 
>> 
>> 
>> 
>> --
>> View this message in context: http://apache-nifi-users-list.2361937.n4.nabble.com/Nifi-Cluster-fails-to-disconnect-node-when-node-was-killed-tp1942p1950.html
>> Sent from the Apache NiFi Users List mailing list archive at Nabble.com.


Re: Nifi Cluster fails to disconnect node when node was killed

Posted by ddewaele <dd...@gmail.com>.
Hi,
 
Just wanted to point out that the newly appointed coordinator (centos-b)
does end up sending heartbeats to itself as you described. 

2017-05-18 12:41:41,336 DEBUG [Process Cluster Protocol Request-3]
o.a.n.c.c.h.ClusterProtocolHeartbeatMonitor Received new heartbeat from
centos-b:8080

It seems heartbeats are purged when a new coordinator is selected.

https://github.com/apache/nifi/blob/b73ba7f8d4f6319881c26b8faad121ceb12041ab/nifi-nar-bundles/nifi-framework-bundle/nifi-framework/nifi-framework-cluster/src/main/java/org/apache/nifi/cluster/coordination/heartbeat/ClusterProtocolHeartbeatMonitor.java#L136

And disconnecting nodes can only be done based on existing heartbeats.

https://github.com/apache/nifi/blob/d838f61291d2582592754a37314911b701c6891b/nifi-nar-bundles/nifi-framework-bundle/nifi-framework/nifi-framework-cluster/src/main/java/org/apache/nifi/cluster/coordination/heartbeat/AbstractHeartbeatMonitor.java#L162

As the centos-a heartbeats were purged, centos-a never gets disconnected.
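
In other words, the interplay reduces to something like the sketch below
(class and method names are invented for illustration; the real logic is
in the files linked above). The disconnect check only iterates over nodes
it currently holds a heartbeat for, so a node whose heartbeat was purged
on election and that never reports again is never even evaluated:

    import java.util.Map;
    import java.util.concurrent.ConcurrentHashMap;

    // Simplified illustration of the purge/disconnect interplay, not NiFi code.
    class HeartbeatMonitorSketch {
        private final Map<String, Long> lastHeartbeat = new ConcurrentHashMap<>();

        // On winning the Cluster Coordinator election, old heartbeats are purged.
        void onElectedCoordinator() {
            lastHeartbeat.clear();
        }

        void onHeartbeat(String nodeId) {
            lastHeartbeat.put(nodeId, System.currentTimeMillis());
        }

        // Disconnection is decided only from heartbeats still held, so a node
        // purged above that never heartbeats again (the killed centos-a) is
        // never considered for disconnection.
        void disconnectStaleNodes(long thresholdMillis) {
            long now = System.currentTimeMillis();
            for (Map.Entry<String, Long> entry : lastHeartbeat.entrySet()) {
                if (now - entry.getValue() > thresholdMillis) {
                    System.out.println("Disconnecting " + entry.getKey()
                            + " due to lack of heartbeat");
                    lastHeartbeat.remove(entry.getKey());
                }
            }
        }
    }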




--
View this message in context: http://apache-nifi-users-list.2361937.n4.nabble.com/Nifi-Cluster-fails-to-disconnect-node-when-node-was-killed-tp1942p1954.html
Sent from the Apache NiFi Users List mailing list archive at Nabble.com.

Re: Nifi Cluster fails to disconnect node when node was killed

Posted by Matt Gilman <ma...@gmail.com>.
Thanks for the additional details. They will be helpful when working the
JIRA. All nodes, including the coordinator, heartbeat to the active
coordinator. This means that the coordinator effectively heartbeats to
itself. It appears, based on your log messages, that this is not
happening. Because no heartbeats were received from any node, the lack of
heartbeats from the terminated node is not considered.

Matt

Sent from my iPhone

> On May 18, 2017, at 8:30 AM, ddewaele <dd...@gmail.com> wrote:
> 
> Found something interesting in the centos-b debug logging.... 
> 
> after centos-a (the coordinator) is killed centos-b takes over. Notice how
> it "Will not disconnect any nodes due to lack of heartbeat" and how it still
> sees centos-a as connected despite the fact that there are no heartbeats
> anymore.
> 
> 2017-05-18 12:41:38,010 INFO [Leader Election Notification Thread-2]
> o.apache.nifi.controller.FlowController This node elected Active Cluster
> Coordinator
> 2017-05-18 12:41:38,010 DEBUG [Leader Election Notification Thread-2]
> o.a.n.c.c.h.ClusterProtocolHeartbeatMonitor Purging old heartbeats
> 2017-05-18 12:41:38,014 INFO [Leader Election Notification Thread-1]
> o.apache.nifi.controller.FlowController This node has been elected Primary
> Node
> 2017-05-18 12:41:38,353 DEBUG [Heartbeat Monitor Thread-1]
> o.a.n.c.c.h.AbstractHeartbeatMonitor Received no new heartbeats. Will not
> disconnect any nodes due to lack of heartbeat
> 2017-05-18 12:41:41,336 DEBUG [Process Cluster Protocol Request-3]
> o.a.n.c.c.h.ClusterProtocolHeartbeatMonitor Received new heartbeat from
> centos-b:8080
> 2017-05-18 12:41:41,337 DEBUG [Process Cluster Protocol Request-3]
> o.a.n.c.c.h.ClusterProtocolHeartbeatMonitor 
> 
> Calculated diff between current cluster status and node cluster status as
> follows:
> Node: [NodeConnectionStatus[nodeId=centos-b:8080, state=CONNECTED,
> updateId=45], NodeConnectionStatus[nodeId=centos-a:8080, state=CONNECTED,
> updateId=42]]
> Self: [NodeConnectionStatus[nodeId=centos-b:8080, state=CONNECTED,
> updateId=45], NodeConnectionStatus[nodeId=centos-a:8080, state=CONNECTED,
> updateId=42]]
> Difference: []
> 
> 
> 2017-05-18 12:41:41,337 INFO [Process Cluster Protocol Request-3]
> o.a.n.c.p.impl.SocketProtocolListener Finished processing request
> 410e7db5-8bb0-4f97-8ee8-fc8647c54959 (type=HEARTBEAT, length=2341 bytes)
> from centos-b:8080 in 3 millis
> 2017-05-18 12:41:41,339 INFO [Clustering Tasks Thread-2]
> o.a.n.c.c.ClusterProtocolHeartbeater Heartbeat created at 2017-05-18
> 12:41:41,330 and sent to centos-b:10001 at 2017-05-18 12:41:41,339; send
> took 8 millis
> 2017-05-18 12:41:43,354 INFO [Heartbeat Monitor Thread-1]
> o.a.n.c.c.h.AbstractHeartbeatMonitor Finished processing 1 heartbeats in
> 93276 nanos
> 2017-05-18 12:41:46,346 DEBUG [Process Cluster Protocol Request-4]
> o.a.n.c.c.h.ClusterProtocolHeartbeatMonitor Received new heartbeat from
> centos-b:8080
> 2017-05-18 12:41:46,346 DEBUG [Process Cluster Protocol Request-4]
> o.a.n.c.c.h.ClusterProtocolHeartbeatMonitor 
> 
> Calculated diff between current cluster status and node cluster status as
> follows:
> Node: [NodeConnectionStatus[nodeId=centos-b:8080, state=CONNECTED,
> updateId=45], NodeConnectionStatus[nodeId=centos-a:8080, state=CONNECTED,
> updateId=42]]
> Self: [NodeConnectionStatus[nodeId=centos-b:8080, state=CONNECTED,
> updateId=45], NodeConnectionStatus[nodeId=centos-a:8080, state=CONNECTED,
> updateId=42]]
> Difference: []
> 
> 
> 
> 
> --
> View this message in context: http://apache-nifi-users-list.2361937.n4.nabble.com/Nifi-Cluster-fails-to-disconnect-node-when-node-was-killed-tp1942p1950.html
> Sent from the Apache NiFi Users List mailing list archive at Nabble.com.

Re: Nifi Cluster fails to disconnect node when node was killed

Posted by ddewaele <dd...@gmail.com>.
Found something interesting in the centos-b debug logging.... 

After centos-a (the coordinator) is killed, centos-b takes over. Notice how
it says "Will not disconnect any nodes due to lack of heartbeat" and how it
still sees centos-a as connected despite the fact that there are no
heartbeats anymore.

2017-05-18 12:41:38,010 INFO [Leader Election Notification Thread-2]
o.apache.nifi.controller.FlowController This node elected Active Cluster
Coordinator
2017-05-18 12:41:38,010 DEBUG [Leader Election Notification Thread-2]
o.a.n.c.c.h.ClusterProtocolHeartbeatMonitor Purging old heartbeats
2017-05-18 12:41:38,014 INFO [Leader Election Notification Thread-1]
o.apache.nifi.controller.FlowController This node has been elected Primary
Node
2017-05-18 12:41:38,353 DEBUG [Heartbeat Monitor Thread-1]
o.a.n.c.c.h.AbstractHeartbeatMonitor Received no new heartbeats. Will not
disconnect any nodes due to lack of heartbeat
2017-05-18 12:41:41,336 DEBUG [Process Cluster Protocol Request-3]
o.a.n.c.c.h.ClusterProtocolHeartbeatMonitor Received new heartbeat from
centos-b:8080
2017-05-18 12:41:41,337 DEBUG [Process Cluster Protocol Request-3]
o.a.n.c.c.h.ClusterProtocolHeartbeatMonitor 

Calculated diff between current cluster status and node cluster status as
follows:
Node: [NodeConnectionStatus[nodeId=centos-b:8080, state=CONNECTED,
updateId=45], NodeConnectionStatus[nodeId=centos-a:8080, state=CONNECTED,
updateId=42]]
Self: [NodeConnectionStatus[nodeId=centos-b:8080, state=CONNECTED,
updateId=45], NodeConnectionStatus[nodeId=centos-a:8080, state=CONNECTED,
updateId=42]]
Difference: []


2017-05-18 12:41:41,337 INFO [Process Cluster Protocol Request-3]
o.a.n.c.p.impl.SocketProtocolListener Finished processing request
410e7db5-8bb0-4f97-8ee8-fc8647c54959 (type=HEARTBEAT, length=2341 bytes)
from centos-b:8080 in 3 millis
2017-05-18 12:41:41,339 INFO [Clustering Tasks Thread-2]
o.a.n.c.c.ClusterProtocolHeartbeater Heartbeat created at 2017-05-18
12:41:41,330 and sent to centos-b:10001 at 2017-05-18 12:41:41,339; send
took 8 millis
2017-05-18 12:41:43,354 INFO [Heartbeat Monitor Thread-1]
o.a.n.c.c.h.AbstractHeartbeatMonitor Finished processing 1 heartbeats in
93276 nanos
2017-05-18 12:41:46,346 DEBUG [Process Cluster Protocol Request-4]
o.a.n.c.c.h.ClusterProtocolHeartbeatMonitor Received new heartbeat from
centos-b:8080
2017-05-18 12:41:46,346 DEBUG [Process Cluster Protocol Request-4]
o.a.n.c.c.h.ClusterProtocolHeartbeatMonitor 

Calculated diff between current cluster status and node cluster status as
follows:
Node: [NodeConnectionStatus[nodeId=centos-b:8080, state=CONNECTED,
updateId=45], NodeConnectionStatus[nodeId=centos-a:8080, state=CONNECTED,
updateId=42]]
Self: [NodeConnectionStatus[nodeId=centos-b:8080, state=CONNECTED,
updateId=45], NodeConnectionStatus[nodeId=centos-a:8080, state=CONNECTED,
updateId=42]]
Difference: []




--
View this message in context: http://apache-nifi-users-list.2361937.n4.nabble.com/Nifi-Cluster-fails-to-disconnect-node-when-node-was-killed-tp1942p1950.html
Sent from the Apache NiFi Users List mailing list archive at Nabble.com.

Re: Nifi Cluster fails to disconnect node when node was killed

Posted by Matt Gilman <ma...@gmail.com>.
Hi,

Once the new coordinator is elected, it is responsible for disconnecting
nodes due to lack of heartbeat. It will wait 8 times the
configured nifi.cluster.protocol.heartbeat.interval before the node
is disconnected. Can you confirm that this amount of time has elapsed?
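
As a quick sanity check, assuming the default interval of 5 seconds
(worth confirming in your nifi.properties), that works out to roughly
the 41-second window visible in the log you posted. A trivial sketch of
the arithmetic:

    public class DisconnectThreshold {
        public static void main(String[] args) {
            // nifi.cluster.protocol.heartbeat.interval (assumed default: 5 seconds)
            int heartbeatIntervalSeconds = 5;
            // The active coordinator disconnects a node after 8 missed intervals.
            int thresholdSeconds = 8 * heartbeatIntervalSeconds;
            System.out.println("Expect a disconnect after ~" + thresholdSeconds
                    + " seconds without a heartbeat");
        }
    }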

Did you see any messages containing "Have not received a heartbeat from
node in" or "Failed to remove heartbeat for" during this time? Can you
describe your environment a little more? Are you running an external or
embedded zookeeper?

Can you enable debug level logging for this package?
org.apache.nifi.cluster.coordination.heartbeat

Thanks

Matt

On Thu, May 18, 2017 at 7:30 AM, ddewaele <dd...@gmail.com> wrote:

> Hi,
>
> I have a NiFi cluster up and running and I'm testing various failover
> scenarios.
>
> I have 2 nodes in the cluster :
>
> - centos-a : Coordinator node / primary
> - centos-b : Cluster node
>
> I noticed in 1 of the scenarios when I killed the Cluster Coordinator node,
> that the following happened :
>
> centos-b couldn't contact the coordinator anymore and became the new
> coordinator / primary node. (as expected) :
>
> Failed to send heartbeat due to:
> org.apache.nifi.cluster.protocol.ProtocolException: Failed to send message
> to Cluster Coordinator due to: java.net.ConnectException: Connection
> refused
> (Connection refused)
> This node has been elected Leader for Role 'Primary Node'
> This node has been elected Leader for Role 'Cluster Coordinator'
>
> When attempting to access the UI on centos-b, I got the following error :
>
> 2017-05-18 11:18:49,368 WARN [Replicate Request Thread-2]
> o.a.n.c.c.h.r.ThreadPoolRequestReplicator Failed to replicate request GET
> /nifi-api/flow/current-user to centos-a:8080 due to {}
>
> If my understanding is correct, NiFi will try to replicate to connected
> nodes in the cluster. Here, centos-a was killed a while back and should
> have
> been disconnected, but as far as NiFi was concerned it was still connected.
>
> As a result I cannot access the UI anymore (due to the replication error),
> but I can lookup the cluster info via the REST API. And sure enough, it
> still sees centos-a as being CONNECTED.
>
> {
>     "cluster": {
>         "generated": "11:20:13 UTC",
>         "nodes": [
>             {
>                 "activeThreadCount": 0,
>                 "address": "centos-b",
>                 "apiPort": 8080,
>                 "events": [
>                     {
>                         "category": "INFO",
>                         "message": "Node Status changed from CONNECTING to
> CONNECTED",
>                         "timestamp": "05/18/2017 11:17:31 UTC"
>                     },
>                     {
>                         "category": "INFO",
>                         "message": "Node Status changed from [Unknown Node]
> to CONNECTING",
>                         "timestamp": "05/18/2017 11:17:27 UTC"
>                     }
>                 ],
>                 "heartbeat": "05/18/2017 11:20:09 UTC",
>                 "nodeId": "a5bce78d-23ea-4435-a0dd-4b731459f1b9",
>                 "nodeStartTime": "05/18/2017 11:17:25 UTC",
>                 "queued": "8,492 / 13.22 MB",
>                 "roles": [
>                     "Primary Node",
>                     "Cluster Coordinator"
>                 ],
>                 "status": "CONNECTED"
>             },
>             {
>                 "address": "centos-a",
>                 "apiPort": 8080,
>                 "events": [],
>                 "nodeId": "b89e8418-4b7f-4743-bdf4-4a08a92f3892",
>                 "roles": [],
>                 "status": "CONNECTED"
>             }
>         ]
>     }
> }
>
> When centos-a was brought back online, I noticed the following status
> change
> :
>
> Status of centos-a:8080 changed from
> NodeConnectionStatus[nodeId=centos-a:8080, state=CONNECTED, updateId=15]
> to
> NodeConnectionStatus[nodeId=centos-a:8080, state=CONNECTING, updateId=19]
>
> So it went from connected -> connecting.
>
> It clearly missed the disconnected step here.
>
> When shutting down the centos-a node using nifi.sh stop, it goes into the
> DISCONNECTED state :
>
> Status of centos-a:8080 changed from
> NodeConnectionStatus[nodeId=centos-a:8080, state=CONNECTED, updateId=12]
> to
> NodeConnectionStatus[nodeId=centos-a:8080, state=DISCONNECTED, Disconnect
> Code=Node was Shutdown, Disconnect Reason=Node was Shutdown, updateId=13]
>
> How can I debug this further, and can somebody provide some additional
> insights? I have seen nodes getting disconnected due to missing heartbeats
>
> Status of centos-a:8080 changed from
> NodeConnectionStatus[nodeId=centos-a:8080, state=CONNECTED, updateId=10]
> to
> NodeConnectionStatus[nodeId=centos-a:8080, state=DISCONNECTED, Disconnect
> Code=Lack of Heartbeat, Disconnect Reason=Have not received a heartbeat
> from
> node in 41 seconds, updateId=11]
>
> But sometimes it doesn't seem to detect this, and NiFi keeps on thinking it
> is CONNECTED, despite not having received heartbeats in ages.
>
> Any ideas ?
>
>
>
> --
> View this message in context: http://apache-nifi-users-list.2361937.n4.nabble.com/Nifi-Cluster-fails-to-disconnect-node-when-node-was-killed-tp1942.html
> Sent from the Apache NiFi Users List mailing list archive at Nabble.com.
>

Re: Nifi Cluster fails to disconnect node when node was killed

Posted by ddewaele <dd...@gmail.com>.
I can reproduce the issue by killing the Java processes associated with the
cluster coordinator node.

The NiFi UI will not be accessible anymore until that particular node is
brought up again, or until the node entry is removed from the cluster (via
the REST API).

Killing non-coordinator nodes does result in NiFi detecting the heartbeat
loss and flagging the node as DISCONNECTED.




--
View this message in context: http://apache-nifi-users-list.2361937.n4.nabble.com/Nifi-Cluster-fails-to-disconnect-node-when-node-was-killed-tp1942p1947.html
Sent from the Apache NiFi Users List mailing list archive at Nabble.com.