Posted to issues@ignite.apache.org by "Alexey Goncharuk (JIRA)" <ji...@apache.org> on 2019/03/01 10:49:00 UTC

[jira] [Commented] (IGNITE-11394) Infinite No next node in topology messages during node restart scenario

    [ https://issues.apache.org/jira/browse/IGNITE-11394?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16781567#comment-16781567 ] 

Alexey Goncharuk commented on IGNITE-11394:
-------------------------------------------

[~sergey-chugunov], this is the first solution I tried, but dropping the metrics update message breaks client discovery tests for some reason. I will create an additional ticket for investigation shortly.

> Infinite No next node in topology messages during node restart scenario
> -----------------------------------------------------------------------
>
>                 Key: IGNITE-11394
>                 URL: https://issues.apache.org/jira/browse/IGNITE-11394
>             Project: Ignite
>          Issue Type: Improvement
>            Reporter: Alexey Goncharuk
>            Assignee: Alexey Goncharuk
>            Priority: Major
>             Fix For: 2.8
>
>          Time Spent: 10m
>  Remaining Estimate: 0h
>
> I observe a situation with the following symptoms during a cyclic node restart:
>  - A node that is joining the cluster sends a join request, receives a NodeAddedMessage, and awaits a NodeAddFinishedMessage
>  - The node receives a metrics update message, which stays in the queue
>  - The whole cluster is restarted and a new ring is formed
>  - The node re-sends the join request, which is successfully processed by the ring
>  - The node added message is received by the joining node
>  - The node detects that it cannot send messages (the set of failed nodes contains all remote nodes of the ring)
>  - Since there was already a metrics update message in the queue, the node attempts to re-add the message to the queue. Because the metrics update message is a high-priority message, it is added to the head of the queue, and the node gets stuck in an infinite loop
> I suggest dropping the metrics update message in {{sendMessageAcrossRing}} when we hit the {{No next node in topology}} situation.
> Another question is why we don't pass the collection of failed nodes to the {{ring.hasRemoteNodes()}} method.
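
The loop described above can be illustrated with a small standalone model. This is only a sketch under stated assumptions, not Ignite code: the class NoNextNodeSketch, the enum Msg, and the method drain are hypothetical names, the queue is a plain ArrayDeque, and "no next node in topology" is assumed to hold for the whole run. It shows why re-adding a high-priority message to the head of the queue starves everything behind it, and how dropping that message lets the queue drain.

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Hypothetical model of the message worker queue; not the actual Ignite implementation.
public class NoNextNodeSketch {
    enum Msg { METRICS_UPDATE, OTHER }

    /**
     * Processes the queue while "no next node in topology" is assumed to hold.
     * Returns the number of messages taken from the queue before it drained,
     * or -1 if the iteration budget ran out with messages still queued.
     */
    static int drain(Deque<Msg> queue, boolean dropMetricsUpdate, int maxIterations) {
        for (int i = 1; i <= maxIterations; i++) {
            Msg msg = queue.pollFirst();

            if (msg == null)
                return i - 1; // Queue is empty, nothing left to process.

            if (msg == Msg.METRICS_UPDATE) {
                if (dropMetricsUpdate)
                    continue; // Suggested behavior: drop the message, it cannot be sent anyway.

                // Reported behavior: the high-priority message goes back to the head,
                // so the next poll returns the same message again.
                queue.addFirst(msg);
            }
            // Any other message is assumed to be handled locally and leaves the queue.
        }

        return queue.isEmpty() ? maxIterations : -1;
    }

    public static void main(String[] args) {
        Deque<Msg> stuck = new ArrayDeque<>();
        stuck.add(Msg.METRICS_UPDATE);
        stuck.add(Msg.OTHER);
        System.out.println("re-add to head: " + drain(stuck, false, 1000));

        Deque<Msg> fixed = new ArrayDeque<>();
        fixed.add(Msg.METRICS_UPDATE);
        fixed.add(Msg.OTHER);
        System.out.println("drop metrics update: " + drain(fixed, true, 1000));
    }
}
```

In this model the first call never gets past the metrics update message and exhausts its iteration budget, while the second call drains both messages. The iteration budget exists only so the sketch terminates; the scenario in the ticket has no such bound.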



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)