You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues@ignite.apache.org by "Andrew Mashenkov (JIRA)" <ji...@apache.org> on 2017/09/04 15:13:00 UTC
[jira] [Assigned] (IGNITE-6256) When a node becomes segmented an AssertionError is thrown during GridDhtPartitionTopologyImpl.removeNode

     [ https://issues.apache.org/jira/browse/IGNITE-6256?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Andrew Mashenkov reassigned IGNITE-6256:
----------------------------------------

    Assignee: Andrew Mashenkov

> When a node becomes segmented an AssertionError is thrown during GridDhtPartitionTopologyImpl.removeNode
> --------------------------------------------------------------------------------------------------------
>
>                 Key: IGNITE-6256
>                 URL: https://issues.apache.org/jira/browse/IGNITE-6256
>             Project: Ignite
>          Issue Type: Bug
>          Components: general
>    Affects Versions: 1.8
>            Reporter: Alexandr Fedotov
>            Assignee: Andrew Mashenkov
>             Fix For: 2.3
>
>
> The assert is as follows:
> exception="java.lang.AssertionError: null
>  at org.apache.ignite.internal.processors.cache.distributed.dht.GridDhtPartitionTopologyImpl.removeNode(GridDhtPartitionTopologyImpl.java:1422)
>  at org.apache.ignite.internal.processors.cache.distributed.dht.GridDhtPartitionTopologyImpl.beforeExchange(GridDhtPartitionTopologyImpl.java:490)
>  at org.apache.ignite.internal.processors.cache.distributed.dht.preloader.GridDhtPartitionsExchangeFuture.distributedExchange(GridDhtPartitionsExchangeFuture.java:769)
>  at org.apache.ignite.internal.processors.cache.distributed.dht.preloader.GridDhtPartitionsExchangeFuture.init(GridDhtPartitionsExchangeFuture.java:504)
>  at org.apache.ignite.internal.processors.cache.GridCachePartitionExchangeManager$ExchangeWorker.body(GridCachePartitionExchangeManager.java:1689)
>  at org.apache.ignite.internal.util.worker.GridWorker.run(GridWorker.java:110)
>  at java.lang.Thread.run(Thread.java:745)
> Below is the sequence of steps that leads to the assertion error:
> 1) A node becomes SEGMENTED when it's determined by SegmentCheckWorker, after an EVT_NODE_FAILED has been received.
> 2) It gets visibleRemoteNodes from it's TcpDiscoveryNodesRing
> 3) Clears the TcpDiscoveryNodesRing leaving only self on the list. The node ring is used to determine if a node is alive
> during DiscoCache creation
> 4) After that, the node initiates removal of all the nodes read in step 2
> 5) For each node, it sends an EVT_NODE_FAILED to the corresponding DiscoverySpiListener
> providing a topology containing all the nodes except already processed
> 6) This event gets into GridDiscoveryManager 
> 7) The node gets removed from alive nodes for every DiscoCache in discoCacheHist
> 8) Topology change is detected
> 9) Creation of a new DiscoCache is attempted. At this moment every remote node is not available due to the
> TcpDiscoveryNodesRing has been cleared, thus resulting in a DiscoCache with empty alives
> 10) The event with the created DiscoCache and the new topology version is passed to DiscoveryWorker
> 11) The event is eventually handled by DiscoveryWorker and is recorded by DiscoveryWorker#recordEvent
> 12) The recording is handled by GridEventStorageManager which notifies every listener for this event type (EVT_NODE_FAILED)
> 13) One of the listeners is GridCachePartitionExchangeManager#discoLsnr
> It creates a new GridDhtPartitionsExchangeFuture with the empty DiscoCache received with the event and enqueues it
> 14) The future gets eventually handled by GridDhtPartitionsExchangeFuture and initialized
> 15) updateTopologies is called, which for each GridCacheContext gets its topology (GridDhtPartitionTopology)
> and calls GridDhtPartitionTopology#updateTopologyVersion
> 16) DiscoCache for GridDhtPartitionTopology is assigned from the one of the GridDhtPartitionsExchangeFuture.
> The assigned DiscoCache has empty alives at the moment
> 15) A distributed exchange is handled (GridDhtPartitionsExchangeFuture#distributedExchange)
> 16) For each cache context GridCacheContext, for its topology (GridDhtPartitionTopologyImpl) GridDhtPartitionTopologyImpl#beforeExchange is called
> 17) The fact that the node has left is determined and GridDhtPartitionTopologyImpl#removeNode is called to handle it
> 18) An attempt is made to get the alive coordinator node by calling DiscoCache#oldestAliveServerNode
> 19) null is returned which results in an AssertionError
> The fix should probably prevent initiating exchange futures if a node has segmented.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)