Posted to issues@ignite.apache.org by "Sergey Chugunov (JIRA)" <ji...@apache.org> on 2017/07/04 15:45:00 UTC

[jira] [Updated] (IGNITE-5115) Investigation of failing tests of coordinator node failure

     [ https://issues.apache.org/jira/browse/IGNITE-5115?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sergey Chugunov updated IGNITE-5115:
------------------------------------
    Fix Version/s:     (was: 2.1)
                   2.2

> Investigation of failing tests of coordinator node failure 
> -----------------------------------------------------------
>
>                 Key: IGNITE-5115
>                 URL: https://issues.apache.org/jira/browse/IGNITE-5115
>             Project: Ignite
>          Issue Type: Task
>          Components: messaging
>            Reporter: Sergey Chugunov
>             Fix For: 2.2
>
>
> Tests *customEventCoordinatorFailure1/2* from *TcpDiscoverySelfTest* are flaky on TC and sometimes hang with the following assertion error in the logs:
> {code}
> Exception in thread "tcp-disco-msg-worker-#5245%tcp.TcpDiscoverySelfTest0%" java.lang.AssertionError
> 	at org.apache.ignite.spi.discovery.tcp.internal.TcpDiscoveryNodesRing.removeNode(TcpDiscoveryNodesRing.java:353)
> 	at org.apache.ignite.spi.discovery.tcp.ServerImpl$RingMessageWorker.processNodeFailedMessage(ServerImpl.java:4670)
> 	at org.apache.ignite.spi.discovery.tcp.ServerImpl$RingMessageWorker.processMessage(ServerImpl.java:2567)
> 	at org.apache.ignite.spi.discovery.tcp.ServerImpl$RingMessageWorker.processMessage(ServerImpl.java:2366)
> 	at org.apache.ignite.spi.discovery.tcp.ServerImpl$MessageWorkerAdapter.body(ServerImpl.java:6485)
> 	at org.apache.ignite.spi.discovery.tcp.ServerImpl$RingMessageWorker.body(ServerImpl.java:2456)
> 	at org.apache.ignite.spi.IgniteSpiThread.run(IgniteSpiThread.java:62)
> {code}
> It seems that this happens because the tests drop the *TcpCommunicationSpi* connections on the coordinator node by calling the *simulateNodeFailure* method.
> At the same time the tests leave *TcpDiscoverySpi* operational; it receives a subsequent NodeFailed message and throws the assertion error shown above.
> The scenario looks legitimate: it is possible to imagine CommSPI connections on the coordinator failing for some reason while DiscoSPI connections remain healthy.
> We need to investigate the situation further, determine whether the use of *simulateNodeFailure* is the root cause, and propose a solution if the error can occur in real deployments.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)