Posted to issues@ignite.apache.org by "Ivan Bessonov (Jira)" <ji...@apache.org> on 2021/10/19 07:44:00 UTC
[jira] [Commented] (IGNITE-15733) Eventually failure of baseline registration.

    [ https://issues.apache.org/jira/browse/IGNITE-15733?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17430373#comment-17430373 ] 

Ivan Bessonov commented on IGNITE-15733:
----------------------------------------

This looks like a fundamental problem. I think it can also be reproduced with a fresh, clean node that starts first and then "accepts" nodes from a cluster that was previously activated.

The core reason is an assumption in the code that the first node to join itself into the cluster decides the cluster's tag and ID. As we can see, this is not necessarily correct.
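To make the failure mode concrete, here is a toy model of that assumption. It is only a sketch and not Ignite code: ToyNode, localJoinAsFirstNode and canAccept are hypothetical names, and the join check is a simplified stand-in for the real metastorage validation.

```java
import java.util.Objects;
import java.util.UUID;

/** Toy node; clusterId == null models "read from metastorage: null". */
final class ToyNode {
    UUID clusterId;

    ToyNode(UUID persistedId) {
        this.clusterId = persistedId;
    }

    /** The assumption under discussion: the node that joins itself first picks the defaults. */
    void localJoinAsFirstNode() {
        if (clusterId == null)
            clusterId = UUID.randomUUID();
    }

    /** Simplified join-time check (hypothetical): a mismatch models the reported conflict. */
    boolean canAccept(ToyNode joining) {
        return joining.clusterId == null || Objects.equals(clusterId, joining.clusterId);
    }
}

final class ConflictDemo {
    public static void main(String[] args) {
        UUID original = UUID.randomUUID();

        ToyNode node1 = new ToyNode(null);     // lost (or never persisted) the ID
        ToyNode node2 = new ToyNode(original); // still has the ID from the first activation

        node1.localJoinAsFirstNode();          // node1 starts first and invents a new ID

        System.out.println(node1.canAccept(node2)); // prints false: conflicting data
    }
}
```

The same mismatch appears in this model if node1 is a completely fresh node, which is the variation mentioned above.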

The old solution in the development branch (back in 2019) was to assign the cluster ID only upon the cluster's first activation. But that changed when we allowed writes into the DMS (distributed metastorage) before activation. Maybe restoring the old behavior for the tag and ID specifically, and also waiting for distributed metastorage data to persist on activation, would solve the issue, but I'm not sure.

What I am sure about is that "local join" is not a proper place to initialize defaults that can differ between nodes. So one way to fix it, as proposed by [~zstan], is to persist the cluster tag (at least) as soon as you receive it, right in the discovery thread. This lets the node see whether there was an attempt to set this value before the restart. In that case the node shouldn't try to set it on local join and should just wait instead. I'm sure there are corner cases in this solution as well. For example, what if none of the cluster nodes actually saved the tag and ID in the metastorage? It also doesn't solve the issue of the first node in the cluster being clean.
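A rough sketch of that proposal, under stated assumptions: LocalStore and NodeSketch are hypothetical stand-ins, not Ignite classes, and the "discovery thread" is reduced to a plain method call. It only illustrates the ordering (persist on receipt, check on local join), not how this would actually be wired into the discovery and metastorage code.

```java
import java.util.Optional;
import java.util.UUID;

/** Hypothetical stand-in for a node's local persistent storage; survives "restarts" in this model. */
final class LocalStore {
    private String persistedTag;

    void persistTag(String tag) {
        persistedTag = tag;
    }

    Optional<String> readTag() {
        return Optional.ofNullable(persistedTag);
    }
}

/** Hypothetical node; a new instance over the same LocalStore models a restart. */
final class NodeSketch {
    private final LocalStore store;
    private String tag; // in-memory view, lost on restart

    NodeSketch(LocalStore store) {
        this.store = store;
    }

    /** Models receiving the tag in the discovery thread: persist it right away. */
    void onTagReceived(String newTag) {
        store.persistTag(newTag);
        tag = newTag;
    }

    /** Models local join. Returns true only if this node generated a fresh default tag. */
    boolean onLocalJoin() {
        if (store.readTag().isPresent()) {
            // A tag was already seen before the restart: don't invent a new one, just wait.
            return false;
        }
        // Nothing was ever persisted here. This is the corner case that remains:
        // a clean first node still picks its own default.
        tag = "tag-" + UUID.randomUUID();
        return true;
    }

    String tag() {
        return tag;
    }
}
```

In this model a restarted node that had persisted a tag returns false from onLocalJoin and keeps tag() null until the cluster's value arrives, while a node with an empty store still generates its own default. That second case is exactly the one this approach does not cover.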

These are my thoughts. I have no working solution for now.

 

> Eventually failure of baseline registration.
> --------------------------------------------
>
>                 Key: IGNITE-15733
>                 URL: https://issues.apache.org/jira/browse/IGNITE-15733
>             Project: Ignite
>          Issue Type: Bug
>          Components: general
>    Affects Versions: 2.11
>            Reporter: Evgeny Stanilovsky
>            Priority: Major
>         Attachments: _Community_Edition_Cache_9_18998_cut.log
>
>
> All info can be found in the attached log.
> Briefly: a cluster of 2 nodes with persistence; start the nodes sequentially, activate, stop the nodes using org.apache.ignite.Ignite#close, start the nodes, activate.
> Expected:
> 1 node : Cluster ID and tag has been read from metastorage: null
> 2 node : Cluster ID and tag has been read from metastorage: null
> stop
> start
> 1 node: Cluster ID and tag has been read from metastorage: ClusterIdAndTag [id=some_id, tag=some_tag]
> 2 node: Cluster ID and tag has been read from metastorage: ClusterIdAndTag [id=some_id, tag=some_tag]
> But obtained (see the attachment):
>  
> 1 node : Cluster ID and tag has been read from metastorage: null
> 2 node : Cluster ID and tag has been read from metastorage: null
> stop
> start
> 1 node: Cluster ID and tag has been read from metastorage: null
> 2 node: Cluster ID and tag has been read from metastorage: ClusterIdAndTag [id=some_id, tag=some_tag]
> and as a result : 
> _Joining node has conflicting distributed metastorage data_
> [^_Community_Edition_Cache_9_18998_cut.log]
> The test MetricsConfigurationTest.testNodeRestart [1] is flaky.
> [1][https://ci.ignite.apache.org/buildConfiguration/IgniteTests24Java8_Cache9/6220901?buildTab=tests&name=MetricsConfigurationTes&view=tests&status=passed&suite=org.apache.ignite.testsuites.IgniteCacheTestSuite9%3A+&package=org.apache.ignite.internal.metric&expandedTest=build%3A%28id%3A6220901%29%2Cid%3A576406]
>  


