Posted to user@ignite.apache.org by Kseniya Romanova <ro...@gmail.com> on 2020/10/19 14:33:24 UTC

Re: Server node failed to rejoin the cluster with exception

Hi Ping! In case the question is still relevant, you can join
tomorrow's Q&A session [1] to put it to the Ignite developers.

Cheers,
Kseniya

[1] https://www.meetup.com/Apache-Ignite-Virtual-Meetup/events/273921637/

On Tue, Aug 4, 2020 at 01:26, pinghao99 <pi...@gmail.com> wrote:

> Hi All,
>
> I set up a cluster of 6 server nodes running Ignite 2.7.5. Cache A was
> created in partitioned mode with backups = 1, and it uses a durable
> (persistent) data region. All nodes started, and the baseline contains
> 6 nodes.
>
> An Ignite client was started with the baseline monitoring code copied from
>
> https://apacheignite.readme.io/v2.7.5/docs/baseline-topology#triggering-rebalancing-programmatically
>
> The client runs an endless loop that simply does a single put into cache A
> every second.
>
> Then I manually stopped the nodes one by one, waiting at least a few
> seconds between stops. All cache puts succeeded, since the cluster
> rebalanced each time a node left.
>
> Then I gradually brought the nodes back. Some of them rejoined the cluster
> without error, but there was always at least one node that failed to join,
> with these exceptions:
>
> [15:13:08] Security status [authentication=off, tls/ssl=off]
> [15:13:09] Ignite node stopped in the middle of checkpoint. Will restore memory state and finish checkpoint on node start.
> [15:13:09,487][SEVERE][main][IgniteKernal] Exception during start processors, node will be stopped and close connections
> class org.apache.ignite.IgniteCheckedException: Restoring of BaselineTopology history has failed, expected history item not found for id=0
>         at org.apache.ignite.internal.processors.cluster.BaselineTopologyHistory.restoreHistory(BaselineTopologyHistory.java:54)
>         at org.apache.ignite.internal.processors.cluster.GridClusterStateProcessor.onReadyForRead(GridClusterStateProcessor.java:223)
>         at org.apache.ignite.internal.processors.cache.persistence.GridCacheDatabaseSharedManager.notifyMetastorageReadyForRead(GridCacheDatabaseSharedManager.java:397)
>         at org.apache.ignite.internal.processors.cache.persistence.GridCacheDatabaseSharedManager.readMetastore(GridCacheDatabaseSharedManager.java:663)
>         at org.apache.ignite.internal.processors.cache.persistence.GridCacheDatabaseSharedManager.notifyMetaStorageSubscribersOnReadyForRead(GridCacheDatabaseSharedManager.java:4611)
>         at org.apache.ignite.internal.IgniteKernal.start(IgniteKernal.java:1048)
>         at org.apache.ignite.internal.IgnitionEx$IgniteNamedInstance.start0(IgnitionEx.java:2038)
>         at org.apache.ignite.internal.IgnitionEx$IgniteNamedInstance.start(IgnitionEx.java:1730)
>         at org.apache.ignite.internal.IgnitionEx.start0(IgnitionEx.java:1158)
>         at org.apache.ignite.internal.IgnitionEx.startConfigurations(IgnitionEx.java:1076)
>         at org.apache.ignite.internal.IgnitionEx.start(IgnitionEx.java:962)
>         at org.apache.ignite.internal.IgnitionEx.start(IgnitionEx.java:861)
>         at org.apache.ignite.internal.IgnitionEx.start(IgnitionEx.java:731)
>         at org.apache.ignite.internal.IgnitionEx.start(IgnitionEx.java:700)
>         at org.apache.ignite.Ignition.start(Ignition.java:348)
>         at org.apache.ignite.startup.cmdline.CommandLineStartup.main(CommandLineStartup.java:301)
> [15:13:09,489][SEVERE][main][IgniteKernal] Got exception while starting (will rollback startup routine).
> class org.apache.ignite.IgniteCheckedException: Restoring of BaselineTopology history has failed, expected history item not found for id=0
>         at org.apache.ignite.internal.processors.cluster.BaselineTopologyHistory.restoreHistory(BaselineTopologyHistory.java:54)
>         at org.apache.ignite.internal.processors.cluster.GridClusterStateProcessor.onReadyForRead(GridClusterStateProcessor.java:223)
>         at org.apache.ignite.internal.processors.cache.persistence.GridCacheDatabaseSharedManager.notifyMetastorageReadyForRead(GridCacheDatabaseSharedManager.java:397)
>         at org.apache.ignite.internal.processors.cache.persistence.GridCacheDatabaseSharedManager.readMetastore(GridCacheDatabaseSharedManager.java:663)
>         at org.apache.ignite.internal.processors.cache.persistence.GridCacheDatabaseSharedManager.notifyMetaStorageSubscribersOnReadyForRead(GridCacheDatabaseSharedManager.java:4611)
>         at org.apache.ignite.internal.IgniteKernal.start(IgniteKernal.java:1048)
>         at org.apache.ignite.internal.IgnitionEx$IgniteNamedInstance.start0(IgnitionEx.java:2038)
>         at org.apache.ignite.internal.IgnitionEx$IgniteNamedInstance.start(IgnitionEx.java:1730)
>         at org.apache.ignite.internal.IgnitionEx.start0(IgnitionEx.java:1158)
>         at org.apache.ignite.internal.IgnitionEx.startConfigurations(IgnitionEx.java:1076)
>         at org.apache.ignite.internal.IgnitionEx.start(IgnitionEx.java:962)
>         at org.apache.ignite.internal.IgnitionEx.start(IgnitionEx.java:861)
>         at org.apache.ignite.internal.IgnitionEx.start(IgnitionEx.java:731)
>         at org.apache.ignite.internal.IgnitionEx.start(IgnitionEx.java:700)
>         at org.apache.ignite.Ignition.start(Ignition.java:348)
>         at org.apache.ignite.startup.cmdline.CommandLineStartup.main(CommandLineStartup.java:301)
> [15:13:09] Ignite node stopped OK [uptime=00:00:01.800]
> class org.apache.ignite.IgniteException: Restoring of BaselineTopology history has failed, expected history item not found for id=0
>         at org.apache.ignite.internal.util.IgniteUtils.convertException(IgniteUtils.java:1026)
>         at org.apache.ignite.Ignition.start(Ignition.java:351)
>         at org.apache.ignite.startup.cmdline.CommandLineStartup.main(CommandLineStartup.java:301)
> Caused by: class org.apache.ignite.IgniteCheckedException: Restoring of BaselineTopology history has failed, expected history item not found for id=0
>         at org.apache.ignite.internal.processors.cluster.BaselineTopologyHistory.restoreHistory(BaselineTopologyHistory.java:54)
>         at org.apache.ignite.internal.processors.cluster.GridClusterStateProcessor.onReadyForRead(GridClusterStateProcessor.java:223)
>         at org.apache.ignite.internal.processors.cache.persistence.GridCacheDatabaseSharedManager.notifyMetastorageReadyForRead(GridCacheDatabaseSharedManager.java:397)
>         at org.apache.ignite.internal.processors.cache.persistence.GridCacheDatabaseSharedManager.readMetastore(GridCacheDatabaseSharedManager.java:663)
>         at org.apache.ignite.internal.processors.cache.persistence.GridCacheDatabaseSharedManager.notifyMetaStorageSubscribersOnReadyForRead(GridCacheDatabaseSharedManager.java:4611)
>         at org.apache.ignite.internal.IgniteKernal.start(IgniteKernal.java:1048)
>         at org.apache.ignite.internal.IgnitionEx$IgniteNamedInstance.start0(IgnitionEx.java:2038)
>         at org.apache.ignite.internal.IgnitionEx$IgniteNamedInstance.start(IgnitionEx.java:1730)
>         at org.apache.ignite.internal.IgnitionEx.start0(IgnitionEx.java:1158)
>         at org.apache.ignite.internal.IgnitionEx.startConfigurations(IgnitionEx.java:1076)
>         at org.apache.ignite.internal.IgnitionEx.start(IgnitionEx.java:962)
>         at org.apache.ignite.internal.IgnitionEx.start(IgnitionEx.java:861)
>         at org.apache.ignite.internal.IgnitionEx.start(IgnitionEx.java:731)
>         at org.apache.ignite.internal.IgnitionEx.start(IgnitionEx.java:700)
>         at org.apache.ignite.Ignition.start(Ignition.java:348)
>         ... 1 more
> Failed to start grid: Restoring of BaselineTopology history has failed, expected history item not found for id=0
>
> ================
> The workaround is to wipe the Ignite data directory on the failed node;
> it can then rejoin without issue.
>
> This is quite reproducible and looks like an Ignite bug. A rejoining
> node, even if it holds outdated data, is not supposed to cause an
> exception; the outdated data can be safely ignored, letting the node
> rejoin the cluster with a clean slate.
>
> Because of this issue, our production deployment cannot recover when
> nodes sporadically leave and rejoin.
>
> Is this the same as the unresolved issue
> https://issues.apache.org/jira/browse/IGNITE-12850?  I don't know what
> "metastorage" means in the ticket.
>
> Any suggestions?
>
> Thanks & Regards
> Ping
>
> --
> Sent from: http://apache-ignite-users.70518.x6.nabble.com/
>
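
For anyone scripting the workaround quoted above (wiping the data directory
of the node that fails to start), here is a minimal sketch in plain Java. It
only deletes a directory tree; it does not touch any Ignite API. The class
name WipeNodeStorage and the command-line argument are hypothetical, and
which directories actually belong to a node depends on its storage
configuration, so the path is left for the caller to supply. It assumes the
node process is already stopped.

```java
import java.io.IOException;
import java.nio.file.FileVisitResult;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.nio.file.SimpleFileVisitor;
import java.nio.file.attribute.BasicFileAttributes;

// Hypothetical helper, not part of Ignite: removes a stopped node's storage directory.
public class WipeNodeStorage {

    /** Recursively deletes dir and everything under it; does nothing if dir is absent. */
    public static void wipe(Path dir) throws IOException {
        if (!Files.exists(dir)) {
            return;
        }
        Files.walkFileTree(dir, new SimpleFileVisitor<Path>() {
            @Override
            public FileVisitResult visitFile(Path file, BasicFileAttributes attrs) throws IOException {
                Files.delete(file); // files are removed first
                return FileVisitResult.CONTINUE;
            }

            @Override
            public FileVisitResult postVisitDirectory(Path d, IOException e) throws IOException {
                if (e != null) {
                    throw e; // surface traversal errors instead of half-deleting silently
                }
                Files.delete(d); // each directory is removed once it is empty
                return FileVisitResult.CONTINUE;
            }
        });
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 1) {
            System.err.println("usage: WipeNodeStorage <storage-dir-of-stopped-node>");
            return;
        }
        wipe(Paths.get(args[0]));
    }
}
```

This discards everything the node had persisted, so it is only reasonable
when the remaining nodes still hold a full copy of the data, as in the
backups = 1 setup described in the quoted message.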