Posted to user@ignite.apache.org by "privacyfirst@codesandnotes.be" <pr...@codesandnotes.be> on 2021/08/18 19:45:39 UTC

Nodes splitting and going their separate ways (each with a different topology version)

Dear Team,

I've been banging my head against this issue for weeks now and I really 
hope you'll be able to help me out in understanding what's going wrong.

I have two Ignite nodes running on two OVH VPS with Ubuntu 18.04: let's 
call them "coordinator" and "backup".

The two nodes share the same configuration, including a 
TcpDiscoveryVmIpFinder with static IPs, an SSL context factory for 
exchanging data securely, etc. (see configuration at the end of the email)

The persisted caches are configured so that data has one backup, the 
objective being to keep my data on two separate servers, in case OVH's 
servers burn down (which is exactly what happened to the Strasbourg site 
this year...). The code goes:

CacheConfiguration<String, User> configuration = new CacheConfiguration<String, User>()
        .setSqlSchema(SQL_SCHEMA)
        .setName(MY_CACHE)
        .setCacheMode(CacheMode.PARTITIONED)
        .setIndexedTypes(String.class, Entity.class)
        .setAtomicityMode(CacheAtomicityMode.TRANSACTIONAL)
        .setBackups(1)
        .setAffinity(new RendezvousAffinityFunction(true));

So the nodes are started and for a few days all seems to go well. If I 
look at the baseline using control.sh I get:

     Control utility [ver. 2.10.0#20210310-sha1:bc24f6ba]
     2021 Copyright(C) Apache Software Foundation
     User: root
     Time: 2021-08-18T04:00:04.663
     Command [BASELINE] started
     Arguments: --baseline
--------------------------------------------------------------------------------
     Cluster state: active
     Current topology version: 5
     Baseline auto adjustment disabled: softTimeout=300000

     Current topology version: 5 (Coordinator: 
ConsistentId=1defbc7d-427c-4264-b64c-23c8dd6f9277, 
Address=vps611671.ovh.net/coordinator.ip.address, Order=2)

     Baseline nodes:
         ConsistentId=1defbc7d-427c-4264-b64c-23c8dd6f9277, 
Address=vps611671.ovh.net/coordinator.ip.address, State=ONLINE, Order=2
         ConsistentId=c9fe91c7-117c-449e-b7bc-829c1bc8ae46, 
Address=205.ip-54-38-190.eu/backup.ip.address, State=ONLINE, Order=5
--------------------------------------------------------------------------------
     Number of baseline nodes: 2

     Other nodes not found.
     Command [BASELINE] finished with code: 0
     Control utility has completed execution at: 2021-08-18T04:00:06.313
     Execution time: 1650 ms

Unfortunately, after a few days I went to check the baseline and found 
that each node had gone its own way: the coordinator moved to topVer 6 
and the backup node to topVer 7. Each node reports the other as OFFLINE.

Coordinator's baseline:

--------------------------------------------------------------------------------
     Cluster state: active
     Current topology version: 6
     Baseline auto adjustment disabled: softTimeout=300000

     Current topology version: 6 (Coordinator: 
ConsistentId=1defbc7d-427c-4264-b64c-23c8dd6f9277, 
Address=vps611671.ovh.net/coordinator.ip.address, Order=2)

     Baseline nodes:
         ConsistentId=1defbc7d-427c-4264-b64c-23c8dd6f9277, 
Address=vps611671.ovh.net/coordinator.ip.address, State=ONLINE, Order=2
         ConsistentId=c9fe91c7-117c-449e-b7bc-829c1bc8ae46, State=OFFLINE
--------------------------------------------------------------------------------
     Number of baseline nodes: 2

     Other nodes not found.
     Command [BASELINE] finished with code: 0
     Control utility has completed execution at: 2021-08-18T19:46:56.847
     Execution time: 1173 ms

Backup's baseline:

--------------------------------------------------------------------------------
     Cluster state: active
     Current topology version: 7
     Baseline auto adjustment disabled: softTimeout=300000

     Current topology version: 7 (Coordinator: 
ConsistentId=c9fe91c7-117c-449e-b7bc-829c1bc8ae46, 
Address=205.ip-54-38-190.eu/backup.ip.address, Order=5)

     Baseline nodes:
         ConsistentId=1defbc7d-427c-4264-b64c-23c8dd6f9277, State=OFFLINE
         ConsistentId=c9fe91c7-117c-449e-b7bc-829c1bc8ae46, 
Address=205.ip-54-38-190.eu/backup.ip.address, State=ONLINE, Order=5
--------------------------------------------------------------------------------
     Number of baseline nodes: 2

     Other nodes not found.
     Command [BASELINE] finished with code: 0
     Control utility has completed execution at: 2021-08-18T19:49:38.300
     Execution time: 397 ms

At this point I can't restart the "backup" node, otherwise I get this 
nice error:

Caused by: class org.apache.ignite.spi.IgniteSpiException: 
BaselineTopology of joining node (72fbc939-bf09-42cb-a7e4-12896046cfc0) 
is not compatible with BaselineTopology in the cluster. Branching 
history of cluster BlT ([1060612220]) doesn't contain branching point 
hash of joining node BlT (173037243). Consider cleaning persistent 
storage of the node and adding it to the cluster again.
         at 
org.apache.ignite.spi.discovery.tcp.TcpDiscoverySpi.checkFailedError(TcpDiscoverySpi.java:2052)
         at 
org.apache.ignite.spi.discovery.tcp.ServerImpl.joinTopology(ServerImpl.java:1197)
         at 
org.apache.ignite.spi.discovery.tcp.ServerImpl.spiStart(ServerImpl.java:472)
         at 
org.apache.ignite.spi.discovery.tcp.TcpDiscoverySpi.spiStart(TcpDiscoverySpi.java:2154)
         at 
org.apache.ignite.internal.managers.GridManagerAdapter.startSpi(GridManagerAdapter.java:278)
         ... 13 more

The only option I've found so far is to remove the backup node from the 
coordinator's instance with a "--baseline remove", then essentially wipe 
out the "backup" node's work folder, restart it, and reattach it to the 
coordinator with a "--baseline add".

Obviously that's a drastic workaround, since it triggers a rebalancing 
of several gigabytes of data. Unfortunately it's the only one I've found 
so far (as suggested in 
https://stackoverflow.com/questions/66807984/clusters-baselinehistory-doesnt-contain-branching-point-hash-of-joining-node-b). 
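For reference, here is that workaround spelled out as a dry-run script (it 
only echoes the commands, it doesn't run them). The consistent ID is the 
backup node's, taken from the baseline output above; the work-directory 
path is an assumption about a default install layout, not something I've 
verified on every host:

```shell
# Dry-run sketch of the wipe-and-rejoin workaround (echoes commands only).
# BACKUP_ID comes from the baseline output above; WORK_DIR is an assumed
# default install path -- check it before actually deleting anything.
BACKUP_ID="c9fe91c7-117c-449e-b7bc-829c1bc8ae46"
WORK_DIR="/var/lib/apache-ignite/work"

# 1. On the coordinator: drop the dead node from the baseline.
echo "control.sh --baseline remove ${BACKUP_ID} --yes"

# 2. On the backup host, with the node stopped: wipe its persistent state.
echo "rm -rf ${WORK_DIR}/db ${WORK_DIR}/binary_meta ${WORK_DIR}/marshaller"

# 3. Restart the backup node, then on the coordinator re-add it.
echo "control.sh --baseline add ${BACKUP_ID} --yes"
```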


I've looked at the logs: though it does indeed seem that the nodes lose 
sight of each other, I can't find the reason for the split. I thought 
the whole point of a baseline was to keep the nodes' configuration 
together even when one or the other becomes unhealthy for a while...

Below I have included logs from the coordinator node, from the backup 
node, and from the application that connects to Ignite (a Java-based app 
connecting in client mode).

You can see that at 04:06:53 the app warns of a failed node (the 
consistent ID points to the backup node), then a few seconds later spews 
an IgniteException followed by some thread exceptions.

At exactly the same moment (04:06:53) the coordinator node warns of a 
"Timed out waiting for message delivery receipt" followed by a "Failed 
to send message to next node". It then moves to topology version 6...

Something similar happens on the "backup" node at about the same time 
(for some reason the clock in that server's logs is two hours behind: it 
shows 2:00:00 when it is actually 4:00:00). It ends up moving to its 
own topology version 7.
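For what it's worth, the app log shows a JVM pause of over 15 seconds 
right at that moment, and both servers warn about missed delivery 
receipts, so one thing I'm considering trying (just a sketch on my side, 
and the 30-second value is my own guess, not something from the docs) is 
raising the failure detection timeout above the observed pause length:

```xml
<!-- Sketch only: failureDetectionTimeout defaults to 10 s; 30000 ms is a
     guess meant to outlast the ~15 s JVM pauses seen in the logs. -->
<bean class="org.apache.ignite.configuration.IgniteConfiguration">
    <property name="failureDetectionTimeout" value="30000"/>
</bean>
```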

These three logs have been appended at the bottom of this email.


So I have two questions here:

1) When this happens, is there any way to "roll back" to topology 
version 5, which included both nodes and worked like a charm? Am I 
condemned to wiping out the offending node?

2) What on earth might be causing this? I'm pretty sure Ignite works 
great, otherwise ING wouldn't be using it. So I guess it must be 
something on my side? Any ideas?

Any light or suggestion you could bring me on this matter would be truly 
appreciated.

Thanks,

Diego


*** nodes configuration ***


<?xml version="1.0" encoding="UTF-8" ?>

<beans xmlns="http://www.springframework.org/schema/beans"
       xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
       xsi:schemaLocation="
       http://www.springframework.org/schema/beans
       http://www.springframework.org/schema/beans/spring-beans.xsd">

    <bean class="org.apache.ignite.configuration.IgniteConfiguration">

        <property name="igniteInstanceName" value="production-ignite"/>
        <property name="metricsLogFrequency" value="0"/>
        <property name="clientMode" value="false"/>

        <property name="sslContextFactory">
            <bean class="org.apache.ignite.ssl.SslContextFactory">
                <property name="keyStoreFilePath" value="/etc/apache-ignite/node.jks"/>
                <property name="keyStorePassword" value="************"/>
                <property name="trustStoreFilePath" value="/etc/apache-ignite/trust.jks"/>
                <property name="trustStorePassword" value="************"/>
                <property name="protocol" value="TLSv1.2"/>
            </bean>
        </property>

        <property name="binaryConfiguration">
            <bean class="org.apache.ignite.configuration.BinaryConfiguration">
                <property name="idMapper">
                    <bean class="org.apache.ignite.binary.BinaryBasicIdMapper">
                        <property name="lowerCase" value="false"/>
                    </bean>
                </property>
                <property name="nameMapper">
                    <bean class="org.apache.ignite.binary.BinaryBasicNameMapper"/>
                </property>
            </bean>
        </property>

        <property name="dataStorageConfiguration">
            <bean class="org.apache.ignite.configuration.DataStorageConfiguration">

                <property name="defaultDataRegionConfiguration">
                    <bean class="org.apache.ignite.configuration.DataRegionConfiguration">
                        <property name="persistenceEnabled" value="true"/>
                    </bean>
                </property>

                <property name="dataRegionConfigurations">
                    <list>
                        <bean class="org.apache.ignite.configuration.DataRegionConfiguration">
                            <property name="name" value="Srp6aRegion"/>
                            <property name="persistenceEnabled" value="false"/>
                        </bean>

                        <bean class="org.apache.ignite.configuration.DataRegionConfiguration">
                            <property name="name" value="NotificationsRegion"/>
                            <property name="persistenceEnabled" value="false"/>
                            <property name="pageEvictionMode" value="RANDOM_2_LRU"/>
                            <property name="initialSize" value="#{10L * 1024 * 1024}"/><!-- 10 MB -->
                            <property name="maxSize" value="#{500L * 1024 * 1024}"/><!-- 500 MB -->
                        </bean>
                    </list>
                </property>
            </bean>
        </property>

        <property name="communicationSpi">
            <bean class="org.apache.ignite.spi.communication.tcp.TcpCommunicationSpi">
                <property name="localPort" value="47100"/>
                <property name="localPortRange" value="3"/>
            </bean>
        </property>

        <property name="discoverySpi">
            <bean class="org.apache.ignite.spi.discovery.tcp.TcpDiscoverySpi">
                <property name="localPort" value="47500"/>
                <property name="localPortRange" value="3"/>
                <property name="ipFinder">
                    <bean class="org.apache.ignite.spi.discovery.tcp.ipfinder.vm.TcpDiscoveryVmIpFinder">
                        <property name="addresses">
                            <list>
                                <value>127.0.0.1:47500..47501</value>
                                <value>coordinator.ip.address</value>
                                <value>backup.ip.address</value>
                            </list>
                        </property>
                    </bean>
                </property>
            </bean>
        </property>
    </bean>
</beans>


*** Application logs ***


2021-08-18 04:06:53.977  WARN 7663 --- 
[disco-event-worker-#56%local-client%] 
o.a.i.i.m.d.GridDiscoveryManager         : Node FAILED: TcpDiscoveryNode 
[id=703bd274-96ea-4567-8c2c-215efd859a2e, 
consistentId=c9fe91c7-117c-449e-b7bc-829c1bc8ae46, addrs=ArrayList 
[0:0:0:0:0:0:0:1%lo, 127.0.0.1, backup.ip.address], sockAddrs=HashSet 
[/0:0:0:0:0:0:0:1%lo:47500, /127.0.0.1:47500, 
205.ip-54-38-190.eu/backup.ip.address:47500], discPort=47500, order=5, 
intOrder=4, lastExchangeTime=1628259536802, loc=false, 
ver=2.10.0#20210310-sha1:bc24f6ba, isClient=false]
2021-08-18 04:07:10.160  WARN 7663 --- [jvm-pause-detector-worker] 
o.a.i.i.IgniteKernal%local-client        : Possible too long JVM pause: 
15328 milliseconds.
2021-08-18 04:07:10.158 ERROR 7663 --- 
[tcp-client-disco-msg-worker-#4%local-client%-#42%local-client%] 
o.apache.ignite.internal.util.typedef.G  : Blocked system-critical 
thread has been detected. This can lead to cluster-wide undefined 
behaviour [workerName=partition-exchanger, 
threadName=exchange-worker-#57%local-client%, blockedFor=15s]
2021-08-18 04:07:10.200  WARN 7663 --- 
[tcp-client-disco-msg-worker-#4%local-client%-#42%local-client%] 
ROOT                                     : Possible failure suppressed 
accordingly to a configured handler [hnd=StopNodeOrHaltFailureHandler 
[tryStop=false, timeout=0, super=AbstractFailureHandler 
[ignoredFailureTypes=UnmodifiableSet [SYSTEM_WORKER_BLOCKED, 
SYSTEM_CRITICAL_OPERATION_TIMEOUT]]], failureCtx=FailureContext 
[type=SYSTEM_WORKER_BLOCKED, err=class o.a.i.IgniteException: GridWorker 
[name=partition-exchanger, igniteInstanceName=local-client, 
finished=false, heartbeatTs=1629252430180]]]

org.apache.ignite.IgniteException: GridWorker [name=partition-exchanger, 
igniteInstanceName=local-client, finished=false, heartbeatTs=1629252430180]
     at java.util.Arrays.fill(Arrays.java:2846) ~[na:1.8.0_292]
     at 
org.apache.ignite.internal.processors.cache.distributed.dht.preloader.CachePartitionFullCountersMap.clear(CachePartitionFullCountersMap.java:99) 
~[ignite-core-2.10.0.jar!/:2.10.0]
     at 
org.apache.ignite.internal.processors.cache.distributed.dht.topology.GridDhtPartitionTopologyImpl.beforeExchange(GridDhtPartitionTopologyImpl.java:599) 
~[ignite-core-2.10.0.jar!/:2.10.0]
     at 
org.apache.ignite.internal.processors.cache.distributed.dht.preloader.GridDhtPartitionsExchangeFuture.initTopologies(GridDhtPartitionsExchangeFuture.java:1204) 
~[ignite-core-2.10.0.jar!/:2.10.0]
     at 
org.apache.ignite.internal.processors.cache.distributed.dht.preloader.GridDhtPartitionsExchangeFuture.init(GridDhtPartitionsExchangeFuture.java:1037) 
~[ignite-core-2.10.0.jar!/:2.10.0]
     at 
org.apache.ignite.internal.processors.cache.GridCachePartitionExchangeManager$ExchangeWorker.body0(GridCachePartitionExchangeManager.java:3373) 
~[ignite-core-2.10.0.jar!/:2.10.0]
     at 
org.apache.ignite.internal.processors.cache.GridCachePartitionExchangeManager$ExchangeWorker.body(GridCachePartitionExchangeManager.java:3195) 
~[ignite-core-2.10.0.jar!/:2.10.0]
     at 
org.apache.ignite.internal.util.worker.GridWorker.run(GridWorker.java:120) 
~[ignite-core-2.10.0.jar!/:2.10.0]
     at java.lang.Thread.run(Thread.java:748) ~[na:1.8.0_292]

2021-08-18 04:07:10.218  WARN 7663 --- 
[tcp-client-disco-msg-worker-#4%local-client%-#42%local-client%] 
o.a.i.i.p.failure.FailureProcessor       : No deadlocked threads detected.
2021-08-18 04:07:10.357  WARN 7663 --- 
[tcp-client-disco-msg-worker-#4%local-client%-#42%local-client%] 
o.a.i.i.p.failure.FailureProcessor       : Thread dump at 2021/08/18 
04:07:10 CEST
Thread [name="sys-#23585%local-client%", id=25717, state=TIMED_WAITING, 
blockCnt=0, waitCnt=1]
     Lock 
[object=java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject@418da723, 
ownerName=null, ownerId=-1]
         at sun.misc.Unsafe.park(Native Method)
         at 
java.util.concurrent.locks.LockSupport.parkNanos(LockSupport.java:215)
         at 
java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.awaitNanos(AbstractQueuedSynchronizer.java:2078)
         at 
java.util.concurrent.LinkedBlockingQueue.poll(LinkedBlockingQueue.java:467)
         at 
java.util.concurrent.ThreadPoolExecutor.getTask(ThreadPoolExecutor.java:1073)
         at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1134)
         at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
         at java.lang.Thread.run(Thread.java:748)

Thread [name="sys-#23584%local-client%", id=25716, state=TIMED_WAITING, 
blockCnt=0, waitCnt=1]
     Lock 
[object=java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject@418da723, 
ownerName=null, ownerId=-1]
         at sun.misc.Unsafe.park(Native Method)
         at 
java.util.concurrent.locks.LockSupport.parkNanos(LockSupport.java:215)
         at 
java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.awaitNanos(AbstractQueuedSynchronizer.java:2078)
         at 
java.util.concurrent.LinkedBlockingQueue.poll(LinkedBlockingQueue.java:467)


*** "coordinator" node logs ***


[04:06:53,070][WARNING][tcp-disco-msg-worker-[703bd274 
backup.ip.address:47500 
crd]-#2%production-ignite%-#48%production-ignite%][TcpDiscoverySpi] 
Timed out waiting for message delivery receipt (most probably, the 
reason is in long GC pauses on remote node; consider tuning GC and 
increasing 'ackTimeout' configuration property). Will retry to send 
message with increased timeout [currentTimeout=9490, 
rmtAddr=205.ip-54-38-190.eu/backup.ip.address:47500, rmtPort=47500]
[04:06:53,120][WARNING][tcp-disco-msg-worker-[703bd274 
backup.ip.address:47500 
crd]-#2%production-ignite%-#48%production-ignite%][TcpDiscoverySpi] 
Failed to send message to next node 
[msg=TcpDiscoveryConnectionCheckMessage 
[super=TcpDiscoveryAbstractMessage [sndNodeId=null, 
id=116986c1b71-c79804fa-52e8-457f-88c8-eeb78b3f34fa, 
verifierNodeId=null, topVer=0, pendingIdx=0, failedNodes=null, 
isClient=false]], next=TcpDiscoveryNode 
[id=703bd274-96ea-4567-8c2c-215efd859a2e, 
consistentId=c9fe91c7-117c-449e-b7bc-829c1bc8ae46, addrs=ArrayList 
[0:0:0:0:0:0:0:1%lo, 127.0.0.1, backup.ip.address], sockAddrs=HashSet 
[/0:0:0:0:0:0:0:1%lo:47500, /127.0.0.1:47500, 
205.ip-54-38-190.eu/backup.ip.address:47500], discPort=47500, order=5, 
intOrder=4, lastExchangeTime=1628259536555, loc=false, 
ver=2.10.0#20210310-sha1:bc24f6ba, isClient=false], errMsg=Failed to 
send message to next node [msg=TcpDiscoveryConnectionCheckMessage 
[super=TcpDiscoveryAbstractMessage [sndNodeId=null, 
id=116986c1b71-c79804fa-52e8-457f-88c8-eeb78b3f34fa, 
verifierNodeId=null, topVer=0, pendingIdx=0, failedNodes=null, 
isClient=false]], next=ClusterNode 
[id=703bd274-96ea-4567-8c2c-215efd859a2e, order=5, 
addr=[0:0:0:0:0:0:0:1%lo, 127.0.0.1, backup.ip.address], daemon=false]]]
[04:06:53,132][WARNING][tcp-disco-msg-worker-[703bd274 
backup.ip.address:47500 
crd]-#2%production-ignite%-#48%production-ignite%][TcpDiscoverySpi] 
Local node has detected failed nodes and started cluster-wide procedure. 
To speed up failure detection please see 'Failure Detection' section 
under javadoc for 'TcpDiscoverySpi'
[04:06:53,214][INFO][tcp-disco-sock-reader-[703bd274 
backup.ip.address:45841]-#10%production-ignite%-#185%production-ignite%][TcpDiscoverySpi] 
Finished serving remote node connection 
[rmtAddr=/backup.ip.address:45841, rmtPort=45841
[04:06:53,362][WARNING][disco-event-worker-#55%production-ignite%][GridDiscoveryManager] 
Node FAILED: TcpDiscoveryNode [id=703bd274-96ea-4567-8c2c-215efd859a2e, 
consistentId=c9fe91c7-117c-449e-b7bc-829c1bc8ae46, addrs=ArrayList 
[0:0:0:0:0:0:0:1%lo, 127.0.0.1, backup.ip.address], sockAddrs=HashSet 
[/0:0:0:0:0:0:0:1%lo:47500, /127.0.0.1:47500, 
205.ip-54-38-190.eu/backup.ip.address:47500], discPort=47500, order=5, 
intOrder=4, lastExchangeTime=1628259536555, loc=false, 
ver=2.10.0#20210310-sha1:bc24f6ba, isClient=false]
[04:06:53,444][INFO][disco-event-worker-#55%production-ignite%][GridDiscoveryManager] 
Topology snapshot [ver=6, locNode=c79804fa, servers=1, clients=1, 
state=ACTIVE, CPUs=2, offheap=3.5GB, heap=3.4GB]
[04:06:53,444][INFO][disco-event-worker-#55%production-ignite%][GridDiscoveryManager] 
^-- Baseline [id=11, size=2, online=1, offline=1]
[04:06:53,942][INFO][exchange-worker-#57%production-ignite%][time] 
Started exchange init [topVer=AffinityTopologyVersion [topVer=6, 
minorTopVer=0], crd=true, evt=NODE_FAILED, 
evtNode=703bd274-96ea-4567-8c2c-215efd859a2e, customEvt=null, 
allowMerge=false, exchangeFreeSwitch=true]
[04:06:54,030][INFO][sys-#116790%production-ignite%][GridAffinityAssignmentCache] 
Local node affinity assignment distribution is not ideal 
[cache=UserCache, expectedPrimary=1024.00, actualPrimary=1024, 
expectedBackups=1024.00, actualBackups=0, warningThreshold=50.00%]
[04:06:54,032][INFO][exchange-worker-#57%production-ignite%][GridAffinityAssignmentCache] 
Local node affinity assignment distribution is not ideal 
[cache=ParameterCache, expectedPrimary=1024.00, actualPrimary=1024, 
expectedBackups=1024.00, actualBackups=0, warningThreshold=50.00%]
[04:06:54,035][INFO][sys-#116796%production-ignite%][GridAffinityAssignmentCache] 
Local node affinity assignment distribution is not ideal 
[cache=UserEventCache, expectedPrimary=1024.00, actualPrimary=1024, 
expectedBackups=1024.00, actualBackups=0, warningThreshold=50.00%]
[04:06:54,036][INFO][sys-#116793%production-ignite%][GridAffinityAssignmentCache] 
Local node affinity assignment distribution is not ideal 
[cache=MessageLifeCache, expectedPrimary=1024.00, actualPrimary=1024, 
expectedBackups=1024.00, actualBackups=0, warningThreshold=50.00%]
[04:06:54,038][INFO][sys-#116792%production-ignite%][GridAffinityAssignmentCache] 
Local node affinity assignment distribution is not ideal 
[cache=MessageCache, expectedPrimary=1024.00, actualPrimary=1024, 
expectedBackups=1024.00, actualBackups=0, warningThreshold=50.00%]
[04:06:54,043][INFO][sys-#116791%production-ignite%][GridAffinityAssignmentCache] 
Local node affinity assignment distribution is not ideal 
[cache=CliauthCache, expectedPrimary=1024.00, actualPrimary=1024, 
expectedBackups=1024.00, actualBackups=0, warningThreshold=50.00%]
[04:06:54,092][INFO][exchange-worker-#57%production-ignite%][GridDhtPartitionsExchangeFuture] 
Finished waiting for partition release future 
[topVer=AffinityTopologyVersion [topVer=6, minorTopVer=0], waitTime=0ms, 
futInfo=NA, mode=DISTRIBUTED]
[04:06:54,097][INFO][exchange-worker-#57%production-ignite%][GridDhtPartitionsExchangeFuture] 
Finished waiting for partitions release latch: ServerLatch [permits=0, 
pendingAcks=HashSet [], super=CompletableLatch [id=CompletableLatchUid 
[id=exchange-free-replicated, topVer=AffinityTopologyVersion [topVer=6, 
minorTopVer=0]]]]
[04:06:54,098][INFO][exchange-worker-#57%production-ignite%][GridDhtPartitionsExchangeFuture] 
Finished waiting for partition release future 
[topVer=AffinityTopologyVersion [topVer=6, minorTopVer=0], waitTime=0ms, 
futInfo=NA, mode=DISTRIBUTED]
[04:06:54,098][INFO][exchange-worker-#57%production-ignite%][GridDhtPartitionsExchangeFuture] 
Finished waiting for partitions release latch: ServerLatch [permits=0, 
pendingAcks=HashSet [], super=CompletableLatch [id=CompletableLatchUid 
[id=exchange-free-partitioned, topVer=AffinityTopologyVersion [topVer=6, 
minorTopVer=0]]]]
[04:06:54,240][INFO][exchange-worker-#57%production-ignite%][GridDhtPartitionsExchangeFuture] 
Finish exchange future [startVer=AffinityTopologyVersion [topVer=6, 
minorTopVer=0], resVer=AffinityTopologyVersion [topVer=6, 
minorTopVer=0], err=null, rebalanced=true, wasRebalanced=true]
[04:06:54,308][INFO][exchange-worker-#57%production-ignite%][GridDhtPartitionsExchangeFuture] 
Completed partition exchange 
[localNode=c79804fa-52e8-457f-88c8-eeb78b3f34fa, 
exchange=GridDhtPartitionsExchangeFuture [topVer=AffinityTopologyVersion 
[topVer=6, minorTopVer=0], evt=NODE_FAILED, evtNode=TcpDiscoveryNode 
[id=703bd274-96ea-4567-8c2c-215efd859a2e, 
consistentId=c9fe91c7-117c-449e-b7bc-829c1bc8ae46, addrs=ArrayList 
[0:0:0:0:0:0:0:1%lo, 127.0.0.1, backup.ip.address], sockAddrs=HashSet 
[/0:0:0:0:0:0:0:1%lo:47500, /127.0.0.1:47500, 
205.ip-54-38-190.eu/backup.ip.address:47500], discPort=47500, order=5, 
intOrder=4, lastExchangeTime=1628259536555, loc=false, 
ver=2.10.0#20210310-sha1:bc24f6ba, isClient=false], rebalanced=true, 
done=true, newCrdFut=null], topVer=AffinityTopologyVersion [topVer=6, 
minorTopVer=0]]
[04:06:54,308][INFO][exchange-worker-#57%production-ignite%][GridDhtPartitionsExchangeFuture] 
Exchange timings [startVer=AffinityTopologyVersion [topVer=6, 
minorTopVer=0], resVer=AffinityTopologyVersion [topVer=6, 
minorTopVer=0], stage="Waiting in exchange queue" (39 ms), 
stage="Exchange parameters initialization" (30 ms), stage="Determine 
exchange type" (131 ms), stage="Preloading notification" (5 ms), 
stage="Wait partitions release [latch=exchange-free-replicated]" (9 ms), 
stage="Wait partitions release latch [latch=exchange-free-replicated]" 
(5 ms), stage="Wait partitions release 
[latch=exchange-free-partitioned]" (0 ms), stage="Wait partitions 
release latch [latch=exchange-free-partitioned]" (0 ms), stage="After 
states restored callback" (18 ms), stage="WAL history reservation" (19 
ms), stage="Finalize update counters" (52 ms), stage="Detect lost 
partitions" (57 ms), stage="Exchange done" (61 ms), stage="Total time" 
(426 ms)]
[04:06:54,316][INFO][exchange-worker-#57%production-ignite%][GridDhtPartitionsExchangeFuture] 
Exchange longest local stages [startVer=AffinityTopologyVersion 
[topVer=6, minorTopVer=0], resVer=AffinityTopologyVersion [topVer=6, 
minorTopVer=0], stage="Affinity initialization (exchange-free switch on 
fully-rebalanced topology) [grp=MessageLifeCache]" (108 ms) 
(parent=Determine exchange type), stage="Affinity initialization 
(exchange-free switch on fully-rebalanced topology) [grp=CliauthCache]" 
(105 ms) (parent=Determine exchange type), stage="Affinity 
initialization (exchange-free switch on fully-rebalanced topology) 
[grp=UserEventCache]" (100 ms) (parent=Determine exchange type)]
[04:06:54,357][INFO][exchange-worker-#57%production-ignite%][time] 
Finished exchange init [topVer=AffinityTopologyVersion [topVer=6, 
minorTopVer=0], crd=true]
[04:06:54,473][INFO][exchange-worker-#57%production-ignite%][GridCachePartitionExchangeManager] 
Skipping rebalancing (nothing scheduled) [top=AffinityTopologyVersion 
[topVer=6, minorTopVer=0], force=false, evt=NODE_FAILED, 
node=703bd274-96ea-4567-8c2c-215efd859a2e]


*** "backup" node logs ***
(remember there's a two-hour difference in these logs: 02:06:54 is in 
fact 04:06:54)


[02:06:54,048][WARNING][tcp-disco-msg-worker-[c79804fa 
coordinator.ip.address:47500]-#2%production-ignite%-#55%production-ignite%][TcpDiscoverySpi] 
Timed out waiting for message delivery receipt (most probably, the 
reason is in long GC pauses on remote node; consider tuning GC and 
increasing 'ackTimeout' configuration property). Will retry to send 
message with increased timeout [currentTimeout=9494, 
rmtAddr=vps611671.ovh.net/coordinator.ip.address:47500, rmtPort=47500]
[02:06:54,125][WARNING][tcp-disco-msg-worker-[c79804fa 
coordinator.ip.address:47500]-#2%production-ignite%-#55%production-ignite%][TcpDiscoverySpi] 
Failed to send message to next node 
[msg=TcpDiscoveryConnectionCheckMessage 
[super=TcpDiscoveryAbstractMessage [sndNodeId=null, 
id=4654f4c1b71-703bd274-96ea-4567-8c2c-215efd859a2e, 
verifierNodeId=null, topVer=0, pendingIdx=0, failedNodes=null, 
isClient=false]], next=TcpDiscoveryNode 
[id=c79804fa-52e8-457f-88c8-eeb78b3f34fa, 
consistentId=1defbc7d-427c-4264-b64c-23c8dd6f9277, addrs=ArrayList 
[0:0:0:0:0:0:0:1%lo, 127.0.0.1, coordinator.ip.address], 
sockAddrs=HashSet [vps611671.ovh.net/coordinator.ip.address:47500, 
/0:0:0:0:0:0:0:1%lo:47500, /127.0.0.1:47500], discPort=47500, order=2, 
intOrder=2, lastExchangeTime=1628259538158, loc=false, 
ver=2.10.0#20210310-sha1:bc24f6ba, isClient=false], errMsg=Failed to 
send message to next node [msg=TcpDiscoveryConnectionCheckMessage 
[super=TcpDiscoveryAbstractMessage [sndNodeId=null, 
id=4654f4c1b71-703bd274-96ea-4567-8c2c-215efd859a2e, 
verifierNodeId=null, topVer=0, pendingIdx=0, failedNodes=null, 
isClient=false]], next=ClusterNode 
[id=c79804fa-52e8-457f-88c8-eeb78b3f34fa, order=2, 
addr=[0:0:0:0:0:0:0:1%lo, 127.0.0.1, coordinator.ip.address], 
daemon=false]]]
[02:06:54,130][WARNING][tcp-disco-msg-worker-[c79804fa 
coordinator.ip.address:47500]-#2%production-ignite%-#55%production-ignite%][TcpDiscoverySpi] 
Local node has detected failed nodes and started cluster-wide procedure. 
To speed up failure detection please see 'Failure Detection' section 
under javadoc for 'TcpDiscoverySpi'
[02:06:54,135][WARNING][disco-event-worker-#59%production-ignite%][GridDiscoveryManager] 
Node FAILED: TcpDiscoveryNode [id=c79804fa-52e8-457f-88c8-eeb78b3f34fa, 
consistentId=1defbc7d-427c-4264-b64c-23c8dd6f9277, addrs=ArrayList 
[0:0:0:0:0:0:0:1%lo, 127.0.0.1, coordinator.ip.address], 
sockAddrs=HashSet [vps611671.ovh.net/coordinator.ip.address:47500, 
/0:0:0:0:0:0:0:1%lo:47500, /127.0.0.1:47500], discPort=47500, order=2, 
intOrder=2, lastExchangeTime=1628259538158, loc=false, 
ver=2.10.0#20210310-sha1:bc24f6ba, isClient=false]
[02:06:54,137][INFO][disco-event-worker-#59%production-ignite%][GridDiscoveryManager] 
Topology snapshot [ver=6, locNode=703bd274, servers=1, clients=1, 
state=ACTIVE, CPUs=4, offheap=3.5GB, heap=3.4GB]
[02:06:54,138][INFO][disco-event-worker-#59%production-ignite%][GridDiscoveryManager] 
Coordinator changed [prev=TcpDiscoveryNode 
[id=c79804fa-52e8-457f-88c8-eeb78b3f34fa, 
consistentId=1defbc7d-427c-4264-b64c-23c8dd6f9277, addrs=ArrayList 
[0:0:0:0:0:0:0:1%lo, 127.0.0.1, coordinator.ip.address], 
sockAddrs=HashSet [vps611671.ovh.net/coordinator.ip.address:47500, 
/0:0:0:0:0:0:0:1%lo:47500, /127.0.0.1:47500], discPort=47500, order=2, 
intOrder=2, lastExchangeTime=1628259538158, loc=false, 
ver=2.10.0#20210310-sha1:bc24f6ba, isClient=false], cur=TcpDiscoveryNode 
[id=703bd274-96ea-4567-8c2c-215efd859a2e, 
consistentId=c9fe91c7-117c-449e-b7bc-829c1bc8ae46, addrs=ArrayList 
[0:0:0:0:0:0:0:1%lo, 127.0.0.1, backup.ip.address], sockAddrs=HashSet 
[/0:0:0:0:0:0:0:1%lo:47500, /127.0.0.1:47500, 
205.ip-54-38-190.eu/backup.ip.address:47500], discPort=47500, order=5, 
intOrder=4, lastExchangeTime=1629252404041, loc=true, 
ver=2.10.0#20210310-sha1:bc24f6ba, isClient=false]]
[02:06:54,138][INFO][disco-event-worker-#59%production-ignite%][GridDiscoveryManager] 
^-- Baseline [id=11, size=2, online=1, offline=1]
[02:06:54,149][INFO][disco-event-worker-#59%production-ignite%][MvccProcessorImpl] 
Assigned mvcc coordinator [crd=MvccCoordinator 
[topVer=AffinityTopologyVersion [topVer=6, minorTopVer=0], 
nodeId=703bd274-96ea-4567-8c2c-215efd859a2e, ver=1628257415621, 
local=true, initialized=false]]
[02:06:54,167][INFO][sys-#122238%production-ignite%][ExchangeLatchManager] 
Become new coordinator 703bd274-96ea-4567-8c2c-215efd859a2e
[02:06:54,168][INFO][exchange-worker-#61%production-ignite%][time] 
Started exchange init [topVer=AffinityTopologyVersion [topVer=6, 
minorTopVer=0], crd=true, evt=NODE_FAILED, 
evtNode=c79804fa-52e8-457f-88c8-eeb78b3f34fa, customEvt=null, 
allowMerge=false, exchangeFreeSwitch=true]
[02:06:54,170][INFO][exchange-worker-#61%production-ignite%][GridAffinityAssignmentCache] 
Local node affinity assignment distribution is not ideal 
[cache=CliauthCache, expectedPrimary=1024.00, actualPrimary=1024, 
expectedBackups=1024.00, actualBackups=0, warningThreshold=50.00%]
[02:06:54,171][INFO][sys-#122239%production-ignite%][GridAffinityAssignmentCache] 
Local node affinity assignment distribution is not ideal 
[cache=MessageCache, expectedPrimary=1024.00, actualPrimary=1024, 
expectedBackups=1024.00, actualBackups=0, warningThreshold=50.00%]
[02:06:54,172][INFO][exchange-worker-#61%production-ignite%][GridAffinityAssignmentCache] 
Local node affinity assignment distribution is not ideal 
[cache=MessageLifeCache, expectedPrimary=1024.00, actualPrimary=1024, 
expectedBackups=1024.00, actualBackups=0, warningThreshold=50.00%]
[02:06:54,175][INFO][exchange-worker-#61%production-ignite%][GridAffinityAssignmentCache] 
Local node affinity assignment distribution is not ideal 
[cache=UserEventCache, expectedPrimary=1024.00, actualPrimary=1024, 
expectedBackups=1024.00, actualBackups=0, warningThreshold=50.00%]
[02:06:54,177][INFO][sys-#122240%production-ignite%][GridAffinityAssignmentCache] 
Local node affinity assignment distribution is not ideal 
[cache=UserCache, expectedPrimary=1024.00, actualPrimary=1024, 
expectedBackups=1024.00, actualBackups=0, warningThreshold=50.00%]
[02:06:54,185][INFO][sys-#122239%production-ignite%][GridAffinityAssignmentCache] 
Local node affinity assignment distribution is not ideal 
[cache=ParameterCache, expectedPrimary=1024.00, actualPrimary=1024, 
expectedBackups=1024.00, actualBackups=0, warningThreshold=50.00%]
[02:06:54,195][INFO][exchange-worker-#61%production-ignite%][GridDhtPartitionsExchangeFuture] 
Finished waiting for partition release future 
[topVer=AffinityTopologyVersion [topVer=6, minorTopVer=0], waitTime=0ms, 
futInfo=NA, mode=DISTRIBUTED]
[02:06:54,196][INFO][exchange-worker-#61%production-ignite%][GridDhtPartitionsExchangeFuture] 
Finished waiting for partitions release latch: ServerLatch [permits=0, 
pendingAcks=HashSet [], super=CompletableLatch [id=CompletableLatchUid 
[id=exchange-free-replicated, topVer=AffinityTopologyVersion [topVer=6, 
minorTopVer=0]]]]
[02:06:54,197][INFO][exchange-worker-#61%production-ignite%][GridDhtPartitionsExchangeFuture] 
Finished waiting for partition release future 
[topVer=AffinityTopologyVersion [topVer=6, minorTopVer=0], waitTime=0ms, 
futInfo=NA, mode=DISTRIBUTED]
[02:06:54,197][INFO][exchange-worker-#61%production-ignite%][GridDhtPartitionsExchangeFuture] 
Finished waiting for partitions release latch: ServerLatch [permits=0, 
pendingAcks=HashSet [], super=CompletableLatch [id=CompletableLatchUid 
[id=exchange-free-partitioned, topVer=AffinityTopologyVersion [topVer=6, 
minorTopVer=0]]]]
[02:06:54,229][INFO][exchange-worker-#61%production-ignite%][GridDhtPartitionsExchangeFuture] 
Finish exchange future [startVer=AffinityTopologyVersion [topVer=6, 
minorTopVer=0], resVer=AffinityTopologyVersion [topVer=6, 
minorTopVer=0], err=null, rebalanced=true, wasRebalanced=true]
[02:06:54,242][INFO][exchange-worker-#61%production-ignite%][GridDhtPartitionsExchangeFuture] 
Completed partition exchange 
[localNode=703bd274-96ea-4567-8c2c-215efd859a2e, 
exchange=GridDhtPartitionsExchangeFuture [topVer=AffinityTopologyVersion 
[topVer=6, minorTopVer=0], evt=NODE_FAILED, evtNode=TcpDiscoveryNode 
[id=c79804fa-52e8-457f-88c8-eeb78b3f34fa, 
consistentId=1defbc7d-427c-4264-b64c-23c8dd6f9277, addrs=ArrayList 
[0:0:0:0:0:0:0:1%lo, 127.0.0.1, coordinator.ip.address], 
sockAddrs=HashSet [vps611671.ovh.net/coordinator.ip.address:47500, 
/0:0:0:0:0:0:0:1%lo:47500, /127.0.0.1:47500], discPort=47500, order=2, 
intOrder=2, lastExchangeTime=1628259538158, loc=false, 
ver=2.10.0#20210310-sha1:bc24f6ba, isClient=false], rebalanced=true, 
done=true, newCrdFut=null], topVer=AffinityTopologyVersion [topVer=6, 
minorTopVer=0]]
[02:06:54,242][INFO][exchange-worker-#61%production-ignite%][GridDhtPartitionsExchangeFuture] 
Exchange timings [startVer=AffinityTopologyVersion [topVer=6, 
minorTopVer=0], resVer=AffinityTopologyVersion [topVer=6, 
minorTopVer=0], stage="Waiting in exchange queue" (0 ms), 
stage="Exchange parameters initialization" (0 ms), stage="Determine 
exchange type" (18 ms), stage="Preloading notification" (0 ms), 
stage="Wait partitions release [latch=exchange-free-replicated]" (8 ms), 
stage="Wait partitions release latch [latch=exchange-free-replicated]" 
(1 ms), stage="Wait partitions release 
[latch=exchange-free-partitioned]" (0 ms), stage="Wait partitions 
release latch [latch=exchange-free-partitioned]" (0 ms), stage="After 
states restored callback" (0 ms), stage="WAL history reservation" (3 
ms), stage="Finalize update counters" (10 ms), stage="Detect lost 
partitions" (25 ms), stage="Exchange done" (6 ms), stage="Total time" 
(71 ms)]
[02:06:54,242][INFO][exchange-worker-#61%production-ignite%][GridDhtPartitionsExchangeFuture] 
Exchange longest local stages [startVer=AffinityTopologyVersion 
[topVer=6, minorTopVer=0], resVer=AffinityTopologyVersion [topVer=6, 
minorTopVer=0], stage="Affinity initialization (exchange-free switch on 
fully-rebalanced topology) [grp=ParameterCache]" (13 ms) 
(parent=Determine exchange type), stage="Affinity initialization 
(exchange-free switch on fully-rebalanced topology) [grp=UserCache]" (10 
ms) (parent=Determine exchange type), stage="Affinity initialization 
(exchange-free switch on fully-rebalanced topology) 
[grp=UserEventCache]" (9 ms) (parent=Determine exchange type)]
[02:06:54,243][INFO][exchange-worker-#61%production-ignite%][time] 
Finished exchange init [topVer=AffinityTopologyVersion [topVer=6, 
minorTopVer=0], crd=true]
[02:06:54,253][INFO][exchange-worker-#61%production-ignite%][GridCachePartitionExchangeManager] 
Skipping rebalancing (nothing scheduled) [top=AffinityTopologyVersion 
[topVer=6, minorTopVer=0], force=false, evt=NODE_FAILED, 
node=c79804fa-52e8-457f-88c8-eeb78b3f34fa]
[02:07:14,182][WARNING][tcp-disco-msg-worker-[c79804fa 
coordinator.ip.address:47500]-#2%production-ignite%-#55%production-ignite%][TcpDiscoverySpi] 
Failing client node due to not receiving metrics updates from client 
node within 'IgniteConfiguration.clientFailureDetectionTimeout' 
(consider increasing configuration property) [timeout=30000, 
node=TcpDiscoveryNode [id=40d8cfb1-a6ff-4ffe-8c46-9816e8a2e6ad, 
consistentId=40d8cfb1-a6ff-4ffe-8c46-9816e8a2e6ad, addrs=ArrayList 
[0:0:0:0:0:0:0:1%lo, 127.0.0.1, coordinator.ip.address], 
sockAddrs=HashSet [/0:0:0:0:0:0:0:1%lo:0, /127.0.0.1:0, 
vps611671.ovh.net/coordinator.ip.address:0], discPort=0, order=3, 
intOrder=3, lastExchangeTime=1628259538158, loc=false, 
ver=2.10.0#20210310-sha1:bc24f6ba, isClient=true]]
[02:07:14,183][WARNING][disco-event-worker-#59%production-ignite%][GridDiscoveryManager] 
Node FAILED: TcpDiscoveryNode [id=40d8cfb1-a6ff-4ffe-8c46-9816e8a2e6ad, 
consistentId=40d8cfb1-a6ff-4ffe-8c46-9816e8a2e6ad, addrs=ArrayList 
[0:0:0:0:0:0:0:1%lo, 127.0.0.1, coordinator.ip.address], 
sockAddrs=HashSet [/0:0:0:0:0:0:0:1%lo:0, /127.0.0.1:0, 
vps611671.ovh.net/coordinator.ip.address:0], discPort=0, order=3, 
intOrder=3, lastExchangeTime=1628259538158, loc=false, 
ver=2.10.0#20210310-sha1:bc24f6ba, isClient=true]
[02:07:14,183][INFO][disco-event-worker-#59%production-ignite%][GridDiscoveryManager] 
Topology snapshot [ver=7, locNode=703bd274, servers=1, clients=0, 
state=ACTIVE, CPUs=2, offheap=3.5GB, heap=1.7GB]
[02:07:14,183][INFO][disco-event-worker-#59%production-ignite%][GridDiscoveryManager] 
^-- Baseline [id=11, size=2, online=1, offline=1]
[02:07:14,189][INFO][exchange-worker-#61%production-ignite%][time] 
Started exchange init [topVer=AffinityTopologyVersion [topVer=7, 
minorTopVer=0], crd=true, evt=NODE_FAILED, 
evtNode=40d8cfb1-a6ff-4ffe-8c46-9816e8a2e6ad, customEvt=null, 
allowMerge=true, exchangeFreeSwitch=false]
[02:07:14,190][INFO][exchange-worker-#61%production-ignite%][GridDhtPartitionsExchangeFuture] 
Finish exchange future [startVer=AffinityTopologyVersion [topVer=7, 
minorTopVer=0], resVer=AffinityTopologyVersion [topVer=7, 
minorTopVer=0], err=null, rebalanced=true, wasRebalanced=true]
[02:07:14,192][INFO][exchange-worker-#61%production-ignite%][GridDhtPartitionsExchangeFuture] 
Completed partition exchange 
[localNode=703bd274-96ea-4567-8c2c-215efd859a2e, 
exchange=GridDhtPartitionsExchangeFuture [topVer=AffinityTopologyVersion 
[topVer=7, minorTopVer=0], evt=NODE_FAILED, evtNode=TcpDiscoveryNode 
[id=40d8cfb1-a6ff-4ffe-8c46-9816e8a2e6ad, 
consistentId=40d8cfb1-a6ff-4ffe-8c46-9816e8a2e6ad, addrs=ArrayList 
[0:0:0:0:0:0:0:1%lo, 127.0.0.1, coordinator.ip.address], 
sockAddrs=HashSet [/0:0:0:0:0:0:0:1%lo:0, /127.0.0.1:0, 
vps611671.ovh.net/coordinator.ip.address:0], discPort=0, order=3, 
intOrder=3, lastExchangeTime=1628259538158, loc=false, 
ver=2.10.0#20210310-sha1:bc24f6ba, isClient=true], rebalanced=true, 
done=true, newCrdFut=null], topVer=AffinityTopologyVersion [topVer=7, 
minorTopVer=0]]
[02:07:14,192][INFO][exchange-worker-#61%production-ignite%][GridDhtPartitionsExchangeFuture] 
Exchange timings [startVer=AffinityTopologyVersion [topVer=7, 
minorTopVer=0], resVer=AffinityTopologyVersion [topVer=7, 
minorTopVer=0], stage="Waiting in exchange queue" (0 ms), 
stage="Exchange parameters initialization" (0 ms), stage="Determine 
exchange type" (0 ms), stage="Exchange done" (2 ms), stage="Total time" 
(2 ms)]
[02:07:14,192][INFO][exchange-worker-#61%production-ignite%][GridDhtPartitionsExchangeFuture] 
Exchange longest local stages [startVer=AffinityTopologyVersion 
[topVer=7, minorTopVer=0], resVer=AffinityTopologyVersion [topVer=7, 
minorTopVer=0]]
[02:07:14,192][INFO][exchange-worker-#61%production-ignite%][time] 
Finished exchange init [topVer=AffinityTopologyVersion [topVer=7, 
minorTopVer=0], crd=true]
[02:07:14,195][INFO][exchange-worker-#61%production-ignite%][GridCachePartitionExchangeManager] 
Skipping rebalancing (no affinity changes) [top=AffinityTopologyVersion 
[topVer=7, minorTopVer=0], evt=NODE_FAILED, 
evtNode=40d8cfb1-a6ff-4ffe-8c46-9816e8a2e6ad, client=false]