Posted to user@ignite.apache.org by wentat <we...@rakuten.com> on 2020/02/13 07:10:17 UTC

Null Pointer Error in GridDhtPartitionsExchangeFuture

Hi all, I am evaluating Ignite 2.7 failover scenarios. We are testing 3
different scenarios:
1. Swap rebalance - kill a node, then add a new node
2. Scale up - add a new node
3. Scale down - kill a node

I have a cluster of 30 nodes holding a huge dataset of 450 million items.

Test 1

In scenario 1:
I started node 31 and killed node 1. Node 31 was not in the baseline
topology, but it shares the same XML file, so the cluster detected it. I then
used `control.sh --baseline remove` on node 1, which was offline, and added
node 31, which was outside the original topology. This step worked fine.
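
For reference, the same baseline change can also be done programmatically.
Below is a minimal sketch of what I understand the equivalent IgniteCluster
calls to be (the class name and config path are placeholders of mine):

```
// Hypothetical sketch: reset the baseline to the currently alive server
// nodes, which removes dead node 1 and adds new node 31 in one step.
import java.util.ArrayList;
import java.util.Collection;

import org.apache.ignite.Ignite;
import org.apache.ignite.Ignition;
import org.apache.ignite.cluster.BaselineNode;

public class BaselineSwap {
    public static void main(String[] args) {
        Ignition.setClientMode(true); // connect as a client, not a server

        try (Ignite ignite = Ignition.start("ignite-sql.xml")) {
            // ClusterNode implements BaselineNode, so the live server set
            // can be passed directly as the new baseline.
            Collection<BaselineNode> newBaseline =
                new ArrayList<>(ignite.cluster().forServers().nodes());

            ignite.cluster().setBaselineTopology(newBaseline);
        }
    }
}
```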

In scenario 2:
I started node 1 and added it back to the cluster via the steps above; then
suddenly 3 other nodes in the cluster crashed. The reason could be that I did
not remove the old work directory on node 1. Anyway, the stack traces I got
from the crashed servers are:

```
java.lang.NullPointerException
    at
org.apache.ignite.internal.processors.cache.distributed.dht.preloader.GridDhtPartitionsExchangeFuture.cacheGroupAddedOnExchange(GridDhtPartitionsExchangeFuture.java:492)
    at
org.apache.ignite.internal.processors.cache.CacheAffinitySharedManager$14.applyx(CacheAffinitySharedManager.java:1598)
    at
org.apache.ignite.internal.processors.cache.CacheAffinitySharedManager$14.applyx(CacheAffinitySharedManager.java:1590)
    at
org.apache.ignite.internal.processors.cache.CacheAffinitySharedManager.forAllRegisteredCacheGroups(CacheAffinitySharedManager.java:1206)
    at
org.apache.ignite.internal.processors.cache.CacheAffinitySharedManager.onReassignmentEnforced(CacheAffinitySharedManager.java:1590)
    at
org.apache.ignite.internal.processors.cache.CacheAffinitySharedManager.onServerLeftWithExchangeMergeProtocol(CacheAffinitySharedManager.java:1546)
    at
org.apache.ignite.internal.processors.cache.distributed.dht.preloader.GridDhtPartitionsExchangeFuture.finishExchangeOnCoordinator(GridDhtPartitionsExchangeFuture.java:3239)
    at
org.apache.ignite.internal.processors.cache.distributed.dht.preloader.GridDhtPartitionsExchangeFuture.onAllReceived(GridDhtPartitionsExchangeFuture.java:3191)
    at
org.apache.ignite.internal.processors.cache.distributed.dht.preloader.GridDhtPartitionsExchangeFuture.onBecomeCoordinator(GridDhtPartitionsExchangeFuture.java:4559)
    at
org.apache.ignite.internal.processors.cache.distributed.dht.preloader.GridDhtPartitionsExchangeFuture.access$3500(GridDhtPartitionsExchangeFuture.java:139)
    at
org.apache.ignite.internal.processors.cache.distributed.dht.preloader.GridDhtPartitionsExchangeFuture$9$1$1.apply(GridDhtPartitionsExchangeFuture.java:4331)
    at
org.apache.ignite.internal.processors.cache.distributed.dht.preloader.GridDhtPartitionsExchangeFuture$9$1$1.apply(GridDhtPartitionsExchangeFuture.java:4320)
    at
org.apache.ignite.internal.util.future.GridFutureAdapter.notifyListener(GridFutureAdapter.java:385)
    at
org.apache.ignite.internal.util.future.GridFutureAdapter.listen(GridFutureAdapter.java:355)
    at
org.apache.ignite.internal.processors.cache.distributed.dht.preloader.GridDhtPartitionsExchangeFuture$9$1.call(GridDhtPartitionsExchangeFuture.java:4320)
    at
org.apache.ignite.internal.processors.cache.distributed.dht.preloader.GridDhtPartitionsExchangeFuture$9$1.call(GridDhtPartitionsExchangeFuture.java:4316)
    at
org.apache.ignite.internal.util.IgniteUtils.wrapThreadLoader(IgniteUtils.java:6816)
    at
org.apache.ignite.internal.processors.closure.GridClosureProcessor$2.body(GridClosureProcessor.java:967)
    at
org.apache.ignite.internal.util.worker.GridWorker.run(GridWorker.java:120)
    at
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    at
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    at java.lang.Thread.run(Thread.java:748)
```

and

```
class org.apache.ignite.internal.cluster.ClusterTopologyCheckedException:
Failed to send message (node left topology): TcpDiscoveryNode
[id=c6cd8563-ca40-4563-8dc0-4626c0c8111e,
addrs=[100.74.26.173, 127.0.0.1], sockAddrs=[/127.0.0.1:47500,
someip:47500],
discPort=47500, order=12, intOrder=12, lastExchangeTime=1581324395969,
loc=false, ver=2.7.0#20181201-sha1:256ae401, isClient=false]
    at
org.apache.ignite.spi.communication.tcp.TcpCommunicationSpi.createTcpClient(TcpCommunicationSpi.java:3270)
    at
org.apache.ignite.spi.communication.tcp.TcpCommunicationSpi.createNioClient(TcpCommunicationSpi.java:2987)
    at
org.apache.ignite.spi.communication.tcp.TcpCommunicationSpi.reserveClient(TcpCommunicationSpi.java:2870)
    at
org.apache.ignite.spi.communication.tcp.TcpCommunicationSpi.sendMessage0(TcpCommunicationSpi.java:2713)
    at
org.apache.ignite.spi.communication.tcp.TcpCommunicationSpi.sendMessage(TcpCommunicationSpi.java:2672)
    at
org.apache.ignite.internal.managers.communication.GridIoManager.send(GridIoManager.java:1656)
    at
org.apache.ignite.internal.managers.communication.GridIoManager.sendOrderedMessage(GridIoManager.java:1766)
    at
org.apache.ignite.internal.processors.cache.GridCacheIoManager.sendOrderedMessage(GridCacheIoManager.java:1231)
    at
org.apache.ignite.internal.processors.cache.distributed.dht.preloader.GridDhtPartitionDemander.handleSupplyMessage(GridDhtPartitionDemander.java:845)
    at
org.apache.ignite.internal.processors.cache.distributed.dht.preloader.GridDhtPreloader.handleSupplyMessage(GridDhtPreloader.java:387)
    at
org.apache.ignite.internal.processors.cache.GridCachePartitionExchangeManager$5.apply(GridCachePartitionExchangeManager.java:418)
    at
org.apache.ignite.internal.processors.cache.GridCachePartitionExchangeManager$5.apply(GridCachePartitionExchangeManager.java:408)
    at
org.apache.ignite.internal.processors.cache.GridCacheIoManager.processMessage(GridCacheIoManager.java:1056)
    at
org.apache.ignite.internal.processors.cache.GridCacheIoManager.onMessage0(GridCacheIoManager.java:581)
    at
org.apache.ignite.internal.processors.cache.GridCacheIoManager.access$700(GridCacheIoManager.java:101)
    at
org.apache.ignite.internal.processors.cache.GridCacheIoManager$OrderedMessageListener.onMessage(GridCacheIoManager.java:1613)
    at
org.apache.ignite.internal.managers.communication.GridIoManager.invokeListener(GridIoManager.java:1569)
    at
org.apache.ignite.internal.managers.communication.GridIoManager.access$4100(GridIoManager.java:127)
    at
org.apache.ignite.internal.managers.communication.GridIoManager$GridCommunicationMessageSet.unwind(GridIoManager.java:2768)
    at
org.apache.ignite.internal.managers.communication.GridIoManager.unwindMessageSet(GridIoManager.java:1529)
    at
org.apache.ignite.internal.managers.communication.GridIoManager.access$4400(GridIoManager.java:127)
    at
org.apache.ignite.internal.managers.communication.GridIoManager$10.run(GridIoManager.java:1498)
    at
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    at
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    at java.lang.Thread.run(Thread.java:748)
[12:36:51] Ignite node stopped OK [uptime=1 day, 18:50:16.868]
```

Test 2

I started the test again.

Scenario 1: I removed node 1 and added node 31; everything seemed OK.

Scenario 2: I added node 1 back after *removing all data files on node 1*;
all seemed to be fine.

Scenario 3: I tried to remove node 31; 2 nodes went down and I encountered a
new error:

```
Locked synchronizers:
        java.util.concurrent.ThreadPoolExecutor$Worker@4819bf4
Thread [name="checkpoint-runner-#50", id=74, state=WAITING, blockCnt=28,
waitCnt=5449360]
        at sun.misc.Unsafe.park(Native Method)
        at java.util.concurrent.locks.LockSupport.park(LockSupport.java:304)
        at
o.a.i.i.util.future.GridFutureAdapter.get0(GridFutureAdapter.java:178)
        at
o.a.i.i.util.future.GridFutureAdapter.getUninterruptibly(GridFutureAdapter.java:146)
        at
o.a.i.i.processors.cache.persistence.file.AsyncFileIO.write(AsyncFileIO.java:146)
        at
o.a.i.i.processors.cache.persistence.file.AbstractFileIO$5.run(AbstractFileIO.java:118)
        at
o.a.i.i.processors.cache.persistence.file.AbstractFileIO.fully(AbstractFileIO.java:54)
        at
o.a.i.i.processors.cache.persistence.file.AbstractFileIO.writeFully(AbstractFileIO.java:116)
        at
o.a.i.i.processors.cache.persistence.file.FilePageStore.write(FilePageStore.java:565)
        at
o.a.i.i.processors.cache.persistence.file.FilePageStoreManager.writeInternal(FilePageStoreManager.java:483)
        at
o.a.i.i.processors.cache.persistence.GridCacheDatabaseSharedManager$WriteCheckpointPages.writePages(GridCacheDatabaseSharedManager.java:4207)
        at
o.a.i.i.processors.cache.persistence.GridCacheDatabaseSharedManager$WriteCheckpointPages.run(GridCacheDatabaseSharedManager.java:4101)
        at
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
        at
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
        at java.lang.Thread.run(Thread.java:748)
```

And a whole lot more messages about locked synchronizers, followed by:

```
class org.apache.ignite.IgniteCheckedException: Failed to cache rebalanced
entry (will stop rebalancing) [local=TcpDiscoveryNode
[id=be1978ef-b5c7-4118-b17a-36a65ef1fff6, addrs=[100.74.26.131, 127.0.0.1],
sockAddrs=[someip:47500, /127.0.0.1:47500], discPort=47500, order=41,
intOrder=36, lastExchangeTime=1581563518767, loc=true,
ver=2.7.0#20181201-sha1:256ae401, isClient=false],
node=86b79c0e-e3df-45c9-9a6b-ab5607a41253, key=KeyCacheObjectImpl [part=372,
val=user3974044929057811550, hasValBytes=true], part=372]
        at
org.apache.ignite.internal.processors.cache.distributed.dht.preloader.GridDhtPartitionDemander.preloadEntry(GridDhtPartitionDemander.java:951)
        at
org.apache.ignite.internal.processors.cache.distributed.dht.preloader.GridDhtPartitionDemander.handleSupplyMessage(GridDhtPartitionDemander.java:772)
        at
org.apache.ignite.internal.processors.cache.distributed.dht.preloader.GridDhtPreloader.handleSupplyMessage(GridDhtPreloader.java:387)
        at
org.apache.ignite.internal.processors.cache.GridCachePartitionExchangeManager$5.apply(GridCachePartitionExchangeManager.java:418)
        at
org.apache.ignite.internal.processors.cache.GridCachePartitionExchangeManager$5.apply(GridCachePartitionExchangeManager.java:408)
        at
org.apache.ignite.internal.processors.cache.GridCacheIoManager.processMessage(GridCacheIoManager.java:1056)
        at
org.apache.ignite.internal.processors.cache.GridCacheIoManager.onMessage0(GridCacheIoManager.java:581)
        at
org.apache.ignite.internal.processors.cache.GridCacheIoManager.access$700(GridCacheIoManager.java:101)
        at
org.apache.ignite.internal.processors.cache.GridCacheIoManager$OrderedMessageListener.onMessage(GridCacheIoManager.java:1613)
        at
org.apache.ignite.internal.managers.communication.GridIoManager.invokeListener(GridIoManager.java:1569)
        at
org.apache.ignite.internal.managers.communication.GridIoManager.access$4100(GridIoManager.java:127)
        at
org.apache.ignite.internal.managers.communication.GridIoManager$GridCommunicationMessageSet.unwind(GridIoManager.java:2768)
        at
org.apache.ignite.internal.managers.communication.GridIoManager.unwindMessageSet(GridIoManager.java:1529)
        at
org.apache.ignite.internal.managers.communication.GridIoManager.access$4400(GridIoManager.java:127)
        at
org.apache.ignite.internal.managers.communication.GridIoManager$10.run(GridIoManager.java:1498)
        at
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
        at
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
        at java.lang.Thread.run(Thread.java:748)
Caused by: class org.apache.ignite.internal.NodeStoppingException: Operation
has been cancelled (node is stopping).
        at
org.apache.ignite.internal.processors.query.GridQueryProcessor.store(GridQueryProcessor.java:1861)
        at
org.apache.ignite.internal.processors.cache.query.GridCacheQueryManager.store(GridCacheQueryManager.java:404)
        at
org.apache.ignite.internal.processors.cache.IgniteCacheOffheapManagerImpl$CacheDataStoreImpl.finishUpdate(IgniteCacheOffheapManagerImpl.java:2633)
        at
org.apache.ignite.internal.processors.cache.IgniteCacheOffheapManagerImpl$CacheDataStoreImpl.invoke0(IgniteCacheOffheapManagerImpl.java:1646)
        at
org.apache.ignite.internal.processors.cache.IgniteCacheOffheapManagerImpl$CacheDataStoreImpl.invoke(IgniteCacheOffheapManagerImpl.java:1621)
        at
org.apache.ignite.internal.processors.cache.persistence.GridCacheOffheapManager$GridCacheDataStore.invoke(GridCacheOffheapManager.java:1935)
        at
org.apache.ignite.internal.processors.cache.IgniteCacheOffheapManagerImpl.invoke(IgniteCacheOffheapManagerImpl.java:428)
        at
org.apache.ignite.internal.processors.cache.GridCacheMapEntry.storeValue(GridCacheMapEntry.java:4248)
        at
org.apache.ignite.internal.processors.cache.GridCacheMapEntry.initialValue(GridCacheMapEntry.java:3391)
        at
org.apache.ignite.internal.processors.cache.distributed.dht.preloader.GridDhtPartitionDemander.preloadEntry(GridDhtPartitionDemander.java:902)
        ... 17 more
```

Configurations

30 servers, Ignite 2.7, no clients connected; attached is the XML config
file:  ignite-sql.xml
<http://apache-ignite-users.70518.x6.nabble.com/file/t2779/ignite-sql.xml>  
I didn't define the rebalance mode, so it should default to ASYNC, and the
partition loss policy should default to IGNORE.
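
For clarity, here is a minimal sketch of what making those defaults explicit
would look like in the cache configuration (the cache name "mycache" is a
placeholder):

```
// Hypothetical sketch: spell out the 2.7 defaults relied on above.
import org.apache.ignite.cache.CacheRebalanceMode;
import org.apache.ignite.cache.PartitionLossPolicy;
import org.apache.ignite.configuration.CacheConfiguration;

public class ExplicitCacheDefaults {
    public static CacheConfiguration<String, String> cacheConfig() {
        CacheConfiguration<String, String> ccfg = new CacheConfiguration<>("mycache");

        ccfg.setRebalanceMode(CacheRebalanceMode.ASYNC);         // default rebalance mode
        ccfg.setPartitionLossPolicy(PartitionLossPolicy.IGNORE); // default loss policy

        return ccfg;
    }
}
```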

My questions are:

In general, what are the steps to follow to scale the cluster up/down or
remove nodes? Is `kill -9 <pid>` the right way to kill a node? Do you just run
`control.sh --baseline add <nodeconsistentid>` to add new nodes that are not
in the original baseline topology? How about re-adding nodes that were
previously killed; do we need to remove any files? How long does it take for
the nodes to synchronise? How do we know when a rebalance is completed?

Sorry for the many questions; I am new to Ignite and any help is appreciated!





Re: Null Pointer Error in GridDhtPartitionsExchangeFuture

Posted by Ilya Kasnacheev <il...@gmail.com>.
Hello!

I have no idea; I recommend collecting a heap dump and analyzing it to
locate any leaks. Something must indeed have been happening on the cluster at
that time.

Regards,
-- 
Ilya Kasnacheev



Re: Null Pointer Error in GridDhtPartitionsExchangeFuture

Posted by wentat <we...@rakuten.com>.
Hi Ilya, 

at the time of running the experiments in the logs, there were no queries
running in the background; just 30 servers and no clients. What could have
caused such high heap usage?





Re: Null Pointer Error in GridDhtPartitionsExchangeFuture

Posted by Ilya Kasnacheev <il...@gmail.com>.
Hello!

For example, you can run a SQL query with a large result set, such as SELECT
* without a WHERE clause, which may cause a server node to run out of memory.

Regards,
-- 
Ilya Kasnacheev



Re: Null Pointer Error in GridDhtPartitionsExchangeFuture

Posted by wentat <we...@rakuten.com>.
Hi Ilya,

Thank you for your response. I have checked my client-side load testing
library (I am using YCSB, by the way) and found a potential memory leak.
However, can the client side using too much heap cause the server to fail?
There are no other applications running on the Apache Ignite servers.




Re: Null Pointer Error in GridDhtPartitionsExchangeFuture

Posted by Ilya Kasnacheev <il...@gmail.com>.
Hello!

From this log:

```
[17:19:09,949][WARNING][jvm-pause-detector-worker][IgniteKernal] Possible
too long JVM pause: 1405 milliseconds.
[17:19:12,237][WARNING][jvm-pause-detector-worker][IgniteKernal] Possible
too long JVM pause: 1983 milliseconds.
[17:19:14,416][WARNING][jvm-pause-detector-worker][IgniteKernal] Possible
too long JVM pause: 2029 milliseconds.
[17:19:16,619][WARNING][jvm-pause-detector-worker][IgniteKernal] Possible
too long JVM pause: 2103 milliseconds.
[17:19:18,948][WARNING][jvm-pause-detector-worker][IgniteKernal] Possible
too long JVM pause: 2279 milliseconds.
[17:19:21,217][WARNING][jvm-pause-detector-worker][IgniteKernal] Possible
too long JVM pause: 2219 milliseconds.
[17:19:23,268][WARNING][jvm-pause-detector-worker][IgniteKernal] Possible
too long JVM pause: 2001 milliseconds.
[17:19:25,028][WARNING][jvm-pause-detector-worker][IgniteKernal] Possible
too long JVM pause: 1710 milliseconds.
[17:19:28,814][WARNING][jvm-pause-detector-worker][IgniteKernal] Possible
too long JVM pause: 3736 milliseconds.
[17:19:30,962][WARNING][jvm-pause-detector-worker][IgniteKernal] Possible
too long JVM pause: 2098 milliseconds.
[17:19:32,553][WARNING][jvm-pause-detector-worker][IgniteKernal] Possible
too long JVM pause: 1541 milliseconds.
[17:19:37,938][WARNING][jvm-pause-detector-worker][IgniteKernal] Possible
too long JVM pause: 3837 milliseconds.
[17:19:51,271][WARNING][jvm-pause-detector-worker][IgniteKernal] Possible
too long JVM pause: 13200 milliseconds.
[17:19:57,222][WARNING][jvm-pause-detector-worker][IgniteKernal] Possible
too long JVM pause: 7482 milliseconds.
[17:20:17,384][WARNING][jvm-pause-detector-worker][IgniteKernal] Possible
too long JVM pause: 5832 milliseconds.
[17:20:17,384][SEVERE][exchange-worker-#43][G] Blocked system-critical
thread has been detected. This can lead to cluster-wide undefined behaviour
[threadName=grid-timeout-worker, blockedFor=10s]
[17:20:36,342][WARNING][tcp-disco-msg-worker-#2][TcpDiscoverySpi] Timed out
waiting for message delivery receipt (most probably, the reason is in long
GC pauses on remote node; consider tuning GC and increasing 'ackTimeout'
configuration property). Will retry to send message with increased timeout
[currentTimeout=10000, rmtAddr=server: 2016/redacted_ip:47500,
rmtPort=47500]
[17:20:36,342][INFO][tcp-disco-srvr-#3][TcpDiscoverySpi] TCP discovery
accepted incoming connection [rmtAddr=/redacted_ip, rmtPort=56925]
[17:20:36,342][WARNING][jvm-pause-detector-worker][IgniteKernal] Possible
too long JVM pause: 30741 milliseconds.
[17:20:42,276][SEVERE][nio-acceptor-tcp-rest-#39][GridTcpRestProtocol]
Runtime error caught during grid runnable execution: GridWorker
[name=nio-acceptor-tcp-rest, igniteInstanceName=null, finished=false,
heartbeatTs=1581322824712, hashCode=328613569, interrupted=false,
runner=nio-acceptor-tcp-rest-#39]
java.lang.OutOfMemoryError: GC overhead limit exceeded
```

So, you have plainly run out of heap, and Ignite is likely not to blame,
since Ignite itself does not use a lot of heap.

I recommend collecting heap dumps and searching for leaks in your own code
and usage patterns.
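
For example, a heap dump can be captured programmatically; a minimal sketch,
assuming a HotSpot JVM (the dump path and class name are placeholders):

```
// Hypothetical sketch: capture a heap dump of the current JVM via the
// HotSpot diagnostic MBean (an alternative to jmap or
// -XX:+HeapDumpOnOutOfMemoryError).
import java.lang.management.ManagementFactory;

import com.sun.management.HotSpotDiagnosticMXBean;

public class HeapDumper {
    public static void main(String[] args) throws Exception {
        HotSpotDiagnosticMXBean bean = ManagementFactory.newPlatformMXBeanProxy(
            ManagementFactory.getPlatformMBeanServer(),
            "com.sun.management:type=HotSpotDiagnostic",
            HotSpotDiagnosticMXBean.class);

        // true = dump only live objects (forces a full GC first).
        bean.dumpHeap("/tmp/ignite-heap.hprof", true);
    }
}
```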

Regards,
-- 
Ilya Kasnacheev



Re: Null Pointer Error in GridDhtPartitionsExchangeFuture

Posted by wentat <we...@rakuten.com>.
Hi Ilya, 

Thank you for your reply. I have done this test a few times, and I
consistently get stalling grids during failover/scaling/server swapping.

I have tried tuning some parameters according to the Ignite production
preparation docs
<https://apacheignite.readme.io/docs/preparing-for-production>. I have
increased the heap size to a max of 10 GB, removed metrics logging and set
igcfg.setFailureDetectionTimeout(60000); that is, 60 seconds. However, this
was done after the 2 tries in this thread.
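
In code, the tuning above looks roughly like this (a sketch of my own
settings, with the heap limit left to the JVM options):

```
// Sketch of the tuning described above; the heap limit itself is a JVM
// option (-Xmx10g), not an IgniteConfiguration setting.
import org.apache.ignite.configuration.IgniteConfiguration;

public class TunedConfig {
    public static IgniteConfiguration config() {
        IgniteConfiguration igcfg = new IgniteConfiguration();

        igcfg.setFailureDetectionTimeout(60000); // 60 seconds
        igcfg.setMetricsLogFrequency(0);         // disable periodic metrics logging

        return igcfg;
    }
}
```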

I will try to run it once more and get logs for the whole cluster, including
GC logs, if the problem persists, but it will take some time as I have moved
on to other tests. Meanwhile, here is the original log from my first
experiment. Maybe it gives you a clue.

Once again, thank you for your time on this issue.

crash.log
<http://apache-ignite-users.70518.x6.nabble.com/file/t2779/crash.log>  




Re: Null Pointer Error in GridDhtPartitionsExchangeFuture

Posted by Ilya Kasnacheev <il...@gmail.com>.
Hello!

Better to issue TERM (kill without -9) so that the node can at least
gracefully shut down its file descriptors.
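
For an embedded node, a graceful stop from code looks like the following
minimal sketch; for a standalone node started with ignite.sh, a plain
kill <pid> (SIGTERM) should trigger the same shutdown hook:

```
// Hypothetical sketch: stop a node gracefully instead of kill -9.
import org.apache.ignite.Ignite;
import org.apache.ignite.Ignition;

public class GracefulStop {
    public static void main(String[] args) {
        Ignite ignite = Ignition.start("ignite-sql.xml");

        // ... node runs ...

        // close() stops this instance without cancelling running jobs,
        // letting it release file descriptors and other resources cleanly.
        ignite.close();
    }
}
```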

Otherwise, the first error looks like some one-off bug, and "Operation has
been cancelled (node is stopping)." is self-descriptive and normal.

Unfortunately we would need to take a look at all logs from all nodes to
understand why your grid was stalling.

Regards,
-- 
Ilya Kasnacheev



Re: How to failover/scale cluster in Apache Ignite

Posted by wentat <we...@rakuten.com>.
OK, I'll try to get a reproducer. However, I think it's pretty hard because
the errors seem to be transient, related to failover with a huge dataset
(1 TB plus). My follow-up questions would be:

If kill -9 is not appropriate, what is the graceful way to fail over a node?

For a 1 TB dataset, is 30 nodes a good setup? One node takes about 35 GB of
RAM, but I have given it 46 GB.




Re: How to failover/scale cluster in Apache Ignite

Posted by Vladimir Pligin <vo...@yandex.ru>.
Hi, I'll do my best to help you.

>> Is kill -9 <pid> the right way to kill a node?

No, I don't think this is the right way. 

>> How about re-adding new nodes that were previously killed?

You should clean a node's work directory before re-adding it.
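
A hypothetical helper for that, assuming the default layout where the node
keeps its state under $IGNITE_HOME/work (adjust the path if you configured
workDirectory or storagePath explicitly):

```
// Hypothetical sketch: recursively delete a stopped node's work directory
// so it rejoins the cluster with a clean state. Run only on a stopped node.
import java.io.File;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Comparator;
import java.util.stream.Stream;

public class CleanWorkDir {
    public static void main(String[] args) throws IOException {
        Path work = Paths.get(System.getenv("IGNITE_HOME"), "work");

        try (Stream<Path> paths = Files.walk(work)) {
            paths.sorted(Comparator.reverseOrder()) // children before parents
                 .map(Path::toFile)
                 .forEach(File::delete);
        }
    }
}
```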

>> How long does it take for the nodes to synchronise?

It depends on your network, data volume, disk speed, data storage
configuration, etc.

>> How do we know when a rebalance is completed?

You'll see a message in the log, or you can use the Web Console.
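
Programmatically, you could also listen for rebalance events. A minimal
sketch, assuming EVT_CACHE_REBALANCE_STOPPED is enabled via
setIncludeEventTypes and a placeholder cache name "mycache":

```
// Hypothetical sketch: get notified locally when rebalancing of a cache
// finishes. Requires EVT_CACHE_REBALANCE_STOPPED in setIncludeEventTypes.
import org.apache.ignite.Ignite;
import org.apache.ignite.events.CacheRebalancingEvent;
import org.apache.ignite.events.Event;
import org.apache.ignite.events.EventType;
import org.apache.ignite.lang.IgnitePredicate;

public class RebalanceListener {
    public static void register(Ignite ignite) {
        IgnitePredicate<Event> lsnr = evt -> {
            CacheRebalancingEvent e = (CacheRebalancingEvent) evt;

            if ("mycache".equals(e.cacheName()))
                System.out.println("Rebalance finished for " + e.cacheName());

            return true; // keep listening
        };

        ignite.events().localListen(lsnr, EventType.EVT_CACHE_REBALANCE_STOPPED);
    }
}
```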


By the way, it would be great if you could provide some sort of reproducer to
help us review your scenario.




