You are viewing a plain text version of this content. The canonical link for it is here.
Posted to user@ignite.apache.org by John Smith <ja...@gmail.com> on 2020/11/17 17:51:17 UTC

What is considered high IO wait and partition exchange fiallure?

So if I understand correctly the logs below...

The node that shut off was timing out trying to get partition exchange from
the indicated nodes and it shut itself off correct? Does this mean this
node was also the master?

1- The time indicated in the log is that UTC?
2- I'm trying to see if it was high IO, but the node that stopped had no
high IO.
3- the other nodes had an average peek of about 0.3%-0.5% And that was
around 16:30 EST So not sure if those times match with the log.
4- On super high load days we get up to 7%-9% IO Wait but we never lost a
node on those occasions.
5- The nodes are persistent nodes so I'm guessing on write there's high IO
can we reduce the commit time? but even then I'm not sure it's related. As
even on higher load we don't lose the node.

[04:36:07,771][INFO][node-stopper][GridDhtPartitionsExchangeFuture]
Completed partition exchange
[localNode=8d84b4e9-8c11-4166-a75a-9b3d959ff1fe,
exchange=GridDhtPartitionsExchangeFuture [topVer=AffinityTopologyVersion
[topVer=132, minorTopVer=0], evt=NODE_FAILED, evtNode=TcpDiscoveryNode
[id=e5645874-77e9-4455-8db5-07fa63984276, addrs=[127.0.0.1, xxx.xxx.xxx.5],
sockAddrs=[/xxx.xxx.xxx.5:0, /127.0.0.1:0], discPort=0, order=108,
intOrder=60, lastExchangeTime=1600798045116, loc=false,
ver=2.7.0#20181130-sha1:256ae401, isClient=true], done=true], topVer=null,
durationFromInit=1605328567765]
[04:36:07,771][INFO][node-stopper][GridDhtPartitionsExchangeFuture] Finish
exchange future [startVer=AffinityTopologyVersion [topVer=133,
minorTopVer=0], resVer=null, err=class
org.apache.ignite.internal.IgniteInterruptedCheckedException: Node is
stopping: xxxxxx]
[04:36:07,772][INFO][node-stopper][GridDhtPartitionsExchangeFuture]
Completed partition exchange
[localNode=8d84b4e9-8c11-4166-a75a-9b3d959ff1fe,
exchange=GridDhtPartitionsExchangeFuture [topVer=AffinityTopologyVersion
[topVer=133, minorTopVer=0], evt=NODE_FAILED, evtNode=TcpDiscoveryNode
[id=13f473cb-5441-460b-9b9f-a23c1bcf3b0b, addrs=[127.0.0.1, xxx.xxx.xxx.3],
sockAddrs=[/127.0.0.1:0, /xxx.xxx.xxx.3:0], discPort=0, order=110,
intOrder=61, lastExchangeTime=1600798045127, loc=false,
ver=2.7.0#20181130-sha1:256ae401, isClient=true], done=true], topVer=null,
durationFromInit=1605328567765]
[04:36:07,772][INFO][node-stopper][GridDhtPartitionsExchangeFuture] Finish
exchange future [startVer=AffinityTopologyVersion [topVer=134,
minorTopVer=0], resVer=null, err=class
org.apache.ignite.internal.IgniteInterruptedCheckedException: Node is
stopping: xxxxxx]
[04:36:07,772][INFO][node-stopper][GridDhtPartitionsExchangeFuture]
Completed partition exchange
[localNode=8d84b4e9-8c11-4166-a75a-9b3d959ff1fe,
exchange=GridDhtPartitionsExchangeFuture [topVer=AffinityTopologyVersion
[topVer=134, minorTopVer=0], evt=NODE_FAILED, evtNode=TcpDiscoveryNode
[id=ef9f3750-edfa-474e-8e68-6f0abee5095a, addrs=[127.0.0.1,
xxx.xxx.xxx.10], sockAddrs=[/127.0.0.1:0, /xxx.xxx.xxx.10:0], discPort=0,
order=112, intOrder=62, lastExchangeTime=1600798045137, loc=false,
ver=2.7.0#20181130-sha1:256ae401, isClient=true], done=true], topVer=null,
durationFromInit=1605328567765]
[04:36:07,772][INFO][node-stopper][GridDhtPartitionsExchangeFuture] Finish
exchange future [startVer=AffinityTopologyVersion [topVer=135,
minorTopVer=0], resVer=null, err=class
org.apache.ignite.internal.IgniteInterruptedCheckedException: Node is
stopping: xxxxxx]
[04:36:07,772][INFO][node-stopper][GridDhtPartitionsExchangeFuture]
Completed partition exchange
[localNode=8d84b4e9-8c11-4166-a75a-9b3d959ff1fe,
exchange=GridDhtPartitionsExchangeFuture [topVer=AffinityTopologyVersion
[topVer=135, minorTopVer=0], evt=NODE_FAILED, evtNode=TcpDiscoveryNode
[id=e14e5bea-2784-4a6f-af2f-e98a4bf0ab52, addrs=[127.0.0.1,
xxx.xxx.xxx.63], sockAddrs=[xxxxxx-0001/xxx.xxx.xxx.63:47500, /
127.0.0.1:47500], discPort=47500, order=118, intOrder=65,
lastExchangeTime=1600798763474, loc=false,
ver=2.7.0#20181130-sha1:256ae401, isClient=false], done=true], topVer=null,
durationFromInit=1605328567765]
[04:36:07,772][INFO][node-stopper][GridDhtPartitionsExchangeFuture] Finish
exchange future [startVer=AffinityTopologyVersion [topVer=136,
minorTopVer=0], resVer=null, err=class
org.apache.ignite.internal.IgniteInterruptedCheckedException: Node is
stopping: xxxxxx]
[04:36:07,772][INFO][node-stopper][GridDhtPartitionsExchangeFuture]
Completed partition exchange
[localNode=8d84b4e9-8c11-4166-a75a-9b3d959ff1fe,
exchange=GridDhtPartitionsExchangeFuture [topVer=AffinityTopologyVersion
[topVer=136, minorTopVer=0], evt=NODE_FAILED, evtNode=TcpDiscoveryNode
[id=8facbe9c-6d9f-4c2d-8d9c-5c7f38dbb7da, addrs=[127.0.0.1,
xxx.xxx.xxx.69], sockAddrs=[/127.0.0.1:47500,
xxxxxx-0003/xxx.xxx.xxx.69:47500], discPort=47500, order=120, intOrder=66,
lastExchangeTime=1600799256509, loc=false,
ver=2.7.0#20181130-sha1:256ae401, isClient=false], done=true], topVer=null,
durationFromInit=1605328567765]
[04:36:07,796][INFO][db-checkpoint-thread-#56%xxxxxx%][GridCacheDatabaseSharedManager]
Skipping checkpoint (no pages were modified) [checkpointLockWait=0ms,
checkpointLockHoldTime=14ms, reason='timeout']
[04:36:07,832][INFO][node-stopper][GridCacheProcessor] Stopped cache
[cacheName=carrier-ids-for-phones]
[04:36:07,839][INFO][node-stopper][GridCacheProcessor] Stopped cache
[cacheName=twofa-otp]
[04:36:07,840][INFO][node-stopper][GridCacheProcessor] Stopped cache
[cacheName=SQL_PUBLIC_CARRIER_CODE]
[04:36:07,841][INFO][node-stopper][GridCacheProcessor] Stopped cache
[cacheName=carrier-lookups]
[04:36:07,842][INFO][node-stopper][GridCacheProcessor] Stopped cache
[cacheName=ignite-sys-cache]
[04:36:08,497][INFO][node-stopper][IgniteKernal%xxxxxx]

>>>
+---------------------------------------------------------------------------------+
>>> Ignite ver.
2.7.0#20181130-sha1:256ae4012cb143b4855b598b740a6f3499ead4db stopped OK
>>>
+---------------------------------------------------------------------------------+
>>> Ignite instance name: xxxxxx
>>> Grid uptime: 52 days, 10:25:51.029

Re: What is considered high IO wait and partition exchange fiallure?

Posted by Maxim Muzafarov <mm...@apache.org>.
Hello John,

Can you share the full log and cluster configuration?
For instance, the cluster may fail the node due to unresponsive socket
request if this node had long GC pauses due to high load.

> Does this mean this node was also the master?

Not sure I've got your question. Do you mean the coordinator node?

On Tue, 17 Nov 2020 at 20:52, John Smith <ja...@gmail.com> wrote:
>
> So if I understand correctly the logs below...
>
> The node that shut off was timing out trying to get partition exchange from the indicated nodes and it shut itself off correct? Does this mean this node was also the master?
>
> 1- The time indicated in the log is that UTC?
> 2- I'm trying to see if it was high IO, but the node that stopped had no high IO.
> 3- the other nodes had an average peek of about 0.3%-0.5% And that was around 16:30 EST So not sure if those times match with the log.
> 4- On super high load days we get up to 7%-9% IO Wait but we never lost a node on those occasions.
> 5- The nodes are persistent nodes so I'm guessing on write there's high IO can we reduce the commit time? but even then I'm not sure it's related. As even on higher load we don't lose the node.
>
> [04:36:07,771][INFO][node-stopper][GridDhtPartitionsExchangeFuture] Completed partition exchange [localNode=8d84b4e9-8c11-4166-a75a-9b3d959ff1fe, exchange=GridDhtPartitionsExchangeFuture [topVer=AffinityTopologyVersion [topVer=132, minorTopVer=0], evt=NODE_FAILED, evtNode=TcpDiscoveryNode [id=e5645874-77e9-4455-8db5-07fa63984276, addrs=[127.0.0.1, xxx.xxx.xxx.5], sockAddrs=[/xxx.xxx.xxx.5:0, /127.0.0.1:0], discPort=0, order=108, intOrder=60, lastExchangeTime=1600798045116, loc=false, ver=2.7.0#20181130-sha1:256ae401, isClient=true], done=true], topVer=null, durationFromInit=1605328567765]
> [04:36:07,771][INFO][node-stopper][GridDhtPartitionsExchangeFuture] Finish exchange future [startVer=AffinityTopologyVersion [topVer=133, minorTopVer=0], resVer=null, err=class org.apache.ignite.internal.IgniteInterruptedCheckedException: Node is stopping: xxxxxx]
> [04:36:07,772][INFO][node-stopper][GridDhtPartitionsExchangeFuture] Completed partition exchange [localNode=8d84b4e9-8c11-4166-a75a-9b3d959ff1fe, exchange=GridDhtPartitionsExchangeFuture [topVer=AffinityTopologyVersion [topVer=133, minorTopVer=0], evt=NODE_FAILED, evtNode=TcpDiscoveryNode [id=13f473cb-5441-460b-9b9f-a23c1bcf3b0b, addrs=[127.0.0.1, xxx.xxx.xxx.3], sockAddrs=[/127.0.0.1:0, /xxx.xxx.xxx.3:0], discPort=0, order=110, intOrder=61, lastExchangeTime=1600798045127, loc=false, ver=2.7.0#20181130-sha1:256ae401, isClient=true], done=true], topVer=null, durationFromInit=1605328567765]
> [04:36:07,772][INFO][node-stopper][GridDhtPartitionsExchangeFuture] Finish exchange future [startVer=AffinityTopologyVersion [topVer=134, minorTopVer=0], resVer=null, err=class org.apache.ignite.internal.IgniteInterruptedCheckedException: Node is stopping: xxxxxx]
> [04:36:07,772][INFO][node-stopper][GridDhtPartitionsExchangeFuture] Completed partition exchange [localNode=8d84b4e9-8c11-4166-a75a-9b3d959ff1fe, exchange=GridDhtPartitionsExchangeFuture [topVer=AffinityTopologyVersion [topVer=134, minorTopVer=0], evt=NODE_FAILED, evtNode=TcpDiscoveryNode [id=ef9f3750-edfa-474e-8e68-6f0abee5095a, addrs=[127.0.0.1, xxx.xxx.xxx.10], sockAddrs=[/127.0.0.1:0, /xxx.xxx.xxx.10:0], discPort=0, order=112, intOrder=62, lastExchangeTime=1600798045137, loc=false, ver=2.7.0#20181130-sha1:256ae401, isClient=true], done=true], topVer=null, durationFromInit=1605328567765]
> [04:36:07,772][INFO][node-stopper][GridDhtPartitionsExchangeFuture] Finish exchange future [startVer=AffinityTopologyVersion [topVer=135, minorTopVer=0], resVer=null, err=class org.apache.ignite.internal.IgniteInterruptedCheckedException: Node is stopping: xxxxxx]
> [04:36:07,772][INFO][node-stopper][GridDhtPartitionsExchangeFuture] Completed partition exchange [localNode=8d84b4e9-8c11-4166-a75a-9b3d959ff1fe, exchange=GridDhtPartitionsExchangeFuture [topVer=AffinityTopologyVersion [topVer=135, minorTopVer=0], evt=NODE_FAILED, evtNode=TcpDiscoveryNode [id=e14e5bea-2784-4a6f-af2f-e98a4bf0ab52, addrs=[127.0.0.1, xxx.xxx.xxx.63], sockAddrs=[xxxxxx-0001/xxx.xxx.xxx.63:47500, /127.0.0.1:47500], discPort=47500, order=118, intOrder=65, lastExchangeTime=1600798763474, loc=false, ver=2.7.0#20181130-sha1:256ae401, isClient=false], done=true], topVer=null, durationFromInit=1605328567765]
> [04:36:07,772][INFO][node-stopper][GridDhtPartitionsExchangeFuture] Finish exchange future [startVer=AffinityTopologyVersion [topVer=136, minorTopVer=0], resVer=null, err=class org.apache.ignite.internal.IgniteInterruptedCheckedException: Node is stopping: xxxxxx]
> [04:36:07,772][INFO][node-stopper][GridDhtPartitionsExchangeFuture] Completed partition exchange [localNode=8d84b4e9-8c11-4166-a75a-9b3d959ff1fe, exchange=GridDhtPartitionsExchangeFuture [topVer=AffinityTopologyVersion [topVer=136, minorTopVer=0], evt=NODE_FAILED, evtNode=TcpDiscoveryNode [id=8facbe9c-6d9f-4c2d-8d9c-5c7f38dbb7da, addrs=[127.0.0.1, xxx.xxx.xxx.69], sockAddrs=[/127.0.0.1:47500, xxxxxx-0003/xxx.xxx.xxx.69:47500], discPort=47500, order=120, intOrder=66, lastExchangeTime=1600799256509, loc=false, ver=2.7.0#20181130-sha1:256ae401, isClient=false], done=true], topVer=null, durationFromInit=1605328567765]
> [04:36:07,796][INFO][db-checkpoint-thread-#56%xxxxxx%][GridCacheDatabaseSharedManager] Skipping checkpoint (no pages were modified) [checkpointLockWait=0ms, checkpointLockHoldTime=14ms, reason='timeout']
> [04:36:07,832][INFO][node-stopper][GridCacheProcessor] Stopped cache [cacheName=carrier-ids-for-phones]
> [04:36:07,839][INFO][node-stopper][GridCacheProcessor] Stopped cache [cacheName=twofa-otp]
> [04:36:07,840][INFO][node-stopper][GridCacheProcessor] Stopped cache [cacheName=SQL_PUBLIC_CARRIER_CODE]
> [04:36:07,841][INFO][node-stopper][GridCacheProcessor] Stopped cache [cacheName=carrier-lookups]
> [04:36:07,842][INFO][node-stopper][GridCacheProcessor] Stopped cache [cacheName=ignite-sys-cache]
> [04:36:08,497][INFO][node-stopper][IgniteKernal%xxxxxx]
>
> >>> +---------------------------------------------------------------------------------+
> >>> Ignite ver. 2.7.0#20181130-sha1:256ae4012cb143b4855b598b740a6f3499ead4db stopped OK
> >>> +---------------------------------------------------------------------------------+
> >>> Ignite instance name: xxxxxx
> >>> Grid uptime: 52 days, 10:25:51.029