You are viewing a plain text version of this content. The canonical link for it is here.
Posted to user@ignite.apache.org by Kamlesh Joshi <Ka...@ril.com> on 2020/11/10 12:54:10 UTC

partition-exchanger system-critical thread blocked

Hi Igniters,

Sometimes 'Blocked system-critical thread' errors are getting printed in prod cluster logs. As per below logs, it's saying exchange-worker blocked for 60s. There is no node join/left or cluster activation or cache creation, still why partition exchange is triggered. Even it is triggered,  it is blocked for 60s which is huge time from prod perspective.

Below are error details,
[2020-11-09T10:55:18,888][ERROR][sys-stripe-118-#119%EDIFCustomerCC%][G] Blocked system-critical thread has been detected. This can lead to cluster-wide undefined behaviour [threadName=partition-exchanger, blockedFor=60s]
[2020-11-09T10:55:18,888][WARN ][sys-stripe-118-#119%EDIFCustomerCC%][G] Thread [name="exchange-worker-#344%EDIFCustomerCC%", id=391, state=TIMED_WAITING, blockCnt=16, waitCnt=2469961]

Cluster is responding but these errors priting, we are not understanding what's the cause, could you please help us.

Below is log snippet,

[2020-11-09T10:51:21,458][INFO ][db-checkpoint-thread-#384%EDIFCustomerCC%][GridCacheDatabaseSharedManager] Checkpoint started [checkpointId=3bd28c8f-d9c9-4110-ab8c-24cf3a5c44a3, startPtr=FileWALPointer [idx=1279664, fileOff=28148165, len=49557], checkpointLockWait=0ms, checkpointLockHoldTime=34ms, walCpRecordFsyncDuration=7ms, pages=84152, reason='timeout']
[2020-11-09T10:51:23,499][INFO ][db-checkpoint-thread-#384%EDIFCustomerCC%][GridCacheDatabaseSharedManager] Checkpoint finished [cpId=3bd28c8f-d9c9-4110-ab8c-24cf3a5c44a3, pages=84152, markPos=FileWALPointer [idx=1279664, fileOff=28148165, len=49557], walSegmentsCleared=5, walSegmentsCovered=[1279658 - 1279663], markDuration=79ms, pagesWrite=1195ms, fsync=845ms, total=2119ms]
[2020-11-09T10:51:36,788][INFO ][wal-file-archiver%EDIFCustomerCC-#345%EDIFCustomerCC%][FileWriteAheadLogManager] Starting to copy WAL segment [absIdx=1279664, segIdx=4, origFile=/datastore1/wal/node00-651e4920-2fb4-4dd5-9258-52a9f359ac35/0000000000000004.wal, dstFile=/datastore1/archive/node00-651e4920-2fb4-4dd5-9258-52a9f359ac35/0000000001279664.wal]
[2020-11-09T10:51:36,954][INFO ][wal-file-archiver%EDIFCustomerCC-#345%EDIFCustomerCC%][FileWriteAheadLogManager] Copied file [src=/datastore1/wal/node00-651e4920-2fb4-4dd5-9258-52a9f359ac35/0000000000000004.wal, dst=/datastore1/archive/node00-651e4920-2fb4-4dd5-9258-52a9f359ac35/0000000001279664.wal]
[2020-11-09T10:52:02,018][INFO ][wal-file-archiver%EDIFCustomerCC-#345%EDIFCustomerCC%][FileWriteAheadLogManager] Starting to copy WAL segment [absIdx=1279665, segIdx=5, origFile=/datastore1/wal/node00-651e4920-2fb4-4dd5-9258-52a9f359ac35/0000000000000005.wal, dstFile=/datastore1/archive/node00-651e4920-2fb4-4dd5-9258-52a9f359ac35/0000000001279665.wal]
[2020-11-09T10:52:02,200][INFO ][wal-file-archiver%EDIFCustomerCC-#345%EDIFCustomerCC%][FileWriteAheadLogManager] Copied file [src=/datastore1/wal/node00-651e4920-2fb4-4dd5-9258-52a9f359ac35/0000000000000005.wal, dst=/datastore1/archive/node00-651e4920-2fb4-4dd5-9258-52a9f359ac35/0000000001279665.wal]
[2020-11-09T10:52:36,541][INFO ][wal-file-archiver%EDIFCustomerCC-#345%EDIFCustomerCC%][FileWriteAheadLogManager] Starting to copy WAL segment [absIdx=1279666, segIdx=6, origFile=/datastore1/wal/node00-651e4920-2fb4-4dd5-9258-52a9f359ac35/0000000000000006.wal, dstFile=/datastore1/archive/node00-651e4920-2fb4-4dd5-9258-52a9f359ac35/0000000001279666.wal]
[2020-11-09T10:52:36,703][INFO ][wal-file-archiver%EDIFCustomerCC-#345%EDIFCustomerCC%][FileWriteAheadLogManager] Copied file [src=/datastore1/wal/node00-651e4920-2fb4-4dd5-9258-52a9f359ac35/0000000000000006.wal, dst=/datastore1/archive/node00-651e4920-2fb4-4dd5-9258-52a9f359ac35/0000000001279666.wal]
[2020-11-09T10:53:11,068][INFO ][wal-file-archiver%EDIFCustomerCC-#345%EDIFCustomerCC%][FileWriteAheadLogManager] Starting to copy WAL segment [absIdx=1279667, segIdx=7, origFile=/datastore1/wal/node00-651e4920-2fb4-4dd5-9258-52a9f359ac35/0000000000000007.wal, dstFile=/datastore1/archive/node00-651e4920-2fb4-4dd5-9258-52a9f359ac35/0000000001279667.wal]
[2020-11-09T10:53:11,239][INFO ][wal-file-archiver%EDIFCustomerCC-#345%EDIFCustomerCC%][FileWriteAheadLogManager] Copied file [src=/datastore1/wal/node00-651e4920-2fb4-4dd5-9258-52a9f359ac35/0000000000000007.wal, dst=/datastore1/archive/node00-651e4920-2fb4-4dd5-9258-52a9f359ac35/0000000001279667.wal]
[2020-11-09T10:53:45,694][INFO ][wal-file-archiver%EDIFCustomerCC-#345%EDIFCustomerCC%][FileWriteAheadLogManager] Starting to copy WAL segment [absIdx=1279668, segIdx=8, origFile=/datastore1/wal/node00-651e4920-2fb4-4dd5-9258-52a9f359ac35/0000000000000008.wal, dstFile=/datastore1/archive/node00-651e4920-2fb4-4dd5-9258-52a9f359ac35/0000000001279668.wal]
[2020-11-09T10:53:45,868][INFO ][wal-file-archiver%EDIFCustomerCC-#345%EDIFCustomerCC%][FileWriteAheadLogManager] Copied file [src=/datastore1/wal/node00-651e4920-2fb4-4dd5-9258-52a9f359ac35/0000000000000008.wal, dst=/datastore1/archive/node00-651e4920-2fb4-4dd5-9258-52a9f359ac35/0000000001279668.wal]
[2020-11-09T10:54:21,411][INFO ][wal-file-archiver%EDIFCustomerCC-#345%EDIFCustomerCC%][FileWriteAheadLogManager] Starting to copy WAL segment [absIdx=1279669, segIdx=9, origFile=/datastore1/wal/node00-651e4920-2fb4-4dd5-9258-52a9f359ac35/0000000000000009.wal, dstFile=/datastore1/archive/node00-651e4920-2fb4-4dd5-9258-52a9f359ac35/0000000001279669.wal]
[2020-11-09T10:54:21,480][INFO ][db-checkpoint-thread-#384%EDIFCustomerCC%][GridCacheDatabaseSharedManager] Checkpoint started [checkpointId=33221fc2-5ab0-4929-9d23-8940e0552272, startPtr=FileWALPointer [idx=1279670, fileOff=5561792, len=49557], checkpointLockWait=0ms, checkpointLockHoldTime=51ms, walCpRecordFsyncDuration=5ms, pages=86602, reason='timeout']
[2020-11-09T10:54:21,578][INFO ][wal-file-archiver%EDIFCustomerCC-#345%EDIFCustomerCC%][FileWriteAheadLogManager] Copied file [src=/datastore1/wal/node00-651e4920-2fb4-4dd5-9258-52a9f359ac35/0000000000000009.wal, dst=/datastore1/archive/node00-651e4920-2fb4-4dd5-9258-52a9f359ac35/0000000001279669.wal]
[2020-11-09T10:54:23,624][INFO ][db-checkpoint-thread-#384%EDIFCustomerCC%][GridCacheDatabaseSharedManager] Checkpoint finished [cpId=33221fc2-5ab0-4929-9d23-8940e0552272, pages=86602, markPos=FileWALPointer [idx=1279670, fileOff=5561792, len=49557], walSegmentsCleared=6, walSegmentsCovered=[1279664 - 1279669], markDuration=94ms, pagesWrite=1219ms, fsync=925ms, total=2238ms]
[2020-11-09T10:54:45,727][INFO ][wal-file-archiver%EDIFCustomerCC-#345%EDIFCustomerCC%][FileWriteAheadLogManager] Starting to copy WAL segment [absIdx=1279670, segIdx=0, origFile=/datastore1/wal/node00-651e4920-2fb4-4dd5-9258-52a9f359ac35/0000000000000000.wal, dstFile=/datastore1/archive/node00-651e4920-2fb4-4dd5-9258-52a9f359ac35/0000000001279670.wal]
[2020-11-09T10:54:45,886][INFO ][wal-file-archiver%EDIFCustomerCC-#345%EDIFCustomerCC%][FileWriteAheadLogManager] Copied file [src=/datastore1/wal/node00-651e4920-2fb4-4dd5-9258-52a9f359ac35/0000000000000000.wal, dst=/datastore1/archive/node00-651e4920-2fb4-4dd5-9258-52a9f359ac35/0000000001279670.wal]
[2020-11-09T10:55:17,146][INFO ][wal-file-archiver%EDIFCustomerCC-#345%EDIFCustomerCC%][FileWriteAheadLogManager] Starting to copy WAL segment [absIdx=1279671, segIdx=1, origFile=/datastore1/wal/node00-651e4920-2fb4-4dd5-9258-52a9f359ac35/0000000000000001.wal, dstFile=/datastore1/archive/node00-651e4920-2fb4-4dd5-9258-52a9f359ac35/0000000001279671.wal]
[2020-11-09T10:55:17,343][INFO ][wal-file-archiver%EDIFCustomerCC-#345%EDIFCustomerCC%][FileWriteAheadLogManager] Copied file [src=/datastore1/wal/node00-651e4920-2fb4-4dd5-9258-52a9f359ac35/0000000000000001.wal, dst=/datastore1/archive/node00-651e4920-2fb4-4dd5-9258-52a9f359ac35/0000000001279671.wal]
[2020-11-09T10:55:18,888][ERROR][sys-stripe-118-#119%EDIFCustomerCC%][G] Blocked system-critical thread has been detected. This can lead to cluster-wide undefined behaviour [threadName=partition-exchanger, blockedFor=60s]
[2020-11-09T10:55:18,888][WARN ][sys-stripe-118-#119%EDIFCustomerCC%][G] Thread [name="exchange-worker-#344%EDIFCustomerCC%", id=391, state=TIMED_WAITING, blockCnt=16, waitCnt=2469961]
    Lock [object=java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject@223507fb, ownerName=null, ownerId=-1]

[2020-11-09T10:55:18,889][ERROR][sys-stripe-118-#119%EDIFCustomerCC%][] Critical system error detected. Will be handled accordingly to configured handler [hnd=StopNodeOrHaltFailureHandler [tryStop=false, timeout=0, super=AbstractFailureHandler [ignoredFailureTypes=[SYSTEM_WORKER_BLOCKED, SYSTEM_CRITICAL_OPERATION_TIMEOUT]]], failureCtx=FailureContext [type=SYSTEM_WORKER_BLOCKED, err=class o.a.i.IgniteException: GridWorker [name=partition-exchanger, igniteInstanceName=EDIFCustomerCC, finished=false, heartbeatTs=1604899458881]]]
org.apache.ignite.IgniteException: GridWorker [name=partition-exchanger, igniteInstanceName=EDIFCustomerCC, finished=false, heartbeatTs=1604899458881]
                at org.apache.ignite.internal.IgnitionEx$IgniteNamedInstance$2.apply(IgnitionEx.java:1831) [ignite-core-2.7.6.jar:2.7.6]
                at org.apache.ignite.internal.IgnitionEx$IgniteNamedInstance$2.apply(IgnitionEx.java:1826) [ignite-core-2.7.6.jar:2.7.6]
                at org.apache.ignite.internal.worker.WorkersRegistry.onIdle(WorkersRegistry.java:233) [ignite-core-2.7.6.jar:2.7.6]
                at org.apache.ignite.internal.util.worker.GridWorker.onIdle(GridWorker.java:297) [ignite-core-2.7.6.jar:2.7.6]
                at org.apache.ignite.internal.util.StripedExecutor$Stripe.body(StripedExecutor.java:513) [ignite-core-2.7.6.jar:2.7.6]
                at org.apache.ignite.internal.util.worker.GridWorker.run(GridWorker.java:120) [ignite-core-2.7.6.jar:2.7.6]
                at java.lang.Thread.run(Thread.java:748) [?:1.8.0_151]
[2020-11-09T10:55:18,894][WARN ][sys-stripe-118-#119%EDIFCustomerCC%][FailureProcessor] No deadlocked threads detected.
[2020-11-09T10:55:21,404][WARN ][jvm-pause-detector-worker][IgniteKernal%EDIFCustomerCC] Possible too long JVM pause: 2477 milliseconds.
[2020-11-09T10:55:21,422][WARN ][sys-stripe-118-#119%EDIFCustomerCC%][FailureProcessor] Thread dump at 2020/11/09 10:55:21 IST
Thread [name="sys-#213675%EDIFCustomerCC%", id=369652, state=TIMED_WAITING, blockCnt=0, waitCnt=1]
    Lock [object=java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject@42c969fd, ownerName=null, ownerId=-1]
        at sun.misc.Unsafe.park(Native Method)
        at java.util.concurrent.locks.LockSupport.parkNanos(LockSupport.java:215)
        at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.awaitNanos(AbstractQueuedSynchronizer.java:2078)
        at java.util.concurrent.LinkedBlockingQueue.poll(LinkedBlockingQueue.java:467)
        at java.util.concurrent.ThreadPoolExecutor.getTask(ThreadPoolExecutor.java:1073)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1134)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
        at java.lang.Thread.run(Thread.java:748)

Thread [name="sys-#213674%EDIFCustomerCC%", id=369651, state=TIMED_WAITING, blockCnt=0, waitCnt=1]
    Lock [object=java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject@42c969fd, ownerName=null, ownerId=-1]
        at sun.misc.Unsafe.park(Native Method)
        at java.util.concurrent.locks.LockSupport.parkNanos(LockSupport.java:215)
        at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.awaitNanos(AbstractQueuedSynchronizer.java:2078)
        at java.util.concurrent.LinkedBlockingQueue.poll(LinkedBlockingQueue.java:467)
        at java.util.concurrent.ThreadPoolExecutor.getTask(ThreadPoolExecutor.java:1073)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1134)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
        at java.lang.Thread.run(Thread.java:748)


Thanks and Regards,
Kamlesh Joshi

"Confidentiality Warning: This message and any attachments are intended only for the use of the intended recipient(s). 
are confidential and may be privileged. If you are not the intended recipient. you are hereby notified that any 
review. re-transmission. conversion to hard copy. copying. circulation or other use of this message and any attachments is 
strictly prohibited. If you are not the intended recipient. please notify the sender immediately by return email. 
and delete this message and any attachments from your system.

Virus Warning: Although the company has taken reasonable precautions to ensure no viruses are present in this email. 
The company cannot accept responsibility for any loss or damage arising from the use of this email or attachment."

Re: partition-exchanger system-critical thread blocked

Posted by Ilya Kasnacheev <il...@gmail.com>.
Hello!

I don't think running client nodes in kubernetes is supported. You can only
use thin client inside k8s if your server nodes are outside.

Regards,
-- 
Ilya Kasnacheev


вт, 15 дек. 2020 г. в 20:25, Gangaiah Gundeboina <ga...@gmail.com>:

> Hi Ilya,
>
> Our server nodes not running in kubernetes, so we have simple ignite
> cluster
> no relation to kubernetes.
>
> But we have few clients which running inside kubernetes cluster and few
> clients are out side kubernets, both are connecting to same ignite cluster.
>
>
>
>
> -----
> Thanks and Regards,
> Gangaiah
> --
> Sent from: http://apache-ignite-users.70518.x6.nabble.com/
>

Re: partition-exchanger system-critical thread blocked

Posted by Gangaiah Gundeboina <ga...@gmail.com>.
Hi Ilya,

Our server nodes not running in kubernetes, so we have simple ignite cluster
no relation to kubernetes.

But we have few clients which running inside kubernetes cluster and few
clients are out side kubernets, both are connecting to same ignite cluster.




-----
Thanks and Regards,
Gangaiah
--
Sent from: http://apache-ignite-users.70518.x6.nabble.com/

Re: partition-exchanger system-critical thread blocked

Posted by Ilya Kasnacheev <il...@gmail.com>.
Hello!

I think it is only available in 2.9.0.

Prior to 2.9.0, clients outside of kubernetes may not connect to k8s
cluster.

Regards,
-- 
Ilya Kasnacheev


вт, 15 дек. 2020 г. в 14:55, Gangaiah Gundeboina <ga...@gmail.com>:

> Hi Ilya,
>
> We are using 2.7.6 version, is it available in the version.
>
> And also, the clients which are running in kubernetes able to connect to
> the
> cluster. Sometimes facing the issue whenever we restart the multiple
> clients/adding multiple clients to cluster and suddenly sometimes the error
> printing in logs but cluster is up and running. Sometimes even few clients
> getting disconnected from cluster whenever you see the error in logs but
> not
> all the times.
>
>
>
>
>
> -----
> Thanks and Regards,
> Gangaiah
> --
> Sent from: http://apache-ignite-users.70518.x6.nabble.com/
>

Re: partition-exchanger system-critical thread blocked

Posted by Gangaiah Gundeboina <ga...@gmail.com>.
Hi Ilya,

We are using 2.7.6 version, is it available in the version.

And also, the clients which are running in kubernetes able to connect to the
cluster. Sometimes facing the issue whenever we restart the multiple
clients/adding multiple clients to cluster and suddenly sometimes the error
printing in logs but cluster is up and running. Sometimes even few clients
getting disconnected from cluster whenever you see the error in logs but not
all the times.





-----
Thanks and Regards,
Gangaiah
--
Sent from: http://apache-ignite-users.70518.x6.nabble.com/

Re: partition-exchanger system-critical thread blocked

Posted by Gangaiah Gundeboina <ga...@gmail.com>.
Hi Ilya,

We are using 2.7.6 version, is it available in the version. 

<bean class="org.apache.ignite.configuration.IgniteConfiguration">
    <property name="clientMode" value="true"/>
    <property name="communicationSpi">
        <bean
class="org.apache.ignite.spi.communication.tcp.TcpCommunicationSpi">
            <property name="forceClientToServerConnections" value="true"/>
        </bean>
    </property>
</bean>

Thanks and Regards,
Gangaiah



-----
Thanks and Regards,
Gangaiah
--
Sent from: http://apache-ignite-users.70518.x6.nabble.com/

Re: partition-exchanger system-critical thread blocked

Posted by Ilya Kasnacheev <il...@gmail.com>.
Hello!

Thread [name="tcp-comm-worker-#1%EDIFCustomerCC%", id=369,
state=RUNNABLE, blockCnt=1570, waitCnt=8053]
        at sun.nio.ch.Net.poll(Native Method)
        at sun.nio.ch.SocketChannelImpl.poll(SocketChannelImpl.java:954)
        - locked java.lang.Object@77988a56
        at sun.nio.ch.SocketAdaptor.connect(SocketAdaptor.java:110)
        - locked java.lang.Object@545b55af
        at o.a.i.spi.communication.tcp.TcpCommunicationSpi.createTcpClient(TcpCommunicationSpi.java:3299)
        at o.a.i.spi.communication.tcp.TcpCommunicationSpi.createNioClient(TcpCommunicationSpi.java:2987)
        at o.a.i.spi.communication.tcp.TcpCommunicationSpi.reserveClient(TcpCommunicationSpi.java:2870)
        at o.a.i.spi.communication.tcp.TcpCommunicationSpi.access$6000(TcpCommunicationSpi.java:271)
        at o.a.i.spi.communication.tcp.TcpCommunicationSpi$CommunicationWorker.processDisconnect(TcpCommunicationSpi.java:4489)
        at o.a.i.spi.communication.tcp.TcpCommunicationSpi$CommunicationWorker.body(TcpCommunicationSpi.java:4294)
        at o.a.i.i.util.worker.GridWorker.run(GridWorker.java:120)
        at o.a.i.spi.communication.tcp.TcpCommunicationSpi$5.body(TcpCommunicationSpi.java:2237)
        at o.a.i.spi.IgniteSpiThread.run(IgniteSpiThread.java:62)


This and other messages in the log point to connectivity problems between
your nodes.
Try enabling communication via discovery:
https://ignite.apache.org/docs/latest/clustering/running-client-nodes-behind-nat

Regards,
-- 
Ilya Kasnacheev


пн, 14 дек. 2020 г. в 21:07, Gangaiah Gundeboina <ga...@gmail.com>:

> Hi Ilya,
>
> We have configured atomictyMode 'ATOMIC' for all caches it's not
> transactional, below is our cache configuration for one of the caches.
>
> ############### Cache Configuration ################
>
> <property name="name" value="DeviceDetailsCache" />
>
>                     <property name="cacheMode" value="PARTITIONED" />
>
>                     <property name="atomicityMode" value="ATOMIC" />
>
>                     <property name="backups" value="1" />
>
>                     <property name="statisticsEnabled" value="true"/>
>
>
>
> I have searched with word 'tcp-comm-worker' in whole log file, could see
> single thread tcp-comm-worker-#1
>
>  And also attached whole log file, please check  ignite.log
> <http://apache-ignite-users.70518.x6.nabble.com/file/t2396/ignite.log>
>
> You can refer logs from 2020-12-11 12:50:00 to 12:54:00, we can see the
> tread blocked while multiple clients trying to connect to cluster.
>
> #############################
>
> Line 2351: [2020-12-07T02:22:17,350][WARN
> ][tcp-disco-msg-worker-#2%EDIFCustomerCC%][G] Thread
> [name="tcp-comm-worker-#1%EDIFCustomerCC%", id=369, state=RUNNABLE,
> blockCnt=1514, waitCnt=7411]
>
>                 Line 4507: Thread
> [name="tcp-comm-worker-#1%EDIFCustomerCC%", id=369, state=RUNNABLE,
> blockCnt=1514, waitCnt=7411]
>
>                 Line 50755: [2020-12-10T05:14:53,287][WARN
> ][sys-stripe-97-#98%EDIFCustomerCC%][G] Thread
> [name="tcp-comm-worker-#1%EDIFCustomerCC%", id=369, state=RUNNABLE,
> blockCnt=1556, waitCnt=7861]
>
>                 Line 54210: Thread
> [name="tcp-comm-worker-#1%EDIFCustomerCC%", id=369, state=RUNNABLE,
> blockCnt=1556, waitCnt=7861]
>
>                 Line 60084: [2020-12-10T14:30:33,360][WARN
> ][sys-stripe-33-#34%EDIFCustomerCC%][G] Thread
> [name="tcp-comm-worker-#1%EDIFCustomerCC%", id=369, state=RUNNABLE,
> blockCnt=1561, waitCnt=7918]
>
>                 Line 62467: Thread
> [name="tcp-comm-worker-#1%EDIFCustomerCC%", id=369, state=RUNNABLE,
> blockCnt=1561, waitCnt=7918]
>
>                 Line 69864: Thread
> [name="tcp-comm-worker-#1%EDIFCustomerCC%", id=369, state=RUNNABLE,
> blockCnt=1561, waitCnt=7918]
>
>                 Line 76870: Thread
> [name="tcp-comm-worker-#1%EDIFCustomerCC%", id=369, state=RUNNABLE,
> blockCnt=1561, waitCnt=7918]
>
>                 Line 84077: Thread
> [name="tcp-comm-worker-#1%EDIFCustomerCC%", id=369, state=RUNNABLE,
> blockCnt=1561, waitCnt=7918]
>
>                 Line 90968: Thread
> [name="tcp-comm-worker-#1%EDIFCustomerCC%", id=369, state=RUNNABLE,
> blockCnt=1561, waitCnt=7918]
>
>                 Line 97851: Thread
> [name="tcp-comm-worker-#1%EDIFCustomerCC%", id=369, state=RUNNABLE,
> blockCnt=1561, waitCnt=7918]
>
>                 Line 105066: Thread
> [name="tcp-comm-worker-#1%EDIFCustomerCC%", id=369, state=RUNNABLE,
> blockCnt=1561, waitCnt=7918]
>
>                 Line 113229: Thread
> [name="tcp-comm-worker-#1%EDIFCustomerCC%", id=369, state=RUNNABLE,
> blockCnt=1561, waitCnt=7918]
>
>                 Line 120851: Thread
> [name="tcp-comm-worker-#1%EDIFCustomerCC%", id=369, state=RUNNABLE,
> blockCnt=1561, waitCnt=7918]
>
>                 Line 124129: [2020-12-10T14:32:36,828][WARN
> ][tcp-disco-msg-worker-#2%EDIFCustomerCC%][G] Thread
> [name="tcp-comm-worker-#1%EDIFCustomerCC%", id=369, state=RUNNABLE,
> blockCnt=1561, waitCnt=7918]
>
>                 Line 128469: Thread
> [name="tcp-comm-worker-#1%EDIFCustomerCC%", id=369, state=RUNNABLE,
> blockCnt=1561, waitCnt=7918]
>
>                 Line 136378: Thread
> [name="tcp-comm-worker-#1%EDIFCustomerCC%", id=369, state=RUNNABLE,
> blockCnt=1561, waitCnt=7918]
>
>                 Line 144518: Thread
> [name="tcp-comm-worker-#1%EDIFCustomerCC%", id=369, state=RUNNABLE,
> blockCnt=1561, waitCnt=7918]
>
>                 Line 150128: [2020-12-10T14:34:40,071][WARN
> ][tcp-disco-msg-worker-#2%EDIFCustomerCC%][G] Thread
> [name="tcp-comm-worker-#1%EDIFCustomerCC%", id=369, state=RUNNABLE,
> blockCnt=1561, waitCnt=7918]
>
>                 Line 155898: Thread
> [name="tcp-comm-worker-#1%EDIFCustomerCC%", id=369, state=RUNNABLE,
> blockCnt=1561, waitCnt=7918]
>
>                 Line 164791: Thread
> [name="tcp-comm-worker-#1%EDIFCustomerCC%", id=369, state=RUNNABLE,
> blockCnt=1561, waitCnt=7918]
>
>                 Line 168214: [2020-12-10T14:35:43,175][WARN
> ][tcp-disco-msg-worker-#2%EDIFCustomerCC%][G] Thread
> [name="tcp-comm-worker-#1%EDIFCustomerCC%", id=369, state=RUNNABLE,
> blockCnt=1561, waitCnt=7918]
>
>                 Line 172241: Thread
> [name="tcp-comm-worker-#1%EDIFCustomerCC%", id=369, state=RUNNABLE,
> blockCnt=1561, waitCnt=7918]
>
>                 Line 175537: [2020-12-10T14:37:46,346][WARN
> ][sys-stripe-64-#65%EDIFCustomerCC%][G] Thread
> [name="tcp-comm-worker-#1%EDIFCustomerCC%", id=369, state=RUNNABLE,
> blockCnt=1561, waitCnt=7918]
>
>                 Line 179032: Thread
> [name="tcp-comm-worker-#1%EDIFCustomerCC%", id=369, state=RUNNABLE,
> blockCnt=1561, waitCnt=7918]
>
>                 Line 190529: [2020-12-11T12:52:19,275][WARN
> ][sys-stripe-111-#112%EDIFCustomerCC%][G] Thread
> [name="tcp-comm-worker-#1%EDIFCustomerCC%", id=369, state=RUNNABLE,
> blockCnt=1570, waitCnt=8053]
>
>                 Line 194008: Thread
> [name="tcp-comm-worker-#1%EDIFCustomerCC%", id=369, state=RUNNABLE,
> blockCnt=1570, waitCnt=8053]
>
>                 Line 197166: [2020-12-11T12:54:26,461][WARN
> ][sys-stripe-28-#29%EDIFCustomerCC%][G] Thread
> [name="tcp-comm-worker-#1%EDIFCustomerCC%", id=369, state=RUNNABLE,
> blockCnt=1570, waitCnt=8053]
>
>                 Line 200639: Thread
> [name="tcp-comm-worker-#1%EDIFCustomerCC%", id=369, state=RUNNABLE,
> blockCnt=1570, waitCnt=8053]
>
>                 Line 203939: [2020-12-11T12:56:31,000][WARN
> ][sys-stripe-92-#93%EDIFCustomerCC%][G] Thread
> [name="tcp-comm-worker-#1%EDIFCustomerCC%", id=369, state=RUNNABLE,
> blockCnt=1570, waitCnt=8053]
>
>                 Line 207692: Thread
> [name="tcp-comm-worker-#1%EDIFCustomerCC%", id=369, state=RUNNABLE,
> blockCnt=1570, waitCnt=8053]
>
>                 Line 211113: [2020-12-11T12:59:38,095][WARN
> ][tcp-disco-msg-worker-#2%EDIFCustomerCC%][G] Thread
> [name="tcp-comm-worker-#1%EDIFCustomerCC%", id=369, state=RUNNABLE,
> blockCnt=1570, waitCnt=8054]
>
>                 Line 214772: Thread
> [name="tcp-comm-worker-#1%EDIFCustomerCC%", id=369, state=RUNNABLE,
> blockCnt=1570, waitCnt=8054]
>
>                 Line 218112: [2020-12-11T13:01:41,252][WARN
> ][sys-stripe-28-#29%EDIFCustomerCC%][G] Thread
> [name="tcp-comm-worker-#1%EDIFCustomerCC%", id=369, state=RUNNABLE,
> blockCnt=1570, waitCnt=8054]
>
>                 Line 221990: Thread
> [name="tcp-comm-worker-#1%EDIFCustomerCC%", id=369, state=RUNNABLE,
> blockCnt=1570, waitCnt=8054]
>
>                 Line 225629: [2020-12-11T13:03:45,522][WARN
> ][sys-stripe-47-#48%EDIFCustomerCC%][G] Thread
> [name="tcp-comm-worker-#1%EDIFCustomerCC%", id=369, state=RUNNABLE,
> blockCnt=1570, waitCnt=8054]
>
>                 Line 229948: Thread
> [name="tcp-comm-worker-#1%EDIFCustomerCC%", id=369, state=RUNNABLE,
> blockCnt=1570, waitCnt=8054]
>
>
>
> ###########################################
>
>
>
> -----
> Thanks and Regards,
> Gangaiah
> --
> Sent from: http://apache-ignite-users.70518.x6.nabble.com/
>

Re: partition-exchanger system-critical thread blocked

Posted by Gangaiah Gundeboina <ga...@gmail.com>.
Hi Ilya,

We have configured atomictyMode 'ATOMIC' for all caches it's not
transactional, below is our cache configuration for one of the caches. 

############### Cache Configuration ################

<property name="name" value="DeviceDetailsCache" />

                    <property name="cacheMode" value="PARTITIONED" />

                    <property name="atomicityMode" value="ATOMIC" />

                    <property name="backups" value="1" />

                    <property name="statisticsEnabled" value="true"/>

 

I have searched with word 'tcp-comm-worker' in whole log file, could see
single thread tcp-comm-worker-#1

 And also attached whole log file, please check  ignite.log
<http://apache-ignite-users.70518.x6.nabble.com/file/t2396/ignite.log>  

You can refer logs from 2020-12-11 12:50:00 to 12:54:00, we can see the
tread blocked while multiple clients trying to connect to cluster.

#############################

Line 2351: [2020-12-07T02:22:17,350][WARN
][tcp-disco-msg-worker-#2%EDIFCustomerCC%][G] Thread
[name="tcp-comm-worker-#1%EDIFCustomerCC%", id=369, state=RUNNABLE,
blockCnt=1514, waitCnt=7411]

                Line 4507: Thread
[name="tcp-comm-worker-#1%EDIFCustomerCC%", id=369, state=RUNNABLE,
blockCnt=1514, waitCnt=7411]

                Line 50755: [2020-12-10T05:14:53,287][WARN
][sys-stripe-97-#98%EDIFCustomerCC%][G] Thread
[name="tcp-comm-worker-#1%EDIFCustomerCC%", id=369, state=RUNNABLE,
blockCnt=1556, waitCnt=7861]

                Line 54210: Thread
[name="tcp-comm-worker-#1%EDIFCustomerCC%", id=369, state=RUNNABLE,
blockCnt=1556, waitCnt=7861]

                Line 60084: [2020-12-10T14:30:33,360][WARN
][sys-stripe-33-#34%EDIFCustomerCC%][G] Thread
[name="tcp-comm-worker-#1%EDIFCustomerCC%", id=369, state=RUNNABLE,
blockCnt=1561, waitCnt=7918]

                Line 62467: Thread
[name="tcp-comm-worker-#1%EDIFCustomerCC%", id=369, state=RUNNABLE,
blockCnt=1561, waitCnt=7918]

                Line 69864: Thread
[name="tcp-comm-worker-#1%EDIFCustomerCC%", id=369, state=RUNNABLE,
blockCnt=1561, waitCnt=7918]

                Line 76870: Thread
[name="tcp-comm-worker-#1%EDIFCustomerCC%", id=369, state=RUNNABLE,
blockCnt=1561, waitCnt=7918]

                Line 84077: Thread
[name="tcp-comm-worker-#1%EDIFCustomerCC%", id=369, state=RUNNABLE,
blockCnt=1561, waitCnt=7918]

                Line 90968: Thread
[name="tcp-comm-worker-#1%EDIFCustomerCC%", id=369, state=RUNNABLE,
blockCnt=1561, waitCnt=7918]

                Line 97851: Thread
[name="tcp-comm-worker-#1%EDIFCustomerCC%", id=369, state=RUNNABLE,
blockCnt=1561, waitCnt=7918]

                Line 105066: Thread
[name="tcp-comm-worker-#1%EDIFCustomerCC%", id=369, state=RUNNABLE,
blockCnt=1561, waitCnt=7918]

                Line 113229: Thread
[name="tcp-comm-worker-#1%EDIFCustomerCC%", id=369, state=RUNNABLE,
blockCnt=1561, waitCnt=7918]

                Line 120851: Thread
[name="tcp-comm-worker-#1%EDIFCustomerCC%", id=369, state=RUNNABLE,
blockCnt=1561, waitCnt=7918]

                Line 124129: [2020-12-10T14:32:36,828][WARN
][tcp-disco-msg-worker-#2%EDIFCustomerCC%][G] Thread
[name="tcp-comm-worker-#1%EDIFCustomerCC%", id=369, state=RUNNABLE,
blockCnt=1561, waitCnt=7918]

                Line 128469: Thread
[name="tcp-comm-worker-#1%EDIFCustomerCC%", id=369, state=RUNNABLE,
blockCnt=1561, waitCnt=7918]

                Line 136378: Thread
[name="tcp-comm-worker-#1%EDIFCustomerCC%", id=369, state=RUNNABLE,
blockCnt=1561, waitCnt=7918]

                Line 144518: Thread
[name="tcp-comm-worker-#1%EDIFCustomerCC%", id=369, state=RUNNABLE,
blockCnt=1561, waitCnt=7918]

                Line 150128: [2020-12-10T14:34:40,071][WARN
][tcp-disco-msg-worker-#2%EDIFCustomerCC%][G] Thread
[name="tcp-comm-worker-#1%EDIFCustomerCC%", id=369, state=RUNNABLE,
blockCnt=1561, waitCnt=7918]

                Line 155898: Thread
[name="tcp-comm-worker-#1%EDIFCustomerCC%", id=369, state=RUNNABLE,
blockCnt=1561, waitCnt=7918]

                Line 164791: Thread
[name="tcp-comm-worker-#1%EDIFCustomerCC%", id=369, state=RUNNABLE,
blockCnt=1561, waitCnt=7918]

                Line 168214: [2020-12-10T14:35:43,175][WARN
][tcp-disco-msg-worker-#2%EDIFCustomerCC%][G] Thread
[name="tcp-comm-worker-#1%EDIFCustomerCC%", id=369, state=RUNNABLE,
blockCnt=1561, waitCnt=7918]

                Line 172241: Thread
[name="tcp-comm-worker-#1%EDIFCustomerCC%", id=369, state=RUNNABLE,
blockCnt=1561, waitCnt=7918]

                Line 175537: [2020-12-10T14:37:46,346][WARN
][sys-stripe-64-#65%EDIFCustomerCC%][G] Thread
[name="tcp-comm-worker-#1%EDIFCustomerCC%", id=369, state=RUNNABLE,
blockCnt=1561, waitCnt=7918]

                Line 179032: Thread
[name="tcp-comm-worker-#1%EDIFCustomerCC%", id=369, state=RUNNABLE,
blockCnt=1561, waitCnt=7918]

                Line 190529: [2020-12-11T12:52:19,275][WARN
][sys-stripe-111-#112%EDIFCustomerCC%][G] Thread
[name="tcp-comm-worker-#1%EDIFCustomerCC%", id=369, state=RUNNABLE,
blockCnt=1570, waitCnt=8053]

                Line 194008: Thread
[name="tcp-comm-worker-#1%EDIFCustomerCC%", id=369, state=RUNNABLE,
blockCnt=1570, waitCnt=8053]

                Line 197166: [2020-12-11T12:54:26,461][WARN
][sys-stripe-28-#29%EDIFCustomerCC%][G] Thread
[name="tcp-comm-worker-#1%EDIFCustomerCC%", id=369, state=RUNNABLE,
blockCnt=1570, waitCnt=8053]

                Line 200639: Thread
[name="tcp-comm-worker-#1%EDIFCustomerCC%", id=369, state=RUNNABLE,
blockCnt=1570, waitCnt=8053]

                Line 203939: [2020-12-11T12:56:31,000][WARN
][sys-stripe-92-#93%EDIFCustomerCC%][G] Thread
[name="tcp-comm-worker-#1%EDIFCustomerCC%", id=369, state=RUNNABLE,
blockCnt=1570, waitCnt=8053]

                Line 207692: Thread
[name="tcp-comm-worker-#1%EDIFCustomerCC%", id=369, state=RUNNABLE,
blockCnt=1570, waitCnt=8053]

                Line 211113: [2020-12-11T12:59:38,095][WARN
][tcp-disco-msg-worker-#2%EDIFCustomerCC%][G] Thread
[name="tcp-comm-worker-#1%EDIFCustomerCC%", id=369, state=RUNNABLE,
blockCnt=1570, waitCnt=8054]

                Line 214772: Thread
[name="tcp-comm-worker-#1%EDIFCustomerCC%", id=369, state=RUNNABLE,
blockCnt=1570, waitCnt=8054]

                Line 218112: [2020-12-11T13:01:41,252][WARN
][sys-stripe-28-#29%EDIFCustomerCC%][G] Thread
[name="tcp-comm-worker-#1%EDIFCustomerCC%", id=369, state=RUNNABLE,
blockCnt=1570, waitCnt=8054]

                Line 221990: Thread
[name="tcp-comm-worker-#1%EDIFCustomerCC%", id=369, state=RUNNABLE,
blockCnt=1570, waitCnt=8054]

                Line 225629: [2020-12-11T13:03:45,522][WARN
][sys-stripe-47-#48%EDIFCustomerCC%][G] Thread
[name="tcp-comm-worker-#1%EDIFCustomerCC%", id=369, state=RUNNABLE,
blockCnt=1570, waitCnt=8054]

                Line 229948: Thread
[name="tcp-comm-worker-#1%EDIFCustomerCC%", id=369, state=RUNNABLE,
blockCnt=1570, waitCnt=8054]

 

###########################################



-----
Thanks and Regards,
Gangaiah
--
Sent from: http://apache-ignite-users.70518.x6.nabble.com/

Re: partition-exchanger system-critical thread blocked

Posted by Ilya Kasnacheev <il...@gmail.com>.
Hello!

Can you provide stack trace from that thread?

No, this is a different thread than exchange worker.

I don't think that cluster is supposed to fail clients in this case. We
have timeout for transactions waiting on PME now.

Regards,
-- 
Ilya Kasnacheev


сб, 12 дек. 2020 г. в 10:23, Gangaiah Gundeboina <ga...@gmail.com>:

> Hi Ilya,
>
> Again we got same issue 'tcp-comm-worker-#1' blocked. We have observed one
> thing whenever multiple clients joining to cluster at same time, the error
> is printing. I could see there is only one thread(tcp-comm-worker), checked
> from logs and also from visualvm.
>
> From documentation also, could see there is only one thread per node below
> are details. Is it same thread?
>
> #######################################################
> Initialization of exchange means adding future
> GridDhtPartitionsExchangeFuture to queue on each node.
>
> Put to this queue (GridCachePartitionExchangeManager.ExchangeWorker#futQ)
> is
> done from the discovery thread.
>
> 'Exchange worker' thread manages this queue. Using only one exchange worker
> thread per node provides a strict processing order for futures.
> ########################################################
>
> And also I heard cluster will kick the clients where more
> transactions/atomic updates happening while waiting for partition-exchange
> to complete. Is it could be reason, could you please help us to find out
> the
> cause and fix the issue.
>
> Thanks and Regards,
> Gangaiah
>
>
>
> -----
> Thanks and Regards,
> Gangaiah
> --
> Sent from: http://apache-ignite-users.70518.x6.nabble.com/
>

Re: partition-exchanger system-critical thread blocked

Posted by Gangaiah Gundeboina <ga...@gmail.com>.
Hi Ilya,

Again we got same issue 'tcp-comm-worker-#1' blocked. We have observed one
thing whenever multiple clients joining to cluster at same time, the error
is printing. I could see there is only one thread(tcp-comm-worker), checked
from logs and also from visualvm. 

From documentation also, could see there is only one thread per node below
are details. Is it same thread?

#######################################################
Initialization of exchange means adding future
GridDhtPartitionsExchangeFuture to queue on each node.

Put to this queue (GridCachePartitionExchangeManager.ExchangeWorker#futQ) is
done from the discovery thread.

'Exchange worker' thread manages this queue. Using only one exchange worker
thread per node provides a strict processing order for futures.
########################################################

And also I heard cluster will kick the clients where more
transactions/atomic updates happening while waiting for partition-exchange
to complete. Is it could be reason, could you please help us to find out the
cause and fix the issue.

Thanks and Regards,
Gangaiah



-----
Thanks and Regards,
Gangaiah
--
Sent from: http://apache-ignite-users.70518.x6.nabble.com/

Re: partition-exchanger system-critical thread blocked

Posted by Gangaiah Gundeboina <ga...@gmail.com>.
Hi Ilya,

Regarding exchange worker, the jira saying that 2.8.1 and 2.9 but we are
using 2.7.6 version. So it could be in 2.7.6 as well. But still I am not
understanding here is, there is no node join/left or cluster activation or
cache creation near to the error logging time. Don't know why PME triggered.


Regarding communication worker first four thread dump, if we see it's
blocked 110secs it's huge time. It's happening in production, it's should
not blocked this much time even we have given failure detection time out and
networktimeout 60 secs. I don't think there is network issue in prod, there
is something which blocked this one. We have around 300 thick clients
connected to the prod cluster.

Both are very important for us as it's in prod, any more pointers which
could have cause these issues.

Regards,
Gangaiah





-----
Thanks and Regards,
Gangaiah
--
Sent from: http://apache-ignite-users.70518.x6.nabble.com/

Re: partition-exchanger system-critical thread blocked

Posted by Ilya Kasnacheev <il...@gmail.com>.
Hello!

For the first 4 thread dump, the problem was in establishing communication
connection to one of nodes:

[2020-11-09T02:31:41,105][WARN
][tcp-comm-worker-#1%EDIFCustomerCC%][TcpCommunicationSpi] Connect timed
out (consider increasing 'failureDetectionTimeout' configuration property)
[addr=/kub4.101:47100, failureDetectionTimeout=60000]
[2020-11-09T02:31:41,105][WARN
][tcp-comm-worker-#1%EDIFCustomerCC%][TcpCommunicationSpi] Failed to
connect to a remote node (make sure that destination node is alive and
operating system firewall is disabled on local and remote hosts)
[addrs=[/kub4.101:47100, /127.0.0.1:47100]]


For the last 2, it's not obvious what happens there. This may be an
occurrence of https://issues.apache.org/jira/browse/IGNITE-13540

Regards.
-- 
Ilya Kasnacheev


пт, 13 нояб. 2020 г. в 23:06, Gangaiah Gundeboina <gangaiah.yadav@gmail.com
>:

> Hi Ilya,
>
> Please find attached entire log file.
>
> Regards,
> Gangaiah
>
> 18.zip <http://apache-ignite-users.70518.x6.nabble.com/file/t2396/18.zip>
>
>
>
>
> -----
> Thanks and Regards,
> Gangaiah
> --
> Sent from: http://apache-ignite-users.70518.x6.nabble.com/
>

Re: partition-exchanger system-critical thread blocked

Posted by Gangaiah Gundeboina <ga...@gmail.com>.
Hi Ilya,

Please find attached entire log file.

Regards,
Gangaiah

18.zip <http://apache-ignite-users.70518.x6.nabble.com/file/t2396/18.zip>  



-----
Thanks and Regards,
Gangaiah
--
Sent from: http://apache-ignite-users.70518.x6.nabble.com/

Re: partition-exchanger system-critical thread blocked

Posted by Ilya Kasnacheev <il...@gmail.com>.
Hello!

We need to see all the lines after the following (but before the next
Thread):

                Line 41312: [2020-11-09T06:44:13,606][WARN
][tcp-disco-msg-worker-#2%EDIFCustomerCC%][G] Thread
[name="exchange-worker-#344%EDIFCustomerCC%", id=391, state=TIMED_WAITING,
blockCnt=16, waitCnt=2469710]
                Line 43302: Thread
[name="exchange-worker-#344%EDIFCustomerCC%", id=391, state=RUNNABLE,
blockCnt=16, waitCnt=2469710]
                Line 47326: [2020-11-09T10:55:18,888][WARN
][sys-stripe-118-#119%EDIFCustomerCC%][G] Thread
[name="exchange-worker-#344%EDIFCustomerCC%", id=391, state=TIMED_WAITING,
blockCnt=16, waitCnt=2469961]
                Line 49473: Thread
[name="exchange-worker-#344%EDIFCustomerCC%", id=391, state=TIMED_WAITING,
blockCnt=16, waitCnt=2469961]

Regards,
-- 
Ilya Kasnacheev


чт, 12 нояб. 2020 г. в 19:57, Gangaiah Gundeboina <gangaiah.yadav@gmail.com
>:

> HI Ilya Kasnacheev,
>
> Below are log entries with  the tread name 'partition-exchanger'
>
> ############################
>
> Line 41311:
>
> [2020-11-09T06:44:13,605][ERROR][tcp-disco-msg-worker-#2%EDIFCustomerCC%][G]
> Blocked system-critical thread has been detected. This can lead to
> cluster-wide undefined behaviour [threadName=partition-exchanger,
> blockedFor=60s]
>                 Line 41315:
> [2020-11-09T06:44:13,606][ERROR][tcp-disco-msg-worker-#2%EDIFCustomerCC%][]
> Critical system error detected. Will be handled accordingly to configured
> handler [hnd=StopNodeOrHaltFailureHandler [tryStop=false, timeout=0,
> super=AbstractFailureHandler [ignoredFailureTypes=[SYSTEM_WORKER_BLOCKED,
> SYSTEM_CRITICAL_OPERATION_TIMEOUT]]], failureCtx=FailureContext
> [type=SYSTEM_WORKER_BLOCKED, err=class o.a.i.IgniteException: GridWorker
> [name=partition-exchanger, igniteInstanceName=EDIFCustomerCC,
> finished=false, heartbeatTs=1604884393601]]]
>                 Line 41316: org.apache.ignite.IgniteException: GridWorker
> [name=partition-exchanger, igniteInstanceName=EDIFCustomerCC,
> finished=false, heartbeatTs=1604884393601]
>                 Line 47325:
> [2020-11-09T10:55:18,888][ERROR][sys-stripe-118-#119%EDIFCustomerCC%][G]
> Blocked system-critical thread has been detected. This can lead to
> cluster-wide undefined behaviour [threadName=partition-exchanger,
> blockedFor=60s]
>                 Line 47329:
> [2020-11-09T10:55:18,889][ERROR][sys-stripe-118-#119%EDIFCustomerCC%][]
> Critical system error detected. Will be handled accordingly to configured
> handler [hnd=StopNodeOrHaltFailureHandler [tryStop=false, timeout=0,
> super=AbstractFailureHandler [ignoredFailureTypes=[SYSTEM_WORKER_BLOCKED,
> SYSTEM_CRITICAL_OPERATION_TIMEOUT]]], failureCtx=FailureContext
> [type=SYSTEM_WORKER_BLOCKED, err=class o.a.i.IgniteException: GridWorker
> [name=partition-exchanger, igniteInstanceName=EDIFCustomerCC,
> finished=false, heartbeatTs=1604899458881]]]
>                 Line 47330: org.apache.ignite.IgniteException: GridWorker
> [name=partition-exchanger, igniteInstanceName=EDIFCustomerCC,
> finished=false, heartbeatTs=1604899458881]
> #################################################
>
>
> Below is the stack trace for 'exchange-worker-#344' worker thread,
>
> #################################################
> Line 41109: [2020-11-09T04:11:42,068][INFO
> ][exchange-worker-#344%EDIFCustomerCC%][GridCachePartitionExchangeManager]
> Skipping rebalancing (nothing scheduled) [top=AffinityTopologyVersion
> [topVer=2791, minorTopVer=0], force=false, evt=NODE_JOINED,
> node=f11cfea2-5ece-4867-b276-e94fa8458f47]
>                 Line 41116: [2020-11-09T04:11:52,060][INFO
> ][exchange-worker-#344%EDIFCustomerCC%][time] Started exchange init
> [topVer=AffinityTopologyVersion [topVer=2792, minorTopVer=0],
> mvccCrd=MvccCoordinator [nodeId=08260e5f-ae8d-44f2-b10a-dc3490133ee8,
> crdVer=1602946449957, topVer=AffinityTopologyVersion [topVer=6,
> minorTopVer=0]], mvccCrdChange=false, crd=false, evt=NODE_JOINED,
> evtNode=5c548b34-defa-4fcd-9bc7-364a4fbec8da, customEvt=null,
> allowMerge=true]
>                 Line 41117: [2020-11-09T04:11:52,062][INFO
> ][exchange-worker-#344%EDIFCustomerCC%][GridDhtPartitionsExchangeFuture]
> Finish exchange future [startVer=AffinityTopologyVersion [topVer=2792,
> minorTopVer=0], resVer=AffinityTopologyVersion [topVer=2792,
> minorTopVer=0],
> err=null]
>                 Line 41118: [2020-11-09T04:11:52,066][INFO
> ][exchange-worker-#344%EDIFCustomerCC%][GridDhtPartitionsExchangeFuture]
> Completed partition exchange
> [localNode=30b55ea5-18a4-4c15-b45f-7fe420ac00bd,
> exchange=GridDhtPartitionsExchangeFuture [topVer=AffinityTopologyVersion
> [topVer=2792, minorTopVer=0], evt=NODE_JOINED, evtNode=TcpDiscoveryNode
> [id=5c548b34-defa-4fcd-9bc7-364a4fbec8da, addrs=[0:0:0:0:0:0:0:1%lo,
> AA.BB.CC.DD, 127.0.0.1, 2405:200:a10:fc05:a2d3:c1ff:fef2:25a0%eth0],
> sockAddrs=[0:0:0:0:0:0:0:1%lo:0, /127.0.0.1:0,
> 2405:200:a10:fc05:a2d3:c1ff:fef2:25a0%eth0:0, AA.BB.CC.DD/AA.BB.CC.DD:0],
> discPort=0, order=2792, intOrder=1527, lastExchangeTime=1604875311250,
> loc=false, ver=2.7.6#20190911-sha1:21f7ca41, isClient=true], done=true],
> topVer=AffinityTopologyVersion [topVer=2792, minorTopVer=0],
> durationFromInit=0]
>                 Line 41119: [2020-11-09T04:11:52,067][INFO
> ][exchange-worker-#344%EDIFCustomerCC%][time] Finished exchange init
> [topVer=AffinityTopologyVersion [topVer=2792, minorTopVer=0], crd=false]
>                 Line 41120: [2020-11-09T04:11:52,072][INFO
> ][exchange-worker-#344%EDIFCustomerCC%][GridCachePartitionExchangeManager]
> Skipping rebalancing (nothing scheduled) [top=AffinityTopologyVersion
> [topVer=2792, minorTopVer=0], force=false, evt=NODE_JOINED,
> node=5c548b34-defa-4fcd-9bc7-364a4fbec8da]
>                 Line 41124: [2020-11-09T04:11:52,451][INFO
> ][exchange-worker-#344%EDIFCustomerCC%][time] Started exchange init
> [topVer=AffinityTopologyVersion [topVer=2793, minorTopVer=0],
> mvccCrd=MvccCoordinator [nodeId=08260e5f-ae8d-44f2-b10a-dc3490133ee8,
> crdVer=1602946449957, topVer=AffinityTopologyVersion [topVer=6,
> minorTopVer=0]], mvccCrdChange=false, crd=false, evt=NODE_JOINED,
> evtNode=076584a1-154f-4e3c-8f45-6ee096a71868, customEvt=null,
> allowMerge=true]
>                 Line 41125: [2020-11-09T04:11:52,453][INFO
> ][exchange-worker-#344%EDIFCustomerCC%][GridDhtPartitionsExchangeFuture]
> Finish exchange future [startVer=AffinityTopologyVersion [topVer=2793,
> minorTopVer=0], resVer=AffinityTopologyVersion [topVer=2793,
> minorTopVer=0],
> err=null]
>                 Line 41126: [2020-11-09T04:11:52,458][INFO
> ][exchange-worker-#344%EDIFCustomerCC%][GridDhtPartitionsExchangeFuture]
> Completed partition exchange
> [localNode=30b55ea5-18a4-4c15-b45f-7fe420ac00bd,
> exchange=GridDhtPartitionsExchangeFuture [topVer=AffinityTopologyVersion
> [topVer=2793, minorTopVer=0], evt=NODE_JOINED, evtNode=TcpDiscoveryNode
> [id=076584a1-154f-4e3c-8f45-6ee096a71868, addrs=[0:0:0:0:0:0:0:1%lo,
> AA.BB.CC.DD, 127.0.0.1, 2405:200:a10:fc05:a2d3:c1ff:fef2:25a0%eth0],
> sockAddrs=[0:0:0:0:0:0:0:1%lo:0, /127.0.0.1:0,
> 2405:200:a10:fc05:a2d3:c1ff:fef2:25a0%eth0:0, AA.BB.CC.DD/AA.BB.CC.DD:0],
> discPort=0, order=2793, intOrder=1528, lastExchangeTime=1604875311403,
> loc=false, ver=2.7.6#20190911-sha1:21f7ca41, isClient=true], done=true],
> topVer=AffinityTopologyVersion [topVer=2793, minorTopVer=0],
> durationFromInit=10]
>                 Line 41127: [2020-11-09T04:11:52,458][INFO
> ][exchange-worker-#344%EDIFCustomerCC%][time] Finished exchange init
> [topVer=AffinityTopologyVersion [topVer=2793, minorTopVer=0], crd=false]
>                 Line 41128: [2020-11-09T04:11:52,472][INFO
> ][exchange-worker-#344%EDIFCustomerCC%][GridCachePartitionExchangeManager]
> Skipping rebalancing (nothing scheduled) [top=AffinityTopologyVersion
> [topVer=2793, minorTopVer=0], force=false, evt=NODE_JOINED,
> node=076584a1-154f-4e3c-8f45-6ee096a71868]
>                 Line 41136: [2020-11-09T04:12:13,569][INFO
> ][exchange-worker-#344%EDIFCustomerCC%][time] Started exchange init
> [topVer=AffinityTopologyVersion [topVer=2794, minorTopVer=0],
> mvccCrd=MvccCoordinator [nodeId=08260e5f-ae8d-44f2-b10a-dc3490133ee8,
> crdVer=1602946449957, topVer=AffinityTopologyVersion [topVer=6,
> minorTopVer=0]], mvccCrdChange=false, crd=false, evt=NODE_JOINED,
> evtNode=5ee5d2c6-76d6-4aba-a160-d084d3ad9e21, customEvt=null,
> allowMerge=true]
>                 Line 41137: [2020-11-09T04:12:13,571][INFO
> ][exchange-worker-#344%EDIFCustomerCC%][GridDhtPartitionsExchangeFuture]
> Finish exchange future [startVer=AffinityTopologyVersion [topVer=2794,
> minorTopVer=0], resVer=AffinityTopologyVersion [topVer=2794,
> minorTopVer=0],
> err=null]
>                 Line 41138: [2020-11-09T04:12:13,575][INFO
> ][exchange-worker-#344%EDIFCustomerCC%][GridDhtPartitionsExchangeFuture]
> Completed partition exchange
> [localNode=30b55ea5-18a4-4c15-b45f-7fe420ac00bd,
> exchange=GridDhtPartitionsExchangeFuture [topVer=AffinityTopologyVersion
> [topVer=2794, minorTopVer=0], evt=NODE_JOINED, evtNode=TcpDiscoveryNode
> [id=5ee5d2c6-76d6-4aba-a160-d084d3ad9e21, addrs=[0:0:0:0:0:0:0:1%lo,
> AA.BB.CC.DD, 127.0.0.1], sockAddrs=[AA.BB.CC.DD/AA.BB.CC.DD:0,
> 0:0:0:0:0:0:0:1%lo:0, /127.0.0.1:0], discPort=0, order=2794,
> intOrder=1529,
> lastExchangeTime=1604875332557, loc=false,
> ver=2.7.6#20190911-sha1:21f7ca41,
> isClient=true], done=true], topVer=AffinityTopologyVersion [topVer=2794,
> minorTopVer=0], durationFromInit=0]
>                 Line 41139: [2020-11-09T04:12:13,576][INFO
> ][exchange-worker-#344%EDIFCustomerCC%][time] Finished exchange init
> [topVer=AffinityTopologyVersion [topVer=2794, minorTopVer=0], crd=false]
>                 Line 41140: [2020-11-09T04:12:13,586][INFO
> ][exchange-worker-#344%EDIFCustomerCC%][GridCachePartitionExchangeManager]
> Skipping rebalancing (nothing scheduled) [top=AffinityTopologyVersion
> [topVer=2794, minorTopVer=0], force=false, evt=NODE_JOINED,
> node=5ee5d2c6-76d6-4aba-a160-d084d3ad9e21]
>                 Line 41312: [2020-11-09T06:44:13,606][WARN
> ][tcp-disco-msg-worker-#2%EDIFCustomerCC%][G] Thread
> [name="exchange-worker-#344%EDIFCustomerCC%", id=391, state=TIMED_WAITING,
> blockCnt=16, waitCnt=2469710]
>                 Line 43302: Thread
> [name="exchange-worker-#344%EDIFCustomerCC%", id=391, state=RUNNABLE,
> blockCnt=16, waitCnt=2469710]
>                 Line 47326: [2020-11-09T10:55:18,888][WARN
> ][sys-stripe-118-#119%EDIFCustomerCC%][G] Thread
> [name="exchange-worker-#344%EDIFCustomerCC%", id=391, state=TIMED_WAITING,
> blockCnt=16, waitCnt=2469961]
>                 Line 49473: Thread
> [name="exchange-worker-#344%EDIFCustomerCC%", id=391, state=TIMED_WAITING,
> blockCnt=16, waitCnt=2469961]
>
> Thanks and Regards,
> Gangaiah
>
>
>
> -----
> Thanks and Regards,
> Gangaiah
> --
> Sent from: http://apache-ignite-users.70518.x6.nabble.com/
>

Re: partition-exchanger system-critical thread blocked

Posted by Gangaiah Gundeboina <ga...@gmail.com>.
HI Ilya Kasnacheev,

Below are log entries with  the tread name 'partition-exchanger'

############################

Line 41311:
[2020-11-09T06:44:13,605][ERROR][tcp-disco-msg-worker-#2%EDIFCustomerCC%][G]
Blocked system-critical thread has been detected. This can lead to
cluster-wide undefined behaviour [threadName=partition-exchanger,
blockedFor=60s]
                Line 41315:
[2020-11-09T06:44:13,606][ERROR][tcp-disco-msg-worker-#2%EDIFCustomerCC%][]
Critical system error detected. Will be handled accordingly to configured
handler [hnd=StopNodeOrHaltFailureHandler [tryStop=false, timeout=0,
super=AbstractFailureHandler [ignoredFailureTypes=[SYSTEM_WORKER_BLOCKED,
SYSTEM_CRITICAL_OPERATION_TIMEOUT]]], failureCtx=FailureContext
[type=SYSTEM_WORKER_BLOCKED, err=class o.a.i.IgniteException: GridWorker
[name=partition-exchanger, igniteInstanceName=EDIFCustomerCC,
finished=false, heartbeatTs=1604884393601]]]
                Line 41316: org.apache.ignite.IgniteException: GridWorker
[name=partition-exchanger, igniteInstanceName=EDIFCustomerCC,
finished=false, heartbeatTs=1604884393601]
                Line 47325:
[2020-11-09T10:55:18,888][ERROR][sys-stripe-118-#119%EDIFCustomerCC%][G]
Blocked system-critical thread has been detected. This can lead to
cluster-wide undefined behaviour [threadName=partition-exchanger,
blockedFor=60s]
                Line 47329:
[2020-11-09T10:55:18,889][ERROR][sys-stripe-118-#119%EDIFCustomerCC%][]
Critical system error detected. Will be handled accordingly to configured
handler [hnd=StopNodeOrHaltFailureHandler [tryStop=false, timeout=0,
super=AbstractFailureHandler [ignoredFailureTypes=[SYSTEM_WORKER_BLOCKED,
SYSTEM_CRITICAL_OPERATION_TIMEOUT]]], failureCtx=FailureContext
[type=SYSTEM_WORKER_BLOCKED, err=class o.a.i.IgniteException: GridWorker
[name=partition-exchanger, igniteInstanceName=EDIFCustomerCC,
finished=false, heartbeatTs=1604899458881]]]
                Line 47330: org.apache.ignite.IgniteException: GridWorker
[name=partition-exchanger, igniteInstanceName=EDIFCustomerCC,
finished=false, heartbeatTs=1604899458881]
#################################################


Below is the stack trace for 'exchange-worker-#344' worker thread,

#################################################
Line 41109: [2020-11-09T04:11:42,068][INFO
][exchange-worker-#344%EDIFCustomerCC%][GridCachePartitionExchangeManager]
Skipping rebalancing (nothing scheduled) [top=AffinityTopologyVersion
[topVer=2791, minorTopVer=0], force=false, evt=NODE_JOINED,
node=f11cfea2-5ece-4867-b276-e94fa8458f47]
                Line 41116: [2020-11-09T04:11:52,060][INFO
][exchange-worker-#344%EDIFCustomerCC%][time] Started exchange init
[topVer=AffinityTopologyVersion [topVer=2792, minorTopVer=0],
mvccCrd=MvccCoordinator [nodeId=08260e5f-ae8d-44f2-b10a-dc3490133ee8,
crdVer=1602946449957, topVer=AffinityTopologyVersion [topVer=6,
minorTopVer=0]], mvccCrdChange=false, crd=false, evt=NODE_JOINED,
evtNode=5c548b34-defa-4fcd-9bc7-364a4fbec8da, customEvt=null,
allowMerge=true]
                Line 41117: [2020-11-09T04:11:52,062][INFO
][exchange-worker-#344%EDIFCustomerCC%][GridDhtPartitionsExchangeFuture]
Finish exchange future [startVer=AffinityTopologyVersion [topVer=2792,
minorTopVer=0], resVer=AffinityTopologyVersion [topVer=2792, minorTopVer=0],
err=null]
                Line 41118: [2020-11-09T04:11:52,066][INFO
][exchange-worker-#344%EDIFCustomerCC%][GridDhtPartitionsExchangeFuture]
Completed partition exchange
[localNode=30b55ea5-18a4-4c15-b45f-7fe420ac00bd,
exchange=GridDhtPartitionsExchangeFuture [topVer=AffinityTopologyVersion
[topVer=2792, minorTopVer=0], evt=NODE_JOINED, evtNode=TcpDiscoveryNode
[id=5c548b34-defa-4fcd-9bc7-364a4fbec8da, addrs=[0:0:0:0:0:0:0:1%lo,
AA.BB.CC.DD, 127.0.0.1, 2405:200:a10:fc05:a2d3:c1ff:fef2:25a0%eth0],
sockAddrs=[0:0:0:0:0:0:0:1%lo:0, /127.0.0.1:0,
2405:200:a10:fc05:a2d3:c1ff:fef2:25a0%eth0:0, AA.BB.CC.DD/AA.BB.CC.DD:0],
discPort=0, order=2792, intOrder=1527, lastExchangeTime=1604875311250,
loc=false, ver=2.7.6#20190911-sha1:21f7ca41, isClient=true], done=true],
topVer=AffinityTopologyVersion [topVer=2792, minorTopVer=0],
durationFromInit=0]
                Line 41119: [2020-11-09T04:11:52,067][INFO
][exchange-worker-#344%EDIFCustomerCC%][time] Finished exchange init
[topVer=AffinityTopologyVersion [topVer=2792, minorTopVer=0], crd=false]
                Line 41120: [2020-11-09T04:11:52,072][INFO
][exchange-worker-#344%EDIFCustomerCC%][GridCachePartitionExchangeManager]
Skipping rebalancing (nothing scheduled) [top=AffinityTopologyVersion
[topVer=2792, minorTopVer=0], force=false, evt=NODE_JOINED,
node=5c548b34-defa-4fcd-9bc7-364a4fbec8da]
                Line 41124: [2020-11-09T04:11:52,451][INFO
][exchange-worker-#344%EDIFCustomerCC%][time] Started exchange init
[topVer=AffinityTopologyVersion [topVer=2793, minorTopVer=0],
mvccCrd=MvccCoordinator [nodeId=08260e5f-ae8d-44f2-b10a-dc3490133ee8,
crdVer=1602946449957, topVer=AffinityTopologyVersion [topVer=6,
minorTopVer=0]], mvccCrdChange=false, crd=false, evt=NODE_JOINED,
evtNode=076584a1-154f-4e3c-8f45-6ee096a71868, customEvt=null,
allowMerge=true]
                Line 41125: [2020-11-09T04:11:52,453][INFO
][exchange-worker-#344%EDIFCustomerCC%][GridDhtPartitionsExchangeFuture]
Finish exchange future [startVer=AffinityTopologyVersion [topVer=2793,
minorTopVer=0], resVer=AffinityTopologyVersion [topVer=2793, minorTopVer=0],
err=null]
                Line 41126: [2020-11-09T04:11:52,458][INFO
][exchange-worker-#344%EDIFCustomerCC%][GridDhtPartitionsExchangeFuture]
Completed partition exchange
[localNode=30b55ea5-18a4-4c15-b45f-7fe420ac00bd,
exchange=GridDhtPartitionsExchangeFuture [topVer=AffinityTopologyVersion
[topVer=2793, minorTopVer=0], evt=NODE_JOINED, evtNode=TcpDiscoveryNode
[id=076584a1-154f-4e3c-8f45-6ee096a71868, addrs=[0:0:0:0:0:0:0:1%lo,
AA.BB.CC.DD, 127.0.0.1, 2405:200:a10:fc05:a2d3:c1ff:fef2:25a0%eth0],
sockAddrs=[0:0:0:0:0:0:0:1%lo:0, /127.0.0.1:0,
2405:200:a10:fc05:a2d3:c1ff:fef2:25a0%eth0:0, AA.BB.CC.DD/AA.BB.CC.DD:0],
discPort=0, order=2793, intOrder=1528, lastExchangeTime=1604875311403,
loc=false, ver=2.7.6#20190911-sha1:21f7ca41, isClient=true], done=true],
topVer=AffinityTopologyVersion [topVer=2793, minorTopVer=0],
durationFromInit=10]
                Line 41127: [2020-11-09T04:11:52,458][INFO
][exchange-worker-#344%EDIFCustomerCC%][time] Finished exchange init
[topVer=AffinityTopologyVersion [topVer=2793, minorTopVer=0], crd=false]
                Line 41128: [2020-11-09T04:11:52,472][INFO
][exchange-worker-#344%EDIFCustomerCC%][GridCachePartitionExchangeManager]
Skipping rebalancing (nothing scheduled) [top=AffinityTopologyVersion
[topVer=2793, minorTopVer=0], force=false, evt=NODE_JOINED,
node=076584a1-154f-4e3c-8f45-6ee096a71868]
                Line 41136: [2020-11-09T04:12:13,569][INFO
][exchange-worker-#344%EDIFCustomerCC%][time] Started exchange init
[topVer=AffinityTopologyVersion [topVer=2794, minorTopVer=0],
mvccCrd=MvccCoordinator [nodeId=08260e5f-ae8d-44f2-b10a-dc3490133ee8,
crdVer=1602946449957, topVer=AffinityTopologyVersion [topVer=6,
minorTopVer=0]], mvccCrdChange=false, crd=false, evt=NODE_JOINED,
evtNode=5ee5d2c6-76d6-4aba-a160-d084d3ad9e21, customEvt=null,
allowMerge=true]
                Line 41137: [2020-11-09T04:12:13,571][INFO
][exchange-worker-#344%EDIFCustomerCC%][GridDhtPartitionsExchangeFuture]
Finish exchange future [startVer=AffinityTopologyVersion [topVer=2794,
minorTopVer=0], resVer=AffinityTopologyVersion [topVer=2794, minorTopVer=0],
err=null]
                Line 41138: [2020-11-09T04:12:13,575][INFO
][exchange-worker-#344%EDIFCustomerCC%][GridDhtPartitionsExchangeFuture]
Completed partition exchange
[localNode=30b55ea5-18a4-4c15-b45f-7fe420ac00bd,
exchange=GridDhtPartitionsExchangeFuture [topVer=AffinityTopologyVersion
[topVer=2794, minorTopVer=0], evt=NODE_JOINED, evtNode=TcpDiscoveryNode
[id=5ee5d2c6-76d6-4aba-a160-d084d3ad9e21, addrs=[0:0:0:0:0:0:0:1%lo,
AA.BB.CC.DD, 127.0.0.1], sockAddrs=[AA.BB.CC.DD/AA.BB.CC.DD:0,
0:0:0:0:0:0:0:1%lo:0, /127.0.0.1:0], discPort=0, order=2794, intOrder=1529,
lastExchangeTime=1604875332557, loc=false, ver=2.7.6#20190911-sha1:21f7ca41,
isClient=true], done=true], topVer=AffinityTopologyVersion [topVer=2794,
minorTopVer=0], durationFromInit=0]
                Line 41139: [2020-11-09T04:12:13,576][INFO
][exchange-worker-#344%EDIFCustomerCC%][time] Finished exchange init
[topVer=AffinityTopologyVersion [topVer=2794, minorTopVer=0], crd=false]
                Line 41140: [2020-11-09T04:12:13,586][INFO
][exchange-worker-#344%EDIFCustomerCC%][GridCachePartitionExchangeManager]
Skipping rebalancing (nothing scheduled) [top=AffinityTopologyVersion
[topVer=2794, minorTopVer=0], force=false, evt=NODE_JOINED,
node=5ee5d2c6-76d6-4aba-a160-d084d3ad9e21]
                Line 41312: [2020-11-09T06:44:13,606][WARN
][tcp-disco-msg-worker-#2%EDIFCustomerCC%][G] Thread
[name="exchange-worker-#344%EDIFCustomerCC%", id=391, state=TIMED_WAITING,
blockCnt=16, waitCnt=2469710]
                Line 43302: Thread
[name="exchange-worker-#344%EDIFCustomerCC%", id=391, state=RUNNABLE,
blockCnt=16, waitCnt=2469710]
                Line 47326: [2020-11-09T10:55:18,888][WARN
][sys-stripe-118-#119%EDIFCustomerCC%][G] Thread
[name="exchange-worker-#344%EDIFCustomerCC%", id=391, state=TIMED_WAITING,
blockCnt=16, waitCnt=2469961]
                Line 49473: Thread
[name="exchange-worker-#344%EDIFCustomerCC%", id=391, state=TIMED_WAITING,
blockCnt=16, waitCnt=2469961]

Thanks and Regards,
Gangaiah



-----
Thanks and Regards,
Gangaiah
--
Sent from: http://apache-ignite-users.70518.x6.nabble.com/

Re: partition-exchanger system-critical thread blocked

Posted by Ilya Kasnacheev <il...@gmail.com>.
Hello!

Can you please find an actual stack trace from partition-exchanger thread
in that log?

One that starts with Thread [name="partition-exchanger" ?

Regards,
-- 
Ilya Kasnacheev


вт, 10 нояб. 2020 г. в 15:54, Kamlesh Joshi <Ka...@ril.com>:

> Hi Igniters,
>
>
>
> Sometimes 'Blocked system-critical thread' errors are getting printed in
> prod cluster logs. As per below logs, it's saying exchange-worker blocked
> for 60s. There is no node join/left or cluster activation or cache
> creation, still why partition exchange is triggered. Even it is triggered,
>  it is blocked for *60s *which is huge time from prod perspective.
>
>
>
> Below are error details,
>
> [2020-11-09T10:55:18,888][ERROR][sys-stripe-118-#119%EDIFCustomerCC%][G]
> Blocked system-critical thread has been detected. This can lead to
> cluster-wide undefined behaviour [threadName=partition-exchanger,
> *blockedFor=60s*]
>
> [2020-11-09T10:55:18,888][WARN ][sys-stripe-118-#119%EDIFCustomerCC%][G]
> Thread [name="exchange-worker-#344%EDIFCustomerCC%", id=391,
> state=TIMED_WAITING, blockCnt=16, waitCnt=2469961]
>
>
>
> Cluster is responding but these errors priting, we are not understanding
> what's the cause, could you please help us.
>
>
>
> Below is log snippet,
>
>
>
> [2020-11-09T10:51:21,458][INFO
> ][db-checkpoint-thread-#384%EDIFCustomerCC%][GridCacheDatabaseSharedManager]
> Checkpoint started [checkpointId=3bd28c8f-d9c9-4110-ab8c-24cf3a5c44a3,
> startPtr=FileWALPointer [idx=1279664, fileOff=28148165, len=49557],
> checkpointLockWait=0ms, checkpointLockHoldTime=34ms,
> walCpRecordFsyncDuration=7ms, pages=84152, reason='timeout']
>
> [2020-11-09T10:51:23,499][INFO
> ][db-checkpoint-thread-#384%EDIFCustomerCC%][GridCacheDatabaseSharedManager]
> Checkpoint finished [cpId=3bd28c8f-d9c9-4110-ab8c-24cf3a5c44a3,
> pages=84152, markPos=FileWALPointer [idx=1279664, fileOff=28148165,
> len=49557], walSegmentsCleared=5, walSegmentsCovered=[1279658 - 1279663],
> markDuration=79ms, pagesWrite=1195ms, fsync=845ms, total=2119ms]
>
> [2020-11-09T10:51:36,788][INFO
> ][wal-file-archiver%EDIFCustomerCC-#345%EDIFCustomerCC%][FileWriteAheadLogManager]
> Starting to copy WAL segment [absIdx=1279664, segIdx=4,
> origFile=/datastore1/wal/node00-651e4920-2fb4-4dd5-9258-52a9f359ac35/0000000000000004.wal,
> dstFile=/datastore1/archive/node00-651e4920-2fb4-4dd5-9258-52a9f359ac35/0000000001279664.wal]
>
> [2020-11-09T10:51:36,954][INFO
> ][wal-file-archiver%EDIFCustomerCC-#345%EDIFCustomerCC%][FileWriteAheadLogManager]
> Copied file
> [src=/datastore1/wal/node00-651e4920-2fb4-4dd5-9258-52a9f359ac35/0000000000000004.wal,
> dst=/datastore1/archive/node00-651e4920-2fb4-4dd5-9258-52a9f359ac35/0000000001279664.wal]
>
> [2020-11-09T10:52:02,018][INFO
> ][wal-file-archiver%EDIFCustomerCC-#345%EDIFCustomerCC%][FileWriteAheadLogManager]
> Starting to copy WAL segment [absIdx=1279665, segIdx=5,
> origFile=/datastore1/wal/node00-651e4920-2fb4-4dd5-9258-52a9f359ac35/0000000000000005.wal,
> dstFile=/datastore1/archive/node00-651e4920-2fb4-4dd5-9258-52a9f359ac35/0000000001279665.wal]
>
> [2020-11-09T10:52:02,200][INFO
> ][wal-file-archiver%EDIFCustomerCC-#345%EDIFCustomerCC%][FileWriteAheadLogManager]
> Copied file
> [src=/datastore1/wal/node00-651e4920-2fb4-4dd5-9258-52a9f359ac35/0000000000000005.wal,
> dst=/datastore1/archive/node00-651e4920-2fb4-4dd5-9258-52a9f359ac35/0000000001279665.wal]
>
> [2020-11-09T10:52:36,541][INFO
> ][wal-file-archiver%EDIFCustomerCC-#345%EDIFCustomerCC%][FileWriteAheadLogManager]
> Starting to copy WAL segment [absIdx=1279666, segIdx=6,
> origFile=/datastore1/wal/node00-651e4920-2fb4-4dd5-9258-52a9f359ac35/0000000000000006.wal,
> dstFile=/datastore1/archive/node00-651e4920-2fb4-4dd5-9258-52a9f359ac35/0000000001279666.wal]
>
> [2020-11-09T10:52:36,703][INFO
> ][wal-file-archiver%EDIFCustomerCC-#345%EDIFCustomerCC%][FileWriteAheadLogManager]
> Copied file
> [src=/datastore1/wal/node00-651e4920-2fb4-4dd5-9258-52a9f359ac35/0000000000000006.wal,
> dst=/datastore1/archive/node00-651e4920-2fb4-4dd5-9258-52a9f359ac35/0000000001279666.wal]
>
> [2020-11-09T10:53:11,068][INFO
> ][wal-file-archiver%EDIFCustomerCC-#345%EDIFCustomerCC%][FileWriteAheadLogManager]
> Starting to copy WAL segment [absIdx=1279667, segIdx=7,
> origFile=/datastore1/wal/node00-651e4920-2fb4-4dd5-9258-52a9f359ac35/0000000000000007.wal,
> dstFile=/datastore1/archive/node00-651e4920-2fb4-4dd5-9258-52a9f359ac35/0000000001279667.wal]
>
> [2020-11-09T10:53:11,239][INFO
> ][wal-file-archiver%EDIFCustomerCC-#345%EDIFCustomerCC%][FileWriteAheadLogManager]
> Copied file
> [src=/datastore1/wal/node00-651e4920-2fb4-4dd5-9258-52a9f359ac35/0000000000000007.wal,
> dst=/datastore1/archive/node00-651e4920-2fb4-4dd5-9258-52a9f359ac35/0000000001279667.wal]
>
> [2020-11-09T10:53:45,694][INFO
> ][wal-file-archiver%EDIFCustomerCC-#345%EDIFCustomerCC%][FileWriteAheadLogManager]
> Starting to copy WAL segment [absIdx=1279668, segIdx=8,
> origFile=/datastore1/wal/node00-651e4920-2fb4-4dd5-9258-52a9f359ac35/0000000000000008.wal,
> dstFile=/datastore1/archive/node00-651e4920-2fb4-4dd5-9258-52a9f359ac35/0000000001279668.wal]
>
> [2020-11-09T10:53:45,868][INFO
> ][wal-file-archiver%EDIFCustomerCC-#345%EDIFCustomerCC%][FileWriteAheadLogManager]
> Copied file
> [src=/datastore1/wal/node00-651e4920-2fb4-4dd5-9258-52a9f359ac35/0000000000000008.wal,
> dst=/datastore1/archive/node00-651e4920-2fb4-4dd5-9258-52a9f359ac35/0000000001279668.wal]
>
> [2020-11-09T10:54:21,411][INFO
> ][wal-file-archiver%EDIFCustomerCC-#345%EDIFCustomerCC%][FileWriteAheadLogManager]
> Starting to copy WAL segment [absIdx=1279669, segIdx=9,
> origFile=/datastore1/wal/node00-651e4920-2fb4-4dd5-9258-52a9f359ac35/0000000000000009.wal,
> dstFile=/datastore1/archive/node00-651e4920-2fb4-4dd5-9258-52a9f359ac35/0000000001279669.wal]
>
> [2020-11-09T10:54:21,480][INFO
> ][db-checkpoint-thread-#384%EDIFCustomerCC%][GridCacheDatabaseSharedManager]
> Checkpoint started [checkpointId=33221fc2-5ab0-4929-9d23-8940e0552272,
> startPtr=FileWALPointer [idx=1279670, fileOff=5561792, len=49557],
> checkpointLockWait=0ms, checkpointLockHoldTime=51ms,
> walCpRecordFsyncDuration=5ms, pages=86602, reason='timeout']
>
> [2020-11-09T10:54:21,578][INFO
> ][wal-file-archiver%EDIFCustomerCC-#345%EDIFCustomerCC%][FileWriteAheadLogManager]
> Copied file
> [src=/datastore1/wal/node00-651e4920-2fb4-4dd5-9258-52a9f359ac35/0000000000000009.wal,
> dst=/datastore1/archive/node00-651e4920-2fb4-4dd5-9258-52a9f359ac35/0000000001279669.wal]
>
> [2020-11-09T10:54:23,624][INFO
> ][db-checkpoint-thread-#384%EDIFCustomerCC%][GridCacheDatabaseSharedManager]
> Checkpoint finished [cpId=33221fc2-5ab0-4929-9d23-8940e0552272,
> pages=86602, markPos=FileWALPointer [idx=1279670, fileOff=5561792,
> len=49557], walSegmentsCleared=6, walSegmentsCovered=[1279664 - 1279669],
> markDuration=94ms, pagesWrite=1219ms, fsync=925ms, total=2238ms]
>
> [2020-11-09T10:54:45,727][INFO
> ][wal-file-archiver%EDIFCustomerCC-#345%EDIFCustomerCC%][FileWriteAheadLogManager]
> Starting to copy WAL segment [absIdx=1279670, segIdx=0,
> origFile=/datastore1/wal/node00-651e4920-2fb4-4dd5-9258-52a9f359ac35/0000000000000000.wal,
> dstFile=/datastore1/archive/node00-651e4920-2fb4-4dd5-9258-52a9f359ac35/0000000001279670.wal]
>
> [2020-11-09T10:54:45,886][INFO
> ][wal-file-archiver%EDIFCustomerCC-#345%EDIFCustomerCC%][FileWriteAheadLogManager]
> Copied file
> [src=/datastore1/wal/node00-651e4920-2fb4-4dd5-9258-52a9f359ac35/0000000000000000.wal,
> dst=/datastore1/archive/node00-651e4920-2fb4-4dd5-9258-52a9f359ac35/0000000001279670.wal]
>
> [2020-11-09T10:55:17,146][INFO
> ][wal-file-archiver%EDIFCustomerCC-#345%EDIFCustomerCC%][FileWriteAheadLogManager]
> Starting to copy WAL segment [absIdx=1279671, segIdx=1,
> origFile=/datastore1/wal/node00-651e4920-2fb4-4dd5-9258-52a9f359ac35/0000000000000001.wal,
> dstFile=/datastore1/archive/node00-651e4920-2fb4-4dd5-9258-52a9f359ac35/0000000001279671.wal]
>
> [2020-11-09T10:55:17,343][INFO
> ][wal-file-archiver%EDIFCustomerCC-#345%EDIFCustomerCC%][FileWriteAheadLogManager]
> Copied file
> [src=/datastore1/wal/node00-651e4920-2fb4-4dd5-9258-52a9f359ac35/0000000000000001.wal,
> dst=/datastore1/archive/node00-651e4920-2fb4-4dd5-9258-52a9f359ac35/0000000001279671.wal]
>
> [2020-11-09T10:55:18,888][ERROR][sys-stripe-118-#119%EDIFCustomerCC%][G]
> Blocked system-critical thread has been detected. This can lead to
> cluster-wide undefined behaviour [threadName=partition-exchanger,
> blockedFor=60s]
>
> [2020-11-09T10:55:18,888][WARN ][sys-stripe-118-#119%EDIFCustomerCC%][G]
> Thread [name="exchange-worker-#344%EDIFCustomerCC%", id=391,
> state=TIMED_WAITING, blockCnt=16, waitCnt=2469961]
>
>     Lock
> [object=java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject@223507fb,
> ownerName=null, ownerId=-1]
>
>
>
> [2020-11-09T10:55:18,889][ERROR][sys-stripe-118-#119%EDIFCustomerCC%][]
> Critical system error detected. Will be handled accordingly to configured
> handler [hnd=StopNodeOrHaltFailureHandler [tryStop=false, timeout=0,
> super=AbstractFailureHandler [ignoredFailureTypes=[SYSTEM_WORKER_BLOCKED,
> SYSTEM_CRITICAL_OPERATION_TIMEOUT]]], failureCtx=FailureContext
> [type=SYSTEM_WORKER_BLOCKED, err=class o.a.i.IgniteException: GridWorker
> [name=partition-exchanger, igniteInstanceName=EDIFCustomerCC,
> finished=false, heartbeatTs=1604899458881]]]
>
> org.apache.ignite.IgniteException: GridWorker [name=partition-exchanger,
> igniteInstanceName=EDIFCustomerCC, finished=false,
> heartbeatTs=1604899458881]
>
>                 at
> org.apache.ignite.internal.IgnitionEx$IgniteNamedInstance$2.apply(IgnitionEx.java:1831)
> [ignite-core-2.7.6.jar:2.7.6]
>
>                 at
> org.apache.ignite.internal.IgnitionEx$IgniteNamedInstance$2.apply(IgnitionEx.java:1826)
> [ignite-core-2.7.6.jar:2.7.6]
>
>                 at
> org.apache.ignite.internal.worker.WorkersRegistry.onIdle(WorkersRegistry.java:233)
> [ignite-core-2.7.6.jar:2.7.6]
>
>                 at
> org.apache.ignite.internal.util.worker.GridWorker.onIdle(GridWorker.java:297)
> [ignite-core-2.7.6.jar:2.7.6]
>
>                 at
> org.apache.ignite.internal.util.StripedExecutor$Stripe.body(StripedExecutor.java:513)
> [ignite-core-2.7.6.jar:2.7.6]
>
>                 at
> org.apache.ignite.internal.util.worker.GridWorker.run(GridWorker.java:120)
> [ignite-core-2.7.6.jar:2.7.6]
>
>                 at java.lang.Thread.run(Thread.java:748) [?:1.8.0_151]
>
> [2020-11-09T10:55:18,894][WARN
> ][sys-stripe-118-#119%EDIFCustomerCC%][FailureProcessor] No deadlocked
> threads detected.
>
> [2020-11-09T10:55:21,404][WARN
> ][jvm-pause-detector-worker][IgniteKernal%EDIFCustomerCC] Possible too long
> JVM pause: 2477 milliseconds.
>
> [2020-11-09T10:55:21,422][WARN
> ][sys-stripe-118-#119%EDIFCustomerCC%][FailureProcessor] Thread dump at
> 2020/11/09 10:55:21 IST
>
> Thread [name="sys-#213675%EDIFCustomerCC%", id=369652,
> state=TIMED_WAITING, blockCnt=0, waitCnt=1]
>
>     Lock
> [object=java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject@42c969fd,
> ownerName=null, ownerId=-1]
>
>         at sun.misc.Unsafe.park(Native Method)
>
>         at
> java.util.concurrent.locks.LockSupport.parkNanos(LockSupport.java:215)
>
>         at
> java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.awaitNanos(AbstractQueuedSynchronizer.java:2078)
>
>         at
> java.util.concurrent.LinkedBlockingQueue.poll(LinkedBlockingQueue.java:467)
>
>         at
> java.util.concurrent.ThreadPoolExecutor.getTask(ThreadPoolExecutor.java:1073)
>
>         at
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1134)
>
>         at
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>
>         at java.lang.Thread.run(Thread.java:748)
>
>
>
> Thread [name="sys-#213674%EDIFCustomerCC%", id=369651,
> state=TIMED_WAITING, blockCnt=0, waitCnt=1]
>
>     Lock
> [object=java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject@42c969fd,
> ownerName=null, ownerId=-1]
>
>         at sun.misc.Unsafe.park(Native Method)
>
>         at
> java.util.concurrent.locks.LockSupport.parkNanos(LockSupport.java:215)
>
>         at
> java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.awaitNanos(AbstractQueuedSynchronizer.java:2078)
>
>         at
> java.util.concurrent.LinkedBlockingQueue.poll(LinkedBlockingQueue.java:467)
>
>         at
> java.util.concurrent.ThreadPoolExecutor.getTask(ThreadPoolExecutor.java:1073)
>
>         at
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1134)
>
>         at
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>
>         at java.lang.Thread.run(Thread.java:748)
>
>
>
>
>
> *Thanks and Regards,*
>
> *Kamlesh Joshi*
>
>
>
>
> "*Confidentiality Warning*: This message and any attachments are intended
> only for the use of the intended recipient(s), are confidential and may be
> privileged. If you are not the intended recipient, you are hereby notified
> that any review, re-transmission, conversion to hard copy, copying,
> circulation or other use of this message and any attachments is strictly
> prohibited. If you are not the intended recipient, please notify the sender
> immediately by return email and delete this message and any attachments
> from your system.
>
> *Virus Warning:* Although the company has taken reasonable precautions to
> ensure no viruses are present in this email. The company cannot accept
> responsibility for any loss or damage arising from the use of this email or
> attachment."
>