Posted to user@ignite.apache.org by Kamlesh Joshi <Ka...@ril.com> on 2020/07/04 11:44:06 UTC

Ignite cluster became unresponsive

Hi Team,

We encountered the following problem in our PROD environment, after which all traffic halted for around 10 minutes. We recently upgraded our cluster from Ignite 2.6.0 to 2.7.6.
Is this related to any known open defect in this version? Has anyone observed the same problem before?

Any help or pointers around this will be appreciated.


[2020-07-03T18:17:11,613][ERROR][sys-stripe-36-#37%CustomerCC%][G] Blocked system-critical thread has been detected. This can lead to cluster-wide undefined behaviour
[threadName=partition-exchanger, blockedFor=480s]
[2020-07-03T18:17:11,613][WARN ][sys-stripe-36-#37%CustomerCC%][G] Thread [name="exchange-worker-#344%CustomerCC%", id=391, state=TIMED_WAITING, blockCnt=1, waitCnt=2049782]
    Lock [object=java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject@6bf9f3a4, ownerName=null, ownerId=-1]

[2020-07-03T18:17:11,620][ERROR][sys-stripe-36-#37%CustomerCC%][] Critical system error detected. Will be handled accordingly to configured handler [hnd=StopNodeOrHaltFailureHandler [tryStop=false, timeout=0, super=AbstractFailureHandler [ignoredFailureTypes=[SYSTEM_WORKER_BLOCKED, SYSTEM_CRITICAL_OPERATION_TIMEOUT]]], failureCtx=FailureContext [type=SYSTEM_WORKER_BLOCKED, err=class o.a.i.IgniteException: GridWorker [name=partition-exchanger, igniteInstanceName=CustomerCC, finished=false, heartbeatTs=1593780431612]]]
org.apache.ignite.IgniteException: GridWorker [name=partition-exchanger, igniteInstanceName=CustomerCC, finished=false, heartbeatTs=1593780431612]
    at org.apache.ignite.internal.IgnitionEx$IgniteNamedInstance$2.apply(IgnitionEx.java:1831) [ignite-core-2.7.6.jar:2.7.6]
    at org.apache.ignite.internal.IgnitionEx$IgniteNamedInstance$2.apply(IgnitionEx.java:1826) [ignite-core-2.7.6.jar:2.7.6]
    at org.apache.ignite.internal.worker.WorkersRegistry.onIdle(WorkersRegistry.java:233) [ignite-core-2.7.6.jar:2.7.6]
    at org.apache.ignite.internal.util.worker.GridWorker.onIdle(GridWorker.java:297) [ignite-core-2.7.6.jar:2.7.6]
    at org.apache.ignite.internal.util.StripedExecutor$Stripe.body(StripedExecutor.java:513) [ignite-core-2.7.6.jar:2.7.6]
    at org.apache.ignite.internal.util.worker.GridWorker.run(GridWorker.java:120) [ignite-core-2.7.6.jar:2.7.6]
    at java.lang.Thread.run(Thread.java:748) [?:1.8.0_151]
[2020-07-03T18:17:11,625][WARN ][sys-stripe-36-#37%CustomerCC%][FailureProcessor] No deadlocked threads detected.
[2020-07-03T18:17:21,790][INFO ][tcp-disco-sock-reader-#201%CustomerCC%][TcpDiscoverySpi] Finished serving remote node connection [rmtAddr=/xx.xx.xx.xx:46416, rmtPort=46416
[2020-07-03T18:17:21,793][WARN ][jvm-pause-detector-worker][IgniteKernal%CustomerCC] Possible too long JVM pause: 10133 milliseconds.
    [2020-07-03T18:17:21,794][WARN ][grid-nio-worker-tcp-comm-31-#295%CustomerCC%][TcpCommunicationSpi] Communication SPI session write timed out (consider increasing 'socketWriteTimeout' configuration property) [remoteAddr=/xx.xx.xx.xx:11764, writeTimeout=2000]
[2020-07-03T18:17:21,794][WARN ][grid-nio-worker-tcp-comm-57-#321%CustomerCC%][TcpCommunicationSpi] Communication SPI session write timed out (consider increasing 'socketWriteTimeout' configuration property) [remoteAddr=/xx.xx.xx.xx:38500, writeTimeout=2000]
[2020-07-03T18:17:21,794][WARN ][grid-nio-worker-tcp-comm-5-#269%CustomerCC%][TcpCommunicationSpi] Communication SPI session write timed out (consider increasing 'socketWriteTimeout' configuration property) [remoteAddr=/xx.xx.xx.xx:41442, writeTimeout=2000]
[2020-07-03T18:17:21,794][WARN ][grid-nio-worker-tcp-comm-53-#317%CustomerCC%][TcpCommunicationSpi] Communication SPI session write timed out (consider increasing 'socketWriteTimeout' configuration property) [remoteAddr=/xx.xx.xx.xx:44178, writeTimeout=2000]
[2020-07-03T18:17:21,794][WARN ][grid-nio-worker-tcp-comm-59-#323%CustomerCC%][TcpCommunicationSpi] Communication SPI session write timed out (consider increasing 'socketWriteTimeout' configuration property) [remoteAddr=/xx.xx.xx.xx:11884, writeTimeout=2000]
[2020-07-03T18:17:21,795][WARN ][grid-nio-worker-tcp-comm-59-#323%CustomerCC%][TcpCommunicationSpi] Communication SPI session write timed out (consider increasing 'socketWriteTimeout' configuration property) [remoteAddr=/xx.xx.xx.xx:39044, writeTimeout=2000]
[2020-07-03T18:17:21,795][WARN ][grid-nio-worker-tcp-comm-53-#317%CustomerCC%][TcpCommunicationSpi] Communication SPI session write timed out (consider increasing 'socketWriteTimeout' configuration property) [remoteAddr=/xx.xx.xx.xx:48756, writeTimeout=2000]
[2020-07-03T18:17:21,795][WARN ][grid-nio-worker-tcp-comm-59-#323%CustomerCC%][TcpCommunicationSpi] Communication SPI session write timed out (consider increasing 'socketWriteTimeout' configuration property) [remoteAddr=/xx.xx.xx.xx:42190, writeTimeout=2000]





Thanks and Regards,
Kamlesh Joshi

"Confidentiality Warning: This message and any attachments are intended only for the use of the intended recipient(s),
are confidential and may be privileged. If you are not the intended recipient, you are hereby notified that any
review, re-transmission, conversion to hard copy, copying, circulation or other use of this message and any attachments is
strictly prohibited. If you are not the intended recipient, please notify the sender immediately by return email
and delete this message and any attachments from your system.

Virus Warning: Although the company has taken reasonable precautions to ensure no viruses are present in this email. 
The company cannot accept responsibility for any loss or damage arising from the use of this email or attachment."

Re: [External]Re: Ignite cluster became unresponsive

Posted by Ilya Kasnacheev <il...@gmail.com>.
Hello!

I recommend setting socketWriteTimeout somewhat lower than the failure
detection timeout, but longer than any of your expected GC pauses. 30s is OK.

Regards,
-- 
Ilya Kasnacheev
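
The rule of thumb in this reply (a socket write timeout longer than the worst expected GC pause, but lower than the failure detection timeout) can be written down as a small check. This is only a sketch using the plain JDK: the class TimeoutCheck and its method are hypothetical helpers, not part of Ignite, and the sample values (2000 ms default write timeout, 10133 ms pause, 480000 ms failure detection timeout, 30 s suggestion) are the ones quoted in this thread.

```java
public class TimeoutCheck {
    /**
     * Hypothetical helper, not an Ignite API. All values are in milliseconds.
     * Returns true when the socket write timeout is longer than the worst
     * expected GC pause and lower than the failure detection timeout.
     */
    public static boolean isConsistent(long socketWriteTimeoutMs,
                                       long failureDetectionTimeoutMs,
                                       long maxGcPauseMs) {
        return socketWriteTimeoutMs > maxGcPauseMs
            && socketWriteTimeoutMs < failureDetectionTimeoutMs;
    }

    public static void main(String[] args) {
        long failureDetectionTimeoutMs = 480_000L; // value reported in this thread
        long maxGcPauseMs = 10_133L;               // longest pause seen in the posted log

        // Write timeout of 2000 ms (as logged): shorter than the observed pause.
        System.out.println(isConsistent(2_000L, failureDetectionTimeoutMs, maxGcPauseMs));

        // Suggested 30 s write timeout: above the pause, below failure detection.
        System.out.println(isConsistent(30_000L, failureDetectionTimeoutMs, maxGcPauseMs));
    }
}
```

The values themselves would still be applied through the socketWriteTimeout property of the communication SPI and the failure detection timeout settings discussed above.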


Sun, 12 Jul 2020 at 14:03, Kamlesh Joshi <Ka...@ril.com>:

> Thanks for the findings Ilya.
>
>
>
> So should we set *socketWriteTimeout* to the same value as the failure
> detection timeout, on both the client and server side?
>
>
>
>
>
> *Thanks and Regards,*
>
> *Kamlesh Joshi*
>
>
>
> *From:* Ilya Kasnacheev <il...@gmail.com>
> *Sent:* 10 July 2020 19:48
> *To:* user@ignite.apache.org
> *Subject:* Re: [External]Re: Ignite cluster became unresponsive
>
>
>
> The e-mail below is from an external source. Please do not open
> attachments or click links from an unknown or suspicious origin.
>
> Hello!
>
>
>
> It seems that communication connections were closed after the GC pause, and
> you were then left with half-open connections. It is recommended to keep
> socketWriteTimeout and failure detection timeout in relative sync.
>
>
>
> The default socketWriteTimeout on TcpCommunicationSpi is very low while your
> failure detection timeout is rather high, leading to this issue.
>
>
>
> It is also possible that client nodes can connect to a server node but not
> vice versa, leading to failure of opening connections once they are closed:
>
>
>
> Thread [name="sys-stripe-12-#13%EDIFCustomerCC%", id=45, state=RUNNABLE,
> blockCnt=851, waitCnt=27526057]
>         at sun.nio.ch.Net.poll(Native Method)
>         at sun.nio.ch.SocketChannelImpl.poll(SocketChannelImpl.java:954)
>         at sun.nio.ch.SocketAdaptor.connect(SocketAdaptor.java:110)
>         at
> o.a.i.spi.communication.tcp.TcpCommunicationSpi.createTcpClient(TcpCommunicationSpi.java:3299)
>         at
> o.a.i.spi.communication.tcp.TcpCommunicationSpi.createNioClient(TcpCommunicationSpi.java:2987)
>         at
> o.a.i.spi.communication.tcp.TcpCommunicationSpi.reserveClient(TcpCommunicationSpi.java:2870)
>         at
> o.a.i.spi.communication.tcp.TcpCommunicationSpi.sendMessage0(TcpCommunicationSpi.java:2713)
>         at
> o.a.i.spi.communication.tcp.TcpCommunicationSpi.sendMessage(TcpCommunicationSpi.java:2672)
>
>
>
> Regards,
>
> --
>
> Ilya Kasnacheev
>
>
>
>
>
> Fri, 10 Jul 2020 at 16:32, Kamlesh Joshi <Ka...@ril.com>:
>
> Hi Ilya,
>
>
>
> Please find attached the entire node logs, which contain a thread dump as
> well. Let us know if you have any findings.
>
>
>
> *Thanks and Regards,*
>
> *Kamlesh Joshi*
>
>
>
> *From:* Ilya Kasnacheev <il...@gmail.com>
> *Sent:* 10 July 2020 17:51
> *To:* user@ignite.apache.org
> *Subject:* Re: [External]Re: Ignite cluster became unresponsive
>
>
>
>
> Hello!
>
>
>
> Can you provide full thread dump (jstack) after you see these messages?
>
>
>
> Regards,
>
> --
>
> Ilya Kasnacheev
>
>
>
>
>
> Wed, 8 Jul 2020 at 15:57, Kamlesh Joshi <Ka...@ril.com>:
>
> Hi Stephen/Team,
>
>
>
> Did you get a chance to look into this?
>
>
>
> *Thanks and Regards,*
>
> *Kamlesh Joshi*
>
>
>
> *From:* Kamlesh Joshi
> *Sent:* 06 July 2020 14:50
> *To:* user@ignite.apache.org
> *Subject:* RE: [External]Re: Ignite cluster became unresponsive
>
>
>
> Hi Stephen,
>
>
>
> We started our node with the JVM parameters below. We have also increased
> the *failureDetectionTimeout*, *clientFailureDetectionTimeout* and
> *networkTimeout* settings to 480000 ms.
>
>
>
> *-XX:+AggressiveOpts -XX:+AlwaysPreTouch -XX:+UseG1GC
> -XX:+ScavengeBeforeFullGC -XX:+DisableExplicitGC
> -XX:+UnlockCommercialFeatures -Djava.net.preferIPv4Stack=true
> -DIGNITE_LONG_OPERATIONS_DUMP_TIMEOUT=600000
> -DIGNITE_THREAD_DUMP_ON_EXCHANGE_TIMEOUT=true -Dfile.encoding=UTF-8
> -DIGNITE_QUIET=false*
>
>
>
> Is there anything else that we have to tune?
>
>
>
> Also, I think the JVM pause was introduced as a result of the error that we
> encountered, right? Correct me if I am wrong.
>
>
>
> *Thanks and Regards,*
>
> *Kamlesh Joshi*
>
>
>
> *From:* Stephen Darlington <st...@gridgain.com>
> *Sent:* 06 July 2020 14:09
> *To:* user <us...@ignite.apache.org>
> *Subject:* [External]Re: Ignite cluster became unresponsive
>
>
>
>
> There are a few issues here (the blocked thread, the communication error),
> but possibly the key one is the JVM pause:
>
>
>
> *[2020-07-03T18:17:21,793][WARN
> ][jvm-pause-detector-worker][IgniteKernal%CustomerCC] Possible too long JVM
> pause: 10133 milliseconds.*
>
>
>
> This is usually due to garbage collection, but there are a number of other
> possibilities such as slow I/O. Suggest you start with the recommendations
> on the GC tuning documentation page:
> https://apacheignite.readme.io/docs/jvm-and-system-tuning
>
>
>
> Regards,
>
> Stephen

RE: [External]Re: Ignite cluster became unresponsive

Posted by Kamlesh Joshi <Ka...@ril.com>.
Thanks for the findings Ilya.

So should we set socketWriteTimeout to the same value as the failure detection timeout, on both the client and server side?


Thanks and Regards,
Kamlesh Joshi

From: Ilya Kasnacheev <il...@gmail.com>
Sent: 10 July 2020 19:48
To: user@ignite.apache.org
Subject: Re: [External]Re: Ignite cluster became unresponsive


Hello!

It seems that communication connections were closed after the GC pause, and you were then left with half-open connections. It is recommended to keep socketWriteTimeout and failure detection timeout in relative sync.

The default socketWriteTimeout on TcpCommunicationSpi is very low while your failure detection timeout is rather high, leading to this issue.

It is also possible that client nodes can connect to a server node but not vice versa, leading to failure of opening connections once they are closed:

Thread [name="sys-stripe-12-#13%EDIFCustomerCC%", id=45, state=RUNNABLE, blockCnt=851, waitCnt=27526057]
        at sun.nio.ch.Net.poll(Native Method)
        at sun.nio.ch.SocketChannelImpl.poll(SocketChannelImpl.java:954)
        at sun.nio.ch.SocketAdaptor.connect(SocketAdaptor.java:110)
        at o.a.i.spi.communication.tcp.TcpCommunicationSpi.createTcpClient(TcpCommunicationSpi.java:3299)
        at o.a.i.spi.communication.tcp.TcpCommunicationSpi.createNioClient(TcpCommunicationSpi.java:2987)
        at o.a.i.spi.communication.tcp.TcpCommunicationSpi.reserveClient(TcpCommunicationSpi.java:2870)
        at o.a.i.spi.communication.tcp.TcpCommunicationSpi.sendMessage0(TcpCommunicationSpi.java:2713)
        at o.a.i.spi.communication.tcp.TcpCommunicationSpi.sendMessage(TcpCommunicationSpi.java:2672)

Regards,
--
Ilya Kasnacheev


Fri, 10 Jul 2020 at 16:32, Kamlesh Joshi <Ka...@ril.com>:
Hi Ilya,

Please find attached the entire node logs, which contain a thread dump as well. Let us know if you have any findings.

Thanks and Regards,
Kamlesh Joshi

From: Ilya Kasnacheev <il...@gmail.com>
Sent: 10 July 2020 17:51
To: user@ignite.apache.org
Subject: Re: [External]Re: Ignite cluster became unresponsive


Hello!

Can you provide full thread dump (jstack) after you see these messages?

Regards,
--
Ilya Kasnacheev


Wed, 8 Jul 2020 at 15:57, Kamlesh Joshi <Ka...@ril.com>:
Hi Stephen/Team,

Did you get a chance to look into this?

Thanks and Regards,
Kamlesh Joshi

From: Kamlesh Joshi
Sent: 06 July 2020 14:50
To: user@ignite.apache.org
Subject: RE: [External]Re: Ignite cluster became unresponsive

Hi Stephen,

We started our node with the JVM parameters below. We have also increased the failureDetectionTimeout, clientFailureDetectionTimeout and networkTimeout settings to 480000 ms.

-XX:+AggressiveOpts -XX:+AlwaysPreTouch -XX:+UseG1GC -XX:+ScavengeBeforeFullGC -XX:+DisableExplicitGC -XX:+UnlockCommercialFeatures -Djava.net.preferIPv4Stack=true -DIGNITE_LONG_OPERATIONS_DUMP_TIMEOUT=600000 -DIGNITE_THREAD_DUMP_ON_EXCHANGE_TIMEOUT=true -Dfile.encoding=UTF-8 -DIGNITE_QUIET=false

Is there anything else that we have to tune?

Also, I think the JVM pause was introduced as a result of the error that we encountered, right? Correct me if I am wrong.

Thanks and Regards,
Kamlesh Joshi

From: Stephen Darlington <st...@gridgain.com>
Sent: 06 July 2020 14:09
To: user <us...@ignite.apache.org>
Subject: [External]Re: Ignite cluster became unresponsive


There are a few issues here (the blocked thread, the communication error), but possibly the key one is the JVM pause:

[2020-07-03T18:17:21,793][WARN ][jvm-pause-detector-worker][IgniteKernal%CustomerCC] Possible too long JVM pause: 10133 milliseconds.

This is usually due to garbage collection, but there are a number of other possibilities such as slow I/O. Suggest you start with the recommendations on the GC tuning documentation page: https://apacheignite.readme.io/docs/jvm-and-system-tuning

Regards,
Stephen
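
The pause warning quoted in this message can be pulled out of a node log mechanically, which may help when picking timeouts that need to exceed the longest pause. The following is a rough sketch using only the JDK; the class JvmPauseScanner is a hypothetical helper (not an Ignite tool), and the pattern assumes the message wording is exactly as it appears in the log excerpt in this thread.

```java
import java.util.Arrays;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class JvmPauseScanner {
    // Matches the warning text as it appears in the log excerpt in this thread.
    private static final Pattern PAUSE =
        Pattern.compile("Possible too long JVM pause: (\\d+) milliseconds");

    /** Returns the longest reported JVM pause in milliseconds, or 0 if none is found. */
    public static long maxPauseMs(List<String> logLines) {
        long max = 0L;
        for (String line : logLines) {
            Matcher m = PAUSE.matcher(line);
            if (m.find()) {
                max = Math.max(max, Long.parseLong(m.group(1)));
            }
        }
        return max;
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList(
            "[2020-07-03T18:17:11,625][WARN ][sys-stripe-36-#37%CustomerCC%][FailureProcessor] No deadlocked threads detected.",
            "[2020-07-03T18:17:21,793][WARN ][jvm-pause-detector-worker][IgniteKernal%CustomerCC] Possible too long JVM pause: 10133 milliseconds."
        );
        System.out.println("Longest JVM pause (ms): " + maxPauseMs(lines));
    }
}
```

In practice the lines would be read from the node's log file rather than hard-coded.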


Re: [External]Re: Ignite cluster became unresponsive

Posted by Ilya Kasnacheev <il...@gmail.com>.
Hello!

It seems that communication connections were closed after the GC pause, and
you were then left with half-open connections. It is recommended to keep
socketWriteTimeout and failure detection timeout in relative sync.

The default socketWriteTimeout on TcpCommunicationSpi is very low while your
failure detection timeout is rather high, leading to this issue.

It is also possible that client nodes can connect to a server node but not
vice versa, so connections cannot be reopened once they have been closed:

Thread [name="sys-stripe-12-#13%EDIFCustomerCC%", id=45, state=RUNNABLE, blockCnt=851, waitCnt=27526057]
        at sun.nio.ch.Net.poll(Native Method)
        at sun.nio.ch.SocketChannelImpl.poll(SocketChannelImpl.java:954)
        at sun.nio.ch.SocketAdaptor.connect(SocketAdaptor.java:110)
        at o.a.i.spi.communication.tcp.TcpCommunicationSpi.createTcpClient(TcpCommunicationSpi.java:3299)
        at o.a.i.spi.communication.tcp.TcpCommunicationSpi.createNioClient(TcpCommunicationSpi.java:2987)
        at o.a.i.spi.communication.tcp.TcpCommunicationSpi.reserveClient(TcpCommunicationSpi.java:2870)
        at o.a.i.spi.communication.tcp.TcpCommunicationSpi.sendMessage0(TcpCommunicationSpi.java:2713)
        at o.a.i.spi.communication.tcp.TcpCommunicationSpi.sendMessage(TcpCommunicationSpi.java:2672)
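
One way to check the reverse direction is to run a plain TCP connect from the server host to the client's communication port. The sketch below is only an illustration: the class name, the default host and the 2000 ms timeout are assumptions, and 47100 is used because it is the default TcpCommunicationSpi port.

```java
import java.io.IOException;
import java.net.InetSocketAddress;
import java.net.Socket;

public class ReverseConnectCheck {
    /** Returns true if a TCP connection to host:port is established within timeoutMs. */
    static boolean canConnect(String host, int port, int timeoutMs) {
        try (Socket sock = new Socket()) {
            sock.connect(new InetSocketAddress(host, port), timeoutMs);
            return true;
        } catch (IOException e) {
            // Refused, unreachable or timed out: treat all of these as "cannot connect".
            return false;
        }
    }

    public static void main(String[] args) {
        // Hypothetical defaults; pass the client's address and communication port as arguments.
        String host = args.length > 0 ? args[0] : "127.0.0.1";
        int port = args.length > 1 ? Integer.parseInt(args[1]) : 47100;
        System.out.println(host + ":" + port + " reachable: " + canConnect(host, port, 2000));
    }
}
```

If this fails from the server side while the client can reach the server, a firewall or NAT rule between the two is a likely suspect.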

Regards,
-- 
Ilya Kasnacheev


On Fri, 10 Jul 2020 at 16:32, Kamlesh Joshi <Ka...@ril.com> wrote:

> Hi Ilya,
>
> PFA the entire node logs, which contain a thread dump as well. Let us know
> if you have any findings.

RE: [External]Re: Ignite cluster became unresponsive

Posted by Kamlesh Joshi <Ka...@ril.com>.
Hi Ilya,

PFA the entire node logs, which contain a thread dump as well. Let us know if you have any findings.

Thanks and Regards,
Kamlesh Joshi

From: Ilya Kasnacheev <il...@gmail.com>
Sent: 10 July 2020 17:51
To: user@ignite.apache.org
Subject: Re: [External]Re: Ignite cluster became unresponsive


Hello!

Can you provide a full thread dump (jstack) taken after you see these messages?

Regards,
--
Ilya Kasnacheev




Re: [External]Re: Ignite cluster became unresponsive

Posted by Ilya Kasnacheev <il...@gmail.com>.
Hello!

Can you provide a full thread dump (jstack) taken after you see these messages?
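
The usual way to take the dump is to run jstack -l <pid> against the node process. If attaching to the process is not possible, a dump can also be produced from inside the JVM. The following is a minimal sketch using the standard java.lang.management API; the class name and output format are assumptions for illustration, not what jstack prints:

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.lang.management.ThreadMXBean;

public class ThreadDumpSketch {
    /** Builds a text dump of all live threads in the current JVM. */
    static String dumpAllThreads() {
        ThreadMXBean mx = ManagementFactory.getThreadMXBean();
        // Request lock details only where the JVM supports them.
        ThreadInfo[] infos = mx.dumpAllThreads(
            mx.isObjectMonitorUsageSupported(), mx.isSynchronizerUsageSupported());

        StringBuilder sb = new StringBuilder();
        for (ThreadInfo ti : infos) {
            sb.append('"').append(ti.getThreadName()).append("\" id=").append(ti.getThreadId())
              .append(" state=").append(ti.getThreadState()).append('\n');
            if (ti.getLockName() != null)
                sb.append("    waiting on ").append(ti.getLockName()).append('\n');
            // Print the full stack; ThreadInfo.toString() would truncate it.
            for (StackTraceElement el : ti.getStackTrace())
                sb.append("        at ").append(el).append('\n');
            sb.append('\n');
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.print(dumpAllThreads());
    }
}
```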

Regards,
-- 
Ilya Kasnacheev


On Wed, 8 Jul 2020 at 15:57, Kamlesh Joshi <Ka...@ril.com> wrote:

> Hi Stephen/Team,
>
> Did you get a chance to look into this?

RE: [External]Re: Ignite cluster became unresponsive

Posted by Kamlesh Joshi <Ka...@ril.com>.
Hi Stephen/Team,

Did you get a chance to look into this?

Thanks and Regards,
Kamlesh Joshi

From: Kamlesh Joshi
Sent: 06 July 2020 14:50
To: user@ignite.apache.org
Subject: RE: [External]Re: Ignite cluster became unresponsive

Hi Stephen,

We have started our node with the JVM parameters below. We have also increased failureDetectionTimeout, clientFailureDetectionTimeout and networkTimeout to 480000 (milliseconds).

-XX:+AggressiveOpts -XX:+AlwaysPreTouch -XX:+UseG1GC -XX:+ScavengeBeforeFullGC -XX:+DisableExplicitGC -XX:+UnlockCommercialFeatures -Djava.net.preferIPv4Stack=true -DIGNITE_LONG_OPERATIONS_DUMP_TIMEOUT=600000 -DIGNITE_THREAD_DUMP_ON_EXCHANGE_TIMEOUT=true -Dfile.encoding=UTF-8 -DIGNITE_QUIET=false

Is there anything else that we need to tune?

Also, I think the JVM pause was a result of the error that we encountered, right? Correct me if I am wrong.

Thanks and Regards,
Kamlesh Joshi

From: Stephen Darlington <st...@gridgain.com>
Sent: 06 July 2020 14:09
To: user <us...@ignite.apache.org>
Subject: [External]Re: Ignite cluster became unresponsive


There are a few issues here (the blocked thread, the communication error), but possibly the key one is the JVM pause:

[2020-07-03T18:17:21,793][WARN ][jvm-pause-detector-worker][IgniteKernal%CustomerCC] Possible too long JVM pause: 10133 milliseconds.

This is usually due to garbage collection, but there are a number of other possibilities such as slow I/O. Suggest you start with the recommendations on the GC tuning documentation page: https://apacheignite.readme.io/docs/jvm-and-system-tuning
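
As a quick first check on whether garbage collection accounts for such pauses, the JVM's own GC counters can be read through the standard management beans. This is only a sketch under the assumption that cumulative counters are enough for a first look; GC logs give per-pause detail. The class name is hypothetical.

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;

public class GcTimeSketch {
    /** Sums the approximate accumulated collection time (ms) over all collectors. */
    static long totalGcMillis() {
        long total = 0;
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            long t = gc.getCollectionTime(); // -1 when the collector does not report it
            if (t > 0)
                total += t;
        }
        return total;
    }

    public static void main(String[] args) {
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            System.out.println(gc.getName() + ": collections=" + gc.getCollectionCount()
                + ", timeMs=" + gc.getCollectionTime());
        }
        System.out.println("total GC time (ms): " + totalGcMillis());
    }
}
```

Sampling these values before and after a reported pause shows whether the collectors were busy during that window; if they were not, slow I/O or other causes become more likely.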

Regards,
Stephen

On 4 Jul 2020, at 12:44, Kamlesh Joshi <Ka...@ril.com> wrote:

Hi Team,

We have encountered the following defect in our PROD environment, after which all traffic was halted for around 10 minutes. We recently upgraded our cluster from Ignite 2.6.0 to 2.7.6.
Is this related to any existing open defect in this version? Has anyone observed the same defect earlier?

Any help or pointers around this will be appreciated.


[2020-07-03T18:17:11,613][ERROR][sys-stripe-36-#37%CustomerCC%][G] Blocked system-critical thread has been detected. This can lead to cluster-wide undefined behaviour
[threadName=partition-exchanger, blockedFor=480s]
[2020-07-03T18:17:11,613][WARN ][sys-stripe-36-#37%CustomerCC%][G] Thread [name="exchange-worker-#344%CustomerCC%", id=391, state=TIMED_WAITING, blockCnt=1, waitCnt=2049782]
    Lock [object=java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject@6bf9f3a4, ownerName=null, ownerId=-1]

[2020-07-03T18:17:11,620][ERROR][sys-stripe-36-#37%CustomerCC%][] Critical system error detected. Will be handled accordingly to configured handler [hnd=StopNodeOrHaltFailureHandler [tryStop=false, timeout=0, super=AbstractFailureHandler [ignoredFailureTypes=[SYSTEM_WORKER_BLOCKED, SYSTEM_CRITICAL_OPERATION_TIMEOUT]]], failureCtx=FailureContext [type=SYSTEM_WORKER_BLOCKED, err=class o.a.i.IgniteException: GridWorker [name=partition-exchanger, igniteInstanceName=CustomerCC, finished=false, heartbeatTs=1593780431612]]]
org.apache.ignite.IgniteException: GridWorker [name=partition-exchanger, igniteInstanceName=CustomerCC, finished=false, heartbeatTs=1593780431612]
    at org.apache.ignite.internal.IgnitionEx$IgniteNamedInstance$2.apply(IgnitionEx.java:1831) [ignite-core-2.7.6.jar:2.7.6]
    at org.apache.ignite.internal.IgnitionEx$IgniteNamedInstance$2.apply(IgnitionEx.java:1826) [ignite-core-2.7.6.jar:2.7.6]
    at org.apache.ignite.internal.worker.WorkersRegistry.onIdle(WorkersRegistry.java:233) [ignite-core-2.7.6.jar:2.7.6]
    at org.apache.ignite.internal.util.worker.GridWorker.onIdle(GridWorker.java:297) [ignite-core-2.7.6.jar:2.7.6]
    at org.apache.ignite.internal.util.StripedExecutor$Stripe.body(StripedExecutor.java:513) [ignite-core-2.7.6.jar:2.7.6]
    at org.apache.ignite.internal.util.worker.GridWorker.run(GridWorker.java:120) [ignite-core-2.7.6.jar:2.7.6]
    at java.lang.Thread.run(Thread.java:748) [?:1.8.0_151]
[2020-07-03T18:17:11,625][WARN ][sys-stripe-36-#37%CustomerCC%][FailureProcessor] No deadlocked threads detected.
[2020-07-03T18:17:21,790][INFO ][tcp-disco-sock-reader-#201%CustomerCC%][TcpDiscoverySpi] Finished serving remote node connection [rmtAddr=/xx.xx.xx.xx:46416, rmtPort=46416
[2020-07-03T18:17:21,793][WARN ][jvm-pause-detector-worker][IgniteKernal%CustomerCC] Possible too long JVM pause: 10133 milliseconds.
    [2020-07-03T18:17:21,794][WARN ][grid-nio-worker-tcp-comm-31-#295%CustomerCC%][TcpCommunicationSpi] Communication SPI session write timed out (consider increasing 'socketWriteTimeout' configuration property) [remoteAddr=/xx.xx.xx.xx:11764, writeTimeout=2000]
[2020-07-03T18:17:21,794][WARN ][grid-nio-worker-tcp-comm-57-#321%CustomerCC%][TcpCommunicationSpi] Communication SPI session write timed out (consider increasing 'socketWriteTimeout' configuration property) [remoteAddr=/xx.xx.xx.xx:38500, writeTimeout=2000]
[2020-07-03T18:17:21,794][WARN ][grid-nio-worker-tcp-comm-5-#269%CustomerCC%][TcpCommunicationSpi] Communication SPI session write timed out (consider increasing 'socketWriteTimeout' configuration property) [remoteAddr=/xx.xx.xx.xx:41442, writeTimeout=2000]
[2020-07-03T18:17:21,794][WARN ][grid-nio-worker-tcp-comm-53-#317%CustomerCC%][TcpCommunicationSpi] Communication SPI session write timed out (consider increasing 'socketWriteTimeout' configuration property) [remoteAddr=/xx.xx.xx.xx:44178, writeTimeout=2000]
[2020-07-03T18:17:21,794][WARN ][grid-nio-worker-tcp-comm-59-#323%CustomerCC%][TcpCommunicationSpi] Communication SPI session write timed out (consider increasing 'socketWriteTimeout' configuration property) [remoteAddr=/xx.xx.xx.xx:11884, writeTimeout=2000]
[2020-07-03T18:17:21,795][WARN ][grid-nio-worker-tcp-comm-59-#323%CustomerCC%][TcpCommunicationSpi] Communication SPI session write timed out (consider increasing 'socketWriteTimeout' configuration property) [remoteAddr=/xx.xx.xx.xx:39044, writeTimeout=2000]
[2020-07-03T18:17:21,795][WARN ][grid-nio-worker-tcp-comm-53-#317%CustomerCC%][TcpCommunicationSpi] Communication SPI session write timed out (consider increasing 'socketWriteTimeout' configuration property) [remoteAddr=/xx.xx.xx.xx:48756, writeTimeout=2000]
[2020-07-03T18:17:21,795][WARN ][grid-nio-worker-tcp-comm-59-#323%CustomerCC%][TcpCommunicationSpi] Communication SPI session write timed out (consider increasing 'socketWriteTimeout' configuration property) [remoteAddr=/xx.xx.xx.xx:42190, writeTimeout=2000]





Thanks and Regards,
Kamlesh Joshi


"Confidentiality Warning: This message and any attachments are intended only for the use of the intended recipient(s), are confidential and may be privileged. If you are not the intended recipient, you are hereby notified that any review, re-transmission, conversion to hard copy, copying, circulation or other use of this message and any attachments is strictly prohibited. If you are not the intended recipient, please notify the sender immediately by return email and delete this message and any attachments from your system.
Virus Warning: Although the company has taken reasonable precautions to ensure no viruses are present in this email. The company cannot accept responsibility for any loss or damage arising from the use of this email or attachment."

"Confidentiality Warning: This message and any attachments are intended only for the use of the intended recipient(s). 
are confidential and may be privileged. If you are not the intended recipient. you are hereby notified that any 
review. re-transmission. conversion to hard copy. copying. circulation or other use of this message and any attachments is 
strictly prohibited. If you are not the intended recipient. please notify the sender immediately by return email. 
and delete this message and any attachments from your system.

Virus Warning: Although the company has taken reasonable precautions to ensure no viruses are present in this email. 
The company cannot accept responsibility for any loss or damage arising from the use of this email or attachment."

RE: [External]Re: Ignite cluster became unresponsive

Posted by Kamlesh Joshi <Ka...@ril.com>.
Hi Stephen,

We have started our node with the JVM parameters below. We have also increased the failureDetectionTimeout, clientFailureDetectionTimeout, and networkTimeout settings to 480000.

-XX:+AggressiveOpts -XX:+AlwaysPreTouch -XX:+UseG1GC -XX:+ScavengeBeforeFullGC -XX:+DisableExplicitGC -XX:+UnlockCommercialFeatures -Djava.net.preferIPv4Stack=true -DIGNITE_LONG_OPERATIONS_DUMP_TIMEOUT=600000 -DIGNITE_THREAD_DUMP_ON_EXCHANGE_TIMEOUT=true -Dfile.encoding=UTF-8 -DIGNITE_QUIET=false
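
One way to confirm that flags such as the ones above actually reached the running JVM, and to see how much time the collectors have spent so far, is the standard java.lang.management API. The following is only a minimal sketch: the class name JvmFlagsCheck and the helper hasFlag are hypothetical names chosen for illustration, and it reports on whichever JVM runs it, so it would have to be executed inside the node's JVM to say anything about the node.

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;
import java.util.List;

// Hypothetical helper class, not part of Ignite.
class JvmFlagsCheck {
    /** Returns true if the exact flag string was passed on the JVM command line. */
    static boolean hasFlag(List<String> jvmArgs, String flag) {
        return jvmArgs.contains(flag);
    }

    public static void main(String[] args) {
        // Arguments the JVM was started with (excludes arguments to main).
        List<String> jvmArgs = ManagementFactory.getRuntimeMXBean().getInputArguments();
        System.out.println("G1 requested: " + hasFlag(jvmArgs, "-XX:+UseG1GC"));

        // Cumulative collection count and time per collector since JVM start.
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            System.out.printf("%s: count=%d, totalTimeMs=%d%n",
                gc.getName(), gc.getCollectionCount(), gc.getCollectionTime());
        }
    }
}
```

If the cumulative collection time jumps by roughly the length of a reported pause, garbage collection is a likely cause; if it does not, other causes are worth checking.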

Is there anything else that we need to tune?

Also, I think the JVM pause was introduced as a result of the error that we encountered, right? Correct me if I am wrong.

Thanks and Regards,
Kamlesh Joshi

From: Stephen Darlington <st...@gridgain.com>
Sent: 06 July 2020 14:09
To: user <us...@ignite.apache.org>
Subject: [External]Re: Ignite cluster became unresponsive


There are a few issues here (the blocked thread, the communication error), but possibly the key one is the JVM pause:

[2020-07-03T18:17:21,793][WARN ][jvm-pause-detector-worker][IgniteKernal%CustomerCC] Possible too long JVM pause: 10133 milliseconds.

This is usually due to garbage collection, but there are a number of other possibilities, such as slow I/O. I suggest you start with the recommendations on the GC tuning documentation page: https://apacheignite.readme.io/docs/jvm-and-system-tuning

Regards,
Stephen



Re: Ignite cluster became unresponsive

Posted by Stephen Darlington <st...@gridgain.com>.
There are a few issues here (the blocked thread, the communication error), but possibly the key one is the JVM pause:

[2020-07-03T18:17:21,793][WARN ][jvm-pause-detector-worker][IgniteKernal%CustomerCC] Possible too long JVM pause: 10133 milliseconds.

This is usually due to garbage collection, but there are a number of other possibilities, such as slow I/O. I suggest you start with the recommendations on the GC tuning documentation page: https://apacheignite.readme.io/docs/jvm-and-system-tuning
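
To get a feel for how often such pauses occur and how long they last, the node log can be scanned for the warning quoted above. This is only a rough sketch: the class name JvmPauseScan is hypothetical, the log path is supplied by the caller, and the pattern assumes the message text is exactly as it appears in this thread.

```java
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class, not part of Ignite.
class JvmPauseScan {
    // Matches the pause warning text as it appears in the log excerpt in this thread.
    private static final Pattern PAUSE =
        Pattern.compile("Possible too long JVM pause: (\\d+) milliseconds");

    /** Extracts the pause duration in milliseconds from every matching log line. */
    static List<Long> pauses(List<String> logLines) {
        List<Long> out = new ArrayList<>();
        for (String line : logLines) {
            Matcher m = PAUSE.matcher(line);
            if (m.find())
                out.add(Long.parseLong(m.group(1)));
        }
        return out;
    }

    public static void main(String[] args) throws Exception {
        // The first argument is the path to the node log file.
        List<Long> found = pauses(Files.readAllLines(Paths.get(args[0])));
        long max = found.stream().mapToLong(Long::longValue).max().orElse(0);
        System.out.println("pauses=" + found.size() + ", maxMs=" + max);
    }
}
```

Comparing the timestamps of these warnings with GC log entries for the same period would show whether the pauses line up with collections or point to something else, such as slow I/O.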

Regards,
Stephen
