Posted to user@ignite.apache.org by Ray <ra...@cisco.com> on 2018/06/08 09:32:45 UTC

Node pause for no obvious reason

I set up a six-node Ignite cluster to test performance and stability.
Here's my setup.
    <bean id="grid.cfg" class="org.apache.ignite.configuration.IgniteConfiguration">
        <property name="segmentationPolicy" value="RESTART_JVM"/>
        <property name="peerClassLoadingEnabled" value="true"/>
        <property name="failureDetectionTimeout" value="60000"/>
        <property name="dataStorageConfiguration">
            <bean class="org.apache.ignite.configuration.DataStorageConfiguration">
                <property name="defaultDataRegionConfiguration">
                    <bean class="org.apache.ignite.configuration.DataRegionConfiguration">
                        <property name="name" value="default_Region"/>
                        <property name="initialSize" value="#{100L * 1024 * 1024 * 1024}"/>
                        <property name="maxSize" value="#{500L * 1024 * 1024 * 1024}"/>
                        <property name="persistenceEnabled" value="true"/>
                        <property name="checkpointPageBufferSize" value="#{2L * 1024 * 1024 * 1024}"/>
                    </bean>
                </property>
                <property name="walMode" value="BACKGROUND"/>
                <property name="walFlushFrequency" value="5000"/>
                <property name="checkpointFrequency" value="600000"/>
                <property name="writeThrottlingEnabled" value="true"/>

And I used this command to start the Ignite node.
./ignite.sh -J-Xmx32000m -J-Xms32000m -J-XX:+UseG1GC \
    -J-XX:+ScavengeBeforeFullGC -J-XX:+DisableExplicitGC -J-XX:+AlwaysPreTouch \
    -J-XX:+PrintGCDetails -J-XX:+PrintGCTimeStamps -J-XX:+PrintGCDateStamps \
    -J-XX:+PrintAdaptiveSizePolicy \
    -J-Xloggc:/ignitegc-$(date +%Y_%m_%d-%H_%M).log config/persistent-config.xml

One of the nodes just dropped from the topology. Here's the log for the last
three minutes before this node went down.
[08:39:58,982][INFO][grid-timeout-worker-#119][IgniteKernal]
Metrics for local node (to disable set 'metricsLogFrequency' to 0)
    ^-- Node [id=8333aa56, uptime=02:34:01.948]
    ^-- H/N/C [hosts=9, nodes=16, CPUs=552]
    ^-- CPU [cur=41%, avg=33.18%, GC=0%]
    ^-- PageMemory [pages=8912687]
    ^-- Heap [used=8942MB, free=72.05%, comm=32000MB]
    ^-- Non heap [used=70MB, free=95.35%, comm=73MB]
    ^-- Outbound messages queue [size=0]
    ^-- Public thread pool [active=0, idle=0, qSize=0]
    ^-- System thread pool [active=0, idle=6, qSize=0]
[08:40:51,945][INFO][db-checkpoint-thread-#178][GridCacheDatabaseSharedManager]
Checkpoint finished [cpId=77cf2fa2-2a9f-48ea-bdeb-dda81b15dac1,
pages=2050858, markPos=FileWALPointer [idx=2051, fileOff=38583904,
len=15981], walSegmentsCleared=0, markDuration=920ms, pagesWrite=12002ms,
fsync=965250ms, total=978172ms]
[08:40:53,086][INFO][db-checkpoint-thread-#178][GridCacheDatabaseSharedManager]
Checkpoint started [checkpointId=14d929ac-1b5c-4df2-a71f-002d5eb41f14,
startPtr=FileWALPointer [idx=2242, fileOff=65211837, len=15981],
checkpointLockWait=0ms, checkpointLockHoldTime=39ms,
walCpRecordFsyncDuration=720ms, pages=2110545, reason='timeout']
[08:40:57,793][INFO][data-streamer-stripe-1-#58][PageMemoryImpl] Throttling
is applied to page modifications [percentOfPartTime=0.22, markDirty=7192
pages/sec, checkpointWrite=2450 pages/sec, estIdealMarkDirty=139543
pages/sec, curDirty=0.00, maxDirty=0.17, avgParkTime=1732784 ns, pages:
(total=2110545, evicted=0, written=875069, synced=0, cpBufUsed=92,
cpBufTotal=518215)]
[08:40:58,991][INFO][grid-timeout-worker-#119][IgniteKernal]
Metrics for local node (to disable set 'metricsLogFrequency' to 0)
    ^-- Node [id=8333aa56, uptime=02:35:01.957]
    ^-- H/N/C [hosts=9, nodes=16, CPUs=552]
    ^-- CPU [cur=9.3%, avg=33%, GC=0%]
    ^-- PageMemory [pages=8920631]
    ^-- Heap [used=13262MB, free=58.55%, comm=32000MB]
    ^-- Non heap [used=70MB, free=95.34%, comm=73MB]
    ^-- Outbound messages queue [size=0]
    ^-- Public thread pool [active=0, idle=0, qSize=0]
    ^-- System thread pool [active=0, idle=6, qSize=0]
[08:41:29,050][WARNING][jvm-pause-detector-worker][] Possible too long JVM
pause: 22667 milliseconds.
[08:41:29,050][INFO][tcp-disco-sock-reader-#11][TcpDiscoverySpi] Finished
serving remote node connection [rmtAddr=/10.29.41.23:32815, rmtPort=32815
[08:41:29,052][INFO][tcp-disco-sock-reader-#13][TcpDiscoverySpi] Finished
serving remote node connection [rmtAddr=/10.29.41.25:46515, rmtPort=46515
[08:41:30,063][INFO][data-streamer-stripe-3-#60][PageMemoryImpl] Throttling
is applied to page modifications [percentOfPartTime=0.49, markDirty=26945
pages/sec, checkpointWrite=2612 pages/sec, estIdealMarkDirty=210815
pages/sec, curDirty=0.00, maxDirty=0.34, avgParkTime=1024456 ns, pages:
(total=2110545, evicted=0, written=1861330, synced=0, cpBufUsed=8657,
cpBufTotal=518215)]
[08:42:42,276][WARNING][jvm-pause-detector-worker][] Possible too long JVM
pause: 67967 milliseconds.
[08:42:42,277][INFO][tcp-disco-msg-worker-#3][TcpDiscoverySpi] Local node
seems to be disconnected from topology (failure detection timeout is
reached) [failureDetectionTimeout=60000, connCheckFreq=20000]
[08:42:42,280][INFO][tcp-disco-sock-reader-#10][TcpDiscoverySpi] Finished
serving remote node connection [rmtAddr=/10.29.42.49:36509, rmtPort=36509
[08:42:42,286][INFO][tcp-disco-srvr-#2][TcpDiscoverySpi] TCP discovery
accepted incoming connection [rmtAddr=/10.29.42.45, rmtPort=38712]
[08:42:42,286][INFO][tcp-disco-srvr-#2][TcpDiscoverySpi] TCP discovery
spawning a new thread for connection [rmtAddr=/10.29.42.45, rmtPort=38712]
[08:42:42,287][INFO][tcp-disco-sock-reader-#15][TcpDiscoverySpi] Started
serving remote node connection [rmtAddr=/10.29.42.45:38712, rmtPort=38712]
[08:42:42,289][WARNING][tcp-disco-msg-worker-#3][TcpDiscoverySpi] Node is
out of topology (probably, due to short-time network problems).
[08:42:42,290][INFO][tcp-disco-sock-reader-#15][TcpDiscoverySpi] Finished
serving remote node connection [rmtAddr=/10.29.42.45:38712, rmtPort=38712
[08:42:42,290][WARNING][disco-event-worker-#161][GridDiscoveryManager] Local
node SEGMENTED: TcpDiscoveryNode [id=8333aa56-8bf4-4558-a387-809b1d2e2e5b,
addrs=[10.29.42.44, 127.0.0.1], sockAddrs=[sap-datanode1/10.29.42.44:49500,
/127.0.0.1:49500], discPort=49500, order=1, intOrder=1,
lastExchangeTime=1528447362286, loc=true, ver=2.5.0#20180523-sha1:86e110c7,
isClient=false]
[08:42:42,294][SEVERE][tcp-disco-srvr-#2][] Critical system error detected.
Will be handled accordingly to configured handler [hnd=class
o.a.i.failure.StopNodeOrHaltFailureHandler, failureCtx=FailureContext
[type=SYSTEM_WORKER_TERMINATION, err=java.lang.IllegalStateException: Thread
tcp-disco-srvr-#2 is terminated unexpectedly.]]
java.lang.IllegalStateException: Thread tcp-disco-srvr-#2 is terminated
unexpectedly.
        at
org.apache.ignite.spi.discovery.tcp.ServerImpl$TcpServer.body(ServerImpl.java:5686)
        at
org.apache.ignite.spi.IgniteSpiThread.run(IgniteSpiThread.java:62)
[08:42:42,294][SEVERE][tcp-disco-srvr-#2][] JVM will be halted immediately
due to the failure: [failureCtx=FailureContext
[type=SYSTEM_WORKER_TERMINATION, err=java.lang.IllegalStateException: Thread
tcp-disco-srvr-#2 is terminated unexpectedly.]]

As you can see from the log, the JVM paused for 67967 milliseconds, so that's
the reason this node went down.
But I looked at the GC log; here's the last GC entry.

2018-06-08T08:38:56.654+0000: 9199.215: [GC pause (G1 Evacuation Pause)
(young) 9199.215: [G1Ergonomics (CSet Construction) start choosing CSet,
_pending_cards: 32828, predicted base time: 18.85 ms, remaining time: 181.15
ms, target pause time: 200.00 ms]
 9199.215: [G1Ergonomics (CSet Construction) add young regions to CSet,
eden: 2350 regions, survivors: 50 regions, predicted young region time:
79.31 ms]
 9199.215: [G1Ergonomics (CSet Construction) finish choosing CSet, eden:
2350 regions, survivors: 50 regions, old: 0 regions, predicted pause time:
98.16 ms, target pause time: 200.00 ms]
, 0.0417411 secs]
   [Parallel Time: 32.8 ms, GC Workers: 38]
      [GC Worker Start (ms): Min: 9199215.4, Avg: 9199215.5, Max: 9199215.7,
Diff: 0.3]
      [Ext Root Scanning (ms): Min: 0.4, Avg: 0.6, Max: 1.0, Diff: 0.6, Sum:
21.8]
      [Update RS (ms): Min: 2.1, Avg: 4.2, Max: 5.3, Diff: 3.3, Sum: 158.3]
         [Processed Buffers: Min: 1, Avg: 6.0, Max: 23, Diff: 22, Sum: 228]
      [Scan RS (ms): Min: 0.1, Avg: 0.5, Max: 0.9, Diff: 0.8, Sum: 18.6]
      [Code Root Scanning (ms): Min: 0.0, Avg: 0.0, Max: 0.0, Diff: 0.0,
Sum: 0.2]
      [Object Copy (ms): Min: 25.9, Avg: 26.8, Max: 28.5, Diff: 2.6, Sum:
1017.5]
      [Termination (ms): Min: 0.0, Avg: 0.0, Max: 0.0, Diff: 0.0, Sum: 0.3]
         [Termination Attempts: Min: 1, Avg: 1.0, Max: 1, Diff: 0, Sum: 38]
      [GC Worker Other (ms): Min: 0.0, Avg: 0.2, Max: 0.5, Diff: 0.4, Sum:
8.9]
      [GC Worker Total (ms): Min: 32.0, Avg: 32.3, Max: 32.5, Diff: 0.6,
Sum: 1225.7]
      [GC Worker End (ms): Min: 9199247.6, Avg: 9199247.8, Max: 9199248.0,
Diff: 0.4]
   [Code Root Fixup: 0.2 ms]
   [Code Root Purge: 0.0 ms]
   [Clear CT: 2.7 ms]
   [Other: 6.1 ms]
      [Choose CSet: 0.0 ms]
      [Ref Proc: 0.6 ms]
      [Ref Enq: 0.0 ms]
      [Redirty Cards: 0.4 ms]
      [Humongous Register: 0.1 ms]
      [Humongous Reclaim: 0.0 ms]
      [Free CSet: 4.4 ms]
   [Eden: 18.4G(18.4G)->0.0B(18.5G) Survivors: 400.0M->232.0M Heap:
21.6G(31.2G)->3110.7M(31.2G)]
 [Times: user=1.28 sys=0.01, real=0.04 secs]
Heap
 garbage-first heap   total 32768000K, used 14908229K [0x00007f9b38000000,
0x00007f9b38807d00, 0x00007fa308000000)
  region size 8192K, 1458 young (11943936K), 29 survivors (237568K)
 Metaspace       used 43388K, capacity 44288K, committed 44492K, reserved
1089536K
  class space    used 4241K, capacity 4487K, committed 4556K, reserved
1048576K


Everything looks normal for this GC, but it happened at 8:38 and the JVM
freeze happened at around 8:40.
And this picture shows what the Java heap looked like in the last minutes
before the node went down. (Please add four hours here to match the log time,
because of a timezone difference.)
Screen_Shot_2018-06-08_at_17.png
<http://apache-ignite-users.70518.x6.nabble.com/file/t1346/Screen_Shot_2018-06-08_at_17.png>  

So my questions are: why did this node go down?
What was the JVM doing in the last minutes before the node went down?
Why aren't there any logs from those last minutes?
And why isn't the segmentationPolicy RESTART_JVM working?



--
Sent from: http://apache-ignite-users.70518.x6.nabble.com/

Re: Node pause for no obvious reason

Posted by Andrey Mashenkov <an...@gmail.com>.
Possibly there is a race.
I've created a ticket for this [1].


[1] https://issues.apache.org/jira/browse/IGNITE-8751

On Fri, Jun 8, 2018 at 4:56 PM, Andrey Mashenkov <andrey.mashenkov@gmail.com> wrote:

> Hi,
>
> Looks like the node was segmented due to a long JVM pause.
> There are two "long JVM pause" messages in the log, and a suspiciously long
> checkpoint:
>
> Checkpoint finished [cpId=77cf2fa2-2a9f-48ea-bdeb-dda81b15dac1,
> pages=2050858, markPos=FileWALPointer [idx=2051, fileOff=38583904,
> len=15981], walSegmentsCleared=0, markDuration=920ms, pagesWrite=12002ms,
> fsync=965250ms, total=978172ms]
>
>
> Do you use HDD? Are you sure no swapping is happening?
> Do you have enough free space on disk?
> Is there any other heavy process on the server that might take too much CPU
> time, so that the JVM was descheduled from the CPU for too long?
>
>
> It looks weird that the JVM was not restarted. We have to check such a case.
>



-- 
Best regards,
Andrey V. Mashenkov

Re: Node pause for no obvious reason

Posted by dkarachentsev <dk...@gridgain.com>.
Hi,

Check the system logs for that time; maybe there was some system freeze. Also
add more information to the GC logs, for example safepoint pauses:
-XX:+PrintGCApplicationStoppedTime -XX:+PrintGCApplicationConcurrentTime.
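
For reference, a sketch of the start command from the original post with those
two flags added (everything except the two new options is copied from Ray's
command; adjust the heap sizes and log path as needed):

./ignite.sh -J-Xmx32000m -J-Xms32000m -J-XX:+UseG1GC \
    -J-XX:+ScavengeBeforeFullGC -J-XX:+DisableExplicitGC -J-XX:+AlwaysPreTouch \
    -J-XX:+PrintGCDetails -J-XX:+PrintGCTimeStamps -J-XX:+PrintGCDateStamps \
    -J-XX:+PrintAdaptiveSizePolicy \
    -J-XX:+PrintGCApplicationStoppedTime -J-XX:+PrintGCApplicationConcurrentTime \
    -J-Xloggc:/ignitegc-$(date +%Y_%m_%d-%H_%M).log config/persistent-config.xml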

Thanks!
-Dmitry



--
Sent from: http://apache-ignite-users.70518.x6.nabble.com/

Re: Node pause for no obvious reason

Posted by Ray <ra...@cisco.com>.
I don't have dstat logs.
But I think these charts show the same information as dstat logs would.
CPU usage
CPU.png <http://apache-ignite-users.70518.x6.nabble.com/file/t1346/CPU.png>  
Memory usage
Memory.png
<http://apache-ignite-users.70518.x6.nabble.com/file/t1346/Memory.png>  
Swap
swap.png
<http://apache-ignite-users.70518.x6.nabble.com/file/t1346/swap.png>  
Server load
load.png
<http://apache-ignite-users.70518.x6.nabble.com/file/t1346/load.png>  
Disk
disk.png
<http://apache-ignite-users.70518.x6.nabble.com/file/t1346/disk.png>  

We're using bare-metal servers for Ignite; each server has 768 GB of memory
and a 56-core CPU.
One node per server.
I don't think hardware is the bottleneck here.

Will enabling DEBUG logs help in this case?
In my understanding, if the JVM is in a paused state Ignite can't produce any
logs even if DEBUG-level logging is enabled, right?



--
Sent from: http://apache-ignite-users.70518.x6.nabble.com/

Re: Node pause for no obvious reason

Posted by Andrey Mashenkov <an...@gmail.com>.
A checkpoint can block other threads only if another checkpoint needs to be
started while the previous one is still running.
But it seems the checkpoint is not the cause here.

It is hard to tell what was going on on the server. Maybe you have some dstat
logs?

How many nodes do you start per machine?
Do you use Docker or any VM?

We have had bad experience running multiple Ignite nodes on the same physical
machine (Docker and VM) under high load, due to contention for resources.
You can try to either switch to one Ignite instance per machine, pin each
node's JVM to specific CPUs (as sketched below), reduce the Ignite thread pool
sizes, or tune the VM nodes' quotas.
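
For illustration, pinning a node's JVM to a fixed set of cores can be done
with taskset when launching it. A minimal sketch only: the core range 0-27 is
an arbitrary example, and the JVM arguments are trimmed for brevity:

# Run the Ignite JVM only on CPUs 0-27, leaving the remaining cores for other processes
taskset -c 0-27 ./ignite.sh -J-Xmx32000m -J-Xms32000m config/persistent-config.xml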




-- 
Best regards,
Andrey V. Mashenkov

Re: Node pause for no obvious reason

Posted by Ray <ra...@cisco.com>.
Hi, 

Please see the GC log and the picture I attached.
It looks like the GC is not taking very long.

Yes, the checkpoint is taking a long time to finish.
Could it be that the checkpoint thread has something to do with the node crash?
In my understanding, a checkpoint will not block other threads, right?

Yes, I'm using HDDs, and there's plenty of free space on the disk.
I disabled swapping using sysctl -w vm.swappiness=0, as this document recommends:
https://apacheignite.readme.io/docs/durable-memory-tuning#section-adjust-swappiness-settings/
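
A quick way to verify the setting and confirm that no swap is actually in use
(standard Linux commands, shown only as a sketch):

cat /proc/sys/vm/swappiness   # should print 0 after the sysctl change
swapon --summary              # active swap devices and how much of each is used
vmstat 5                      # watch the si/so columns for swap-in/swap-out activity
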
The CPU usage is normal when the node goes down.

What's weird is that the other 5 nodes have the same configuration and are
under the same heavy write load, and they didn't go down.




--
Sent from: http://apache-ignite-users.70518.x6.nabble.com/

Re: Node pause for no obvious reason

Posted by Andrey Mashenkov <an...@gmail.com>.
Hi,

Looks like the node was segmented due to a long JVM pause.
There are two "long JVM pause" messages in the log, and a suspiciously long checkpoint:

Checkpoint finished [cpId=77cf2fa2-2a9f-48ea-bdeb-dda81b15dac1,
pages=2050858, markPos=FileWALPointer [idx=2051, fileOff=38583904,
len=15981], walSegmentsCleared=0, markDuration=920ms, pagesWrite=12002ms,
fsync=965250ms, total=978172ms]


Do you use HDD? Are you sure no swapping is happening?
Do you have enough free space on disk?
Is there any other heavy process on the server that might take too much CPU
time, so that the JVM was descheduled from the CPU for too long?
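
A few standard commands that could help answer those questions on the affected
server (a sketch only; the work-directory path below is a placeholder):

df -h /path/to/ignite/work   # free space on the persistence volume (placeholder path)
free -m                      # memory usage, including swap totals
iostat -x 5                  # per-device utilization and await times (sysstat package)
top                          # other heavy processes competing for CPU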


It looks weird that the JVM was not restarted. We have to check such a case.




-- 
Best regards,
Andrey V. Mashenkov