Posted to user@ignite.apache.org by John Smith <ja...@gmail.com> on 2019/10/03 18:50:14 UTC

Node stopped.

So I have been monitoring my node and the same one seems to stop once in a
while.

https://www.dropbox.com/s/7n5qfsl5uyi1obt/ignite-logs.zip?dl=0

I have attached the GC logs and the Ignite logs. From what I can see in the
GC logs there are no big pauses, but I could be wrong.

The machine has 16GB of RAM, and the config is here:
https://www.dropbox.com/s/hkv38s3vce5a4sk/ignite-config.xml?dl=0

Here are the JVM settings...

if [ -z "$JVM_OPTS" ] ; then
    JVM_OPTS="-Xms2g -Xmx2g -server -XX:MaxMetaspaceSize=256m"
fi

JVM_OPTS="$JVM_OPTS -XX:+UseG1GC -verbose:gc -XX:+PrintGCDetails -Xloggc:/var/log/apache-ignite/gc.log"

JVM_OPTS="${JVM_OPTS} -Xss16m"
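
As a reference, here is a minimal sketch of extra flags that could also surface
total pause time and safepoint statistics, assuming JDK 8 (which the
-XX:+PrintGCDetails / -Xloggc style above implies):

# Log total time application threads were stopped (GC and other safepoints);
# the safepoint statistics themselves are printed to stdout.
JVM_OPTS="${JVM_OPTS} -XX:+PrintGCApplicationStoppedTime -XX:+PrintGCApplicationConcurrentTime"
JVM_OPTS="${JVM_OPTS} -XX:+PrintSafepointStatistics -XX:PrintSafepointStatisticsCount=1"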

Re: Node stopped.

Posted by John Smith <ja...@gmail.com>.
Could a "high" packet loss cause issues?

netstat -i
Kernel Interface table
Iface      MTU    RX-OK RX-ERR RX-DRP RX-OVR    TX-OK TX-ERR TX-DRP TX-OVR
Flg
ens3      1500 230327840      0 2878982 0      195927467      0      0
 0 BMRU
lo       65536    32487      0      0 0         32487      0      0      0
LRU

I have noticed that some other machines only have around 10 dropped packets,
whereas this machine has millions, even though that's only about 1% of its
traffic.
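
For what it's worth, a quick sketch for checking whether the drop counter is
still growing (it uses the Linux sysfs counters; "ens3" is taken from the
netstat output above):

# Print the RX drop count for ens3 every 10 seconds. A static value means the
# drops are historical; a steadily growing value means the interface is still
# dropping packets.
while true; do
  echo "$(date +%T) $(cat /sys/class/net/ens3/statistics/rx_dropped)"
  sleep 10
done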

On Wed, 30 Oct 2019 at 10:38, John Smith <ja...@gmail.com> wrote:

> We have done two things so far...
>
> 1- Disabled client metrics on all clients.
> 2- We noticed we had some DNS issues, Ubuntu introduced new DNS system, so
> we made some fixes here. If somehow nodes where trying to reconnect to each
> other using DNS, then this may help the situation.
>
> So far it seems stable. I'll keep you posted. Maybe eventually I will try
> also clientFailureDetectionTimeout()
>
> On Wed, 30 Oct 2019 at 07:11, Ilya Kasnacheev <il...@gmail.com>
> wrote:
>
>> Hello!
>>
>> I think the node has determined that it need to shut down:
>> [22:26:31,800][SEVERE][disco-event-worker-#42%xxxxxx%][FailureProcessor]
>> Ignite node is in invalid state due to a critical failure.
>>
>> You can tune failure handler to avoid this behavior:
>> https://apacheignite.readme.io/docs/critical-failures-handling
>>
>> WRT clientFailureDetectionTimeout, it should be set on server nodes.
>>
>> Regards,
>> --
>> Ilya Kasnacheev
>>
>>
>> вт, 29 окт. 2019 г. в 22:21, John Smith <ja...@gmail.com>:
>>
>>> Hi, the GC logs where also provided and we determined, there was no GC
>>> pressure. At least what I understood from the thread above. We also enabled
>>> some extra thread info for GC and that showed nothing also.
>>>
>>> On the client side I also see this: Possible too long JVM pause: 25468
>>> milliseconds. My SERVER heap size is at -Xmx3g and I even tried -Xmx2g My
>>> client is at 1G
>>> 1- So the SERVER node is stopped because the client was misbehaving?
>>> 2- clientFailureDetectionTimeout should be set on all nodes including
>>> the client?
>>>
>>> On Tue, 29 Oct 2019 at 11:14, Ilya Kasnacheev <il...@gmail.com>
>>> wrote:
>>>
>>>> Hello!
>>>>
>>>> Well, this is the relevant part:
>>>> [22:26:21,816][WARNING][jvm-pause-detector-worker][IgniteKernal%xxxxxx] *Possible
>>>> too long JVM pause: 55705 milliseconds*.
>>>> [22:26:21,827][WARNING][tcp-disco-client-message-worker-#10%xxxxxx%][TcpDiscoverySpi]
>>>> Client node considered as unreachable and will be dropped from cluster,
>>>> because no metrics update messages received in interval:
>>>> TcpDiscoverySpi.clientFailureDetectionTimeout() ms. It may be caused by
>>>> network problems or long GC pause on client node, try to increase this
>>>> parameter. [nodeId=b07182d0-bf70-4318-9fe3-d7d5228bd6ef,
>>>> clientFailureDetectionTimeout=30000]
>>>> [22:26:21,839][WARNING][tcp-disco-client-message-worker-#12%xxxxxx%][TcpDiscoverySpi]
>>>> Client node considered as unreachable and will be dropped from cluster,
>>>> because no metrics update messages received in interval:
>>>> TcpDiscoverySpi.clientFailureDetectionTimeout() ms. It may be caused by
>>>> network problems or long GC pause on client node, try to increase this
>>>> parameter. [nodeId=302cff60-b88d-40da-9e12-b955e6bf973d,
>>>> clientFailureDetectionTimeout=30000]
>>>> [22:26:21,847][SEVERE][ttl-cleanup-worker-#48%xxxxxx%][G] Blocked
>>>> system-critical thread has been detected. This can lead to cluster-wide
>>>> undefined behaviour [threadName=partition-exchanger, blockedFor=57s]
>>>> [22:26:21,850][INFO][tcp-disco-sock-reader-#4%xxxxxx%][TcpDiscoverySpi]
>>>> Finished serving remote node connection [rmtAddr=/xxx.xxx.xxx.63:51827,
>>>> rmtPort=51827
>>>> [22:26:21,933][WARNING][ttl-cleanup-worker-#48%xxxxxx%][G] Thread
>>>> [name="exchange-worker-#43%xxxxxx%", id=89, state=TIMED_WAITING,
>>>> blockCnt=6, waitCnt=486312]
>>>>     Lock
>>>> [object=java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject@491d897d,
>>>> ownerName=null, ownerId=-1]
>>>>
>>>> Long GC will cause node to segment.
>>>>
>>>> Try either decreasing heap size / making sure full GC does not happen,
>>>> or increase failureDetectionTimeout (clientFailureDetectionTimeout if set)
>>>> on all nodes.
>>>>
>>>> Regards,
>>>> --
>>>> Ilya Kasnacheev
>>>>
>>>>
>>>> вт, 29 окт. 2019 г. в 18:02, John Smith <ja...@gmail.com>:
>>>>
>>>>> Hi, in my original post I had included full logs of the stopped node...
>>>>>
>>>>> And when you say to set localhost, which property do you mean? Is it
>>>>> IgniteConfiguration.setLocalHost()?
>>>>>
>>>>> On Tue, 29 Oct 2019 at 03:48, Ilya Kasnacheev <
>>>>> ilya.kasnacheev@gmail.com> wrote:
>>>>>
>>>>>> Hello!
>>>>>>
>>>>>> I think it is possible (such as, node tries to connect to wrong
>>>>>> address / itself) but more detailed analysis of logs is needed.
>>>>>>
>>>>>> You can try specifying localHost property in IgniteConfiguration to
>>>>>> make sure correct address is used.
>>>>>>
>>>>>> Regards,
>>>>>> --
>>>>>> Ilya Kasnacheev
>>>>>>
>>>>>>
>>>>>> пн, 28 окт. 2019 г. в 23:16, John Smith <ja...@gmail.com>:
>>>>>>
>>>>>>> Ok cool. Thanks. I'm just trying to figure scenarios where it may
>>>>>>> happen.
>>>>>>>
>>>>>>> When I connect with Ignite Visor and run the "top" topology command,
>>>>>>> the client node is bound to multiple addresses, about 12...
>>>>>>>
>>>>>>> Ignite is running on plain regular VMs while the client node is
>>>>>>> running in a containerized env but using host network (which is regular VM).
>>>>>>>
>>>>>>> So I'm wondering if that could also break something by making it try
>>>>>>> to connect to one of those multiple addresses?
>>>>>>>
>>>>>>> On Mon, 28 Oct 2019 at 12:08, Ilya Kasnacheev <
>>>>>>> ilya.kasnacheev@gmail.com> wrote:
>>>>>>>
>>>>>>>> Hello!
>>>>>>>>
>>>>>>>> Sure, if nodes can't reach each other, eventually they may segment
>>>>>>>> and stop.
>>>>>>>>
>>>>>>>> Regards,
>>>>>>>> --
>>>>>>>> Ilya Kasnacheev
>>>>>>>>
>>>>>>>>
>>>>>>>> пт, 25 окт. 2019 г. в 00:08, John Smith <ja...@gmail.com>:
>>>>>>>>
>>>>>>>>> Is it possible this is somehow causing the issue of the node
>>>>>>>>> stopping?
>>>>>>>>>
>>>>>>>>> On Thu, 24 Oct 2019 at 11:24, Ilya Kasnacheev <
>>>>>>>>> ilya.kasnacheev@gmail.com> wrote:
>>>>>>>>>
>>>>>>>>>> Hello!
>>>>>>>>>>
>>>>>>>>>> This likely means that you have reachability problems in your
>>>>>>>>>> cluster, such as, xxx.xxx.xxx.68 can connect to xxx.xxx.xxx.82
>>>>>>>>>> (on range 47100-47200) but not the other way around.
>>>>>>>>>>
>>>>>>>>>> Regards,
>>>>>>>>>> --
>>>>>>>>>> Ilya Kasnacheev
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> пн, 21 окт. 2019 г. в 19:36, John Smith <ja...@gmail.com>:
>>>>>>>>>>
>>>>>>>>>>> I also see this printing every few seconds on my client
>>>>>>>>>>> application...
>>>>>>>>>>> org.apache.ignite.spi.communication.tcp.TcpCommunicationSpi Accepted
>>>>>>>>>>> incoming communication connection [locAddr=/xxx.xxx.xxx.68:47101,
>>>>>>>>>>> rmtAddr=/xxx.xxx.xxx.82:49816
>>>>>>>>>>>
>>>>>>>>>>> On Mon, 21 Oct 2019 at 12:04, John Smith <ja...@gmail.com>
>>>>>>>>>>> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> Hi, thanks. I already made sure that each ignite VM runs on a
>>>>>>>>>>>> separate host within our cloud. I'm not doing any of that migration stuff.
>>>>>>>>>>>>
>>>>>>>>>>>> Also recently disabled metrics igniteConfig
>>>>>>>>>>>> .setMetricsLogFrequency(0); Just to make sure it doesn't get
>>>>>>>>>>>> too chatty. But i doubt this would affect it...
>>>>>>>>>>>>
>>>>>>>>>>>> Should I maybe set the node timeout a bit higher then 30
>>>>>>>>>>>> seconds maybe put it to 60 seconds? I remember people suggesting this but
>>>>>>>>>>>> not sure...
>>>>>>>>>>>>
>>>>>>>>>>>> On Fri, 18 Oct 2019 at 06:09, Denis Mekhanikov <
>>>>>>>>>>>> dmekhanikov@gmail.com> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> The following documentation page has some useful points on
>>>>>>>>>>>>> deployment in a virtualised environment:
>>>>>>>>>>>>> https://apacheignite.readme.io/docs/vmware-deployment
>>>>>>>>>>>>>
>>>>>>>>>>>>> Denis
>>>>>>>>>>>>> On 17 Oct 2019, 17:41 +0300, John Smith <
>>>>>>>>>>>>> java.dev.mtl@gmail.com>, wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>> Ok I have metribeat running on the VM hopefully I will see
>>>>>>>>>>>>> something...
>>>>>>>>>>>>>
>>>>>>>>>>>>> On Thu, 17 Oct 2019 at 05:09, Denis Mekhanikov <
>>>>>>>>>>>>> dmekhanikov@gmail.com> wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>>> There are no long pauses in the GC logs, so it must be the
>>>>>>>>>>>>>> whole VM pause.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Denis
>>>>>>>>>>>>>> On 16 Oct 2019, 23:07 +0300, John Smith <
>>>>>>>>>>>>>> java.dev.mtl@gmail.com>, wrote:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Sorry here is the gc logs for all 3 machines:
>>>>>>>>>>>>>> https://www.dropbox.com/s/chbbxigahd4v9di/gc-logs.zip?dl=0
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> On Wed, 16 Oct 2019 at 15:49, John Smith <
>>>>>>>>>>>>>> java.dev.mtl@gmail.com> wrote:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Hi, so it happened again here is my latest gc.log stats:
>>>>>>>>>>>>>>> https://gceasy.io/diamondgc-report.jsp?oTxnId_value=a215d573-d1cf-4d53-acf1-9001432bb28e
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Everything seems ok to me. I also have Elasticsearch
>>>>>>>>>>>>>>> Metricbeat running, the CPU usage looked normal at the time.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> On Thu, 10 Oct 2019 at 13:05, Denis Mekhanikov <
>>>>>>>>>>>>>>> dmekhanikov@gmail.com> wrote:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Unfortunately, I don’t.
>>>>>>>>>>>>>>>> You can ask the VM vendor or the cloud provider (if you use
>>>>>>>>>>>>>>>> any) for a proper tooling or logs.
>>>>>>>>>>>>>>>> Make sure, that there is no such step in the VM’s lifecycle
>>>>>>>>>>>>>>>> that makes it freeze for a minute.
>>>>>>>>>>>>>>>> Also make sure that the physical CPU is not overutilized
>>>>>>>>>>>>>>>> and no VMs that run on it are starving.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Denis
>>>>>>>>>>>>>>>> On 10 Oct 2019, 19:03 +0300, John Smith <
>>>>>>>>>>>>>>>> java.dev.mtl@gmail.com>, wrote:
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Do you know of any good tools I can use to check the VM?
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> On Thu, 10 Oct 2019 at 11:38, Denis Mekhanikov <
>>>>>>>>>>>>>>>> dmekhanikov@gmail.com> wrote:
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> > Hi Dennis, so are you saying I should enable GC logs +
>>>>>>>>>>>>>>>>> the safe point logs as well?
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Having safepoint statistics in your GC logs may be useful,
>>>>>>>>>>>>>>>>> so I recommend enabling them for troubleshooting purposes.
>>>>>>>>>>>>>>>>> Check the lifecycle of your virtual machines. There is a
>>>>>>>>>>>>>>>>> high chance that the whole machine is frozen, not just the Ignite node.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Denis
>>>>>>>>>>>>>>>>> On 10 Oct 2019, 18:25 +0300, John Smith <
>>>>>>>>>>>>>>>>> java.dev.mtl@gmail.com>, wrote:
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Hi Dennis, so are you saying I should enable GC logs + the
>>>>>>>>>>>>>>>>> safe point logs as well?
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> On Thu, 10 Oct 2019 at 11:22, John Smith <
>>>>>>>>>>>>>>>>> java.dev.mtl@gmail.com> wrote:
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> You are correct, it is running in a VM.
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> On Thu, 10 Oct 2019 at 10:11, Denis Mekhanikov <
>>>>>>>>>>>>>>>>>> dmekhanikov@gmail.com> wrote:
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> Hi!
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> There are the following messages in the logs:
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> [22:26:21,816][WARNING][jvm-pause-detector-worker][IgniteKernal%xxxxxx]
>>>>>>>>>>>>>>>>>>> Possible too long JVM pause: *55705 milliseconds*.
>>>>>>>>>>>>>>>>>>> ...
>>>>>>>>>>>>>>>>>>> [22:26:21,847][SEVERE][ttl-cleanup-worker-#48%xxxxxx%][G]
>>>>>>>>>>>>>>>>>>> Blocked system-critical thread has been detected. This can lead to
>>>>>>>>>>>>>>>>>>> cluster-wide undefined behaviour [threadName=partition-exchanger,
>>>>>>>>>>>>>>>>>>> blockedFor=*57s*]
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> Looks like the JVM was paused for almost a minute. It
>>>>>>>>>>>>>>>>>>> doesn’t seem to be caused by a garbage collection, since there is no
>>>>>>>>>>>>>>>>>>> evidence of GC pressure in the GC log. Usually such big pauses happen in
>>>>>>>>>>>>>>>>>>> virtualised environments when backups are captured from machines or they
>>>>>>>>>>>>>>>>>>> just don’t have enough CPU time.
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> Looking at safepoint statistics may also reveal some
>>>>>>>>>>>>>>>>>>> interesting details. You can learn about safepoints here:
>>>>>>>>>>>>>>>>>>> https://blog.gceasy.io/2016/12/22/total-time-for-which-application-threads-were-stopped/
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> Denis
>>>>>>>>>>>>>>>>>>> On 9 Oct 2019, 23:14 +0300, John Smith <
>>>>>>>>>>>>>>>>>>> java.dev.mtl@gmail.com>, wrote:
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> So the error sais to set clientFailureDetectionTimeout=
>>>>>>>>>>>>>>>>>>> 30000
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> 1- Do I put a higher value than 30000?
>>>>>>>>>>>>>>>>>>> 2- Do I do it on the client or the server nodes or all
>>>>>>>>>>>>>>>>>>> nodes?
>>>>>>>>>>>>>>>>>>> 3- Also if a client is misbehaving why shutoff the
>>>>>>>>>>>>>>>>>>> server node?
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> On Thu, 3 Oct 2019 at 21:02, John Smith <
>>>>>>>>>>>>>>>>>>> java.dev.mtl@gmail.com> wrote:
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> But if it's the client node that's failing why is the
>>>>>>>>>>>>>>>>>>>> server node stopping? I'm pretty sure we do verry simple put and get
>>>>>>>>>>>>>>>>>>>> operations. All the client nodes are started as client=true
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> On Thu., Oct. 3, 2019, 4:18 p.m. Denis Magda, <
>>>>>>>>>>>>>>>>>>>> dmagda@apache.org> wrote:
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> Hi John,
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> I don't see any GC pressure or STW pauses either. If
>>>>>>>>>>>>>>>>>>>>> not GC then it might have been caused by a network glitch or some
>>>>>>>>>>>>>>>>>>>>> long-running operation started by the app. These logs statement
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> [22:26:21,827][WARNING][tcp-disco-client-message-worker-#10%xxxxxx%][TcpDiscoverySpi]
>>>>>>>>>>>>>>>>>>>>> Client node considered as unreachable and will be dropped from cluster,
>>>>>>>>>>>>>>>>>>>>> because no metrics update messages received in interval:
>>>>>>>>>>>>>>>>>>>>> TcpDiscoverySpi.clientFailureDetectionTimeout() ms. It may be caused by
>>>>>>>>>>>>>>>>>>>>> network problems or long GC pause on client node, try to increase this
>>>>>>>>>>>>>>>>>>>>> parameter. [nodeId=b07182d0-bf70-4318-9fe3-d7d5228bd6ef,
>>>>>>>>>>>>>>>>>>>>> clientFailureDetectionTimeout=30000]
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> [22:26:21,839][WARNING][tcp-disco-client-message-worker-#12%xxxxxx%][TcpDiscoverySpi]
>>>>>>>>>>>>>>>>>>>>> Client node considered as unreachable and will be dropped from cluster,
>>>>>>>>>>>>>>>>>>>>> because no metrics update messages received in interval:
>>>>>>>>>>>>>>>>>>>>> TcpDiscoverySpi.clientFailureDetectionTimeout() ms. It may be caused by
>>>>>>>>>>>>>>>>>>>>> network problems or long GC pause on client node, try to increase this
>>>>>>>>>>>>>>>>>>>>> parameter. [nodeId=302cff60-b88d-40da-9e12-b955e6bf973d,
>>>>>>>>>>>>>>>>>>>>> clientFailureDetectionTimeout=30000]
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> [22:26:21,847][SEVERE][ttl-cleanup-worker-#48%xxxxxx%][G]
>>>>>>>>>>>>>>>>>>>>> Blocked system-critical thread has been detected. This can lead to
>>>>>>>>>>>>>>>>>>>>> cluster-wide undefined behaviour [threadName=partition-exchanger,
>>>>>>>>>>>>>>>>>>>>> blockedFor=57s]
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> 22:26:21,954][SEVERE][ttl-cleanup-worker-#48%xxxxxx%][]
>>>>>>>>>>>>>>>>>>>>> Critical system error detected. Will be handled accordingly to configured
>>>>>>>>>>>>>>>>>>>>> handler [hnd=StopNodeOrHaltFailureHandler [tryStop=false, timeout=0,
>>>>>>>>>>>>>>>>>>>>> super=AbstractFailureHandler
>>>>>>>>>>>>>>>>>>>>> [ignoredFailureTypes=[SYSTEM_WORKER_BLOCKED]]], failureCtx=FailureContext
>>>>>>>>>>>>>>>>>>>>> [type=SYSTEM_WORKER_BLOCKED, err=class o.a.i.IgniteException: GridWorker
>>>>>>>>>>>>>>>>>>>>> [name=partition-exchanger, igniteInstanceName=xxxxxx, finished=false,
>>>>>>>>>>>>>>>>>>>>> heartbeatTs=1568931981805]]]
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> -
>>>>>>>>>>>>>>>>>>>>> Denis
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> On Thu, Oct 3, 2019 at 11:50 AM John Smith <
>>>>>>>>>>>>>>>>>>>>> java.dev.mtl@gmail.com> wrote:
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>> So I have been monitoring my node and the same one
>>>>>>>>>>>>>>>>>>>>>> seems to stop once in a while.
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>> https://www.dropbox.com/s/7n5qfsl5uyi1obt/ignite-logs.zip?dl=0
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>> I have attached the GC logs and the ignite logs. From
>>>>>>>>>>>>>>>>>>>>>> what I see from gc.logs I don't see big pauses. I could be wrong.
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>> The machine is 16GB and I have the configs here:
>>>>>>>>>>>>>>>>>>>>>> https://www.dropbox.com/s/hkv38s3vce5a4sk/ignite-config.xml?dl=0
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>> Here are the JVM settings...
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>> if [ -z "$JVM_OPTS" ] ; then
>>>>>>>>>>>>>>>>>>>>>>     JVM_OPTS="-Xms2g -Xmx2g -server
>>>>>>>>>>>>>>>>>>>>>> -XX:MaxMetaspaceSize=256m"
>>>>>>>>>>>>>>>>>>>>>> fi
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>> JVM_OPTS="$JVM_OPTS -XX:+UseG1GC -verbose:gc
>>>>>>>>>>>>>>>>>>>>>> -XX:+PrintGCDetails -Xloggc:/var/log/apache-ignite/gc.log"
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>> JVM_OPTS="${JVM_OPTS} -Xss16m"
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>

Re: Node stopped.

Posted by John Smith <ja...@gmail.com>.
We have done two things so far...

1- Disabled client metrics on all clients.
2- We noticed we had some DNS issues (Ubuntu introduced a new DNS system), so
we made some fixes there. If nodes were somehow trying to reconnect to each
other using DNS, then this may help the situation.

So far it seems stable. I'll keep you posted. Maybe eventually I will also try
clientFailureDetectionTimeout().

On Wed, 30 Oct 2019 at 07:11, Ilya Kasnacheev <il...@gmail.com>
wrote:

> Hello!
>
> I think the node has determined that it need to shut down:
> [22:26:31,800][SEVERE][disco-event-worker-#42%xxxxxx%][FailureProcessor]
> Ignite node is in invalid state due to a critical failure.
>
> You can tune failure handler to avoid this behavior:
> https://apacheignite.readme.io/docs/critical-failures-handling
>
> WRT clientFailureDetectionTimeout, it should be set on server nodes.
>
> Regards,
> --
> Ilya Kasnacheev
>
>
> вт, 29 окт. 2019 г. в 22:21, John Smith <ja...@gmail.com>:
>
>> Hi, the GC logs where also provided and we determined, there was no GC
>> pressure. At least what I understood from the thread above. We also enabled
>> some extra thread info for GC and that showed nothing also.
>>
>> On the client side I also see this: Possible too long JVM pause: 25468
>> milliseconds. My SERVER heap size is at -Xmx3g and I even tried -Xmx2g My
>> client is at 1G
>> 1- So the SERVER node is stopped because the client was misbehaving?
>> 2- clientFailureDetectionTimeout should be set on all nodes including the
>> client?
>>
>> On Tue, 29 Oct 2019 at 11:14, Ilya Kasnacheev <il...@gmail.com>
>> wrote:
>>
>>> Hello!
>>>
>>> Well, this is the relevant part:
>>> [22:26:21,816][WARNING][jvm-pause-detector-worker][IgniteKernal%xxxxxx] *Possible
>>> too long JVM pause: 55705 milliseconds*.
>>> [22:26:21,827][WARNING][tcp-disco-client-message-worker-#10%xxxxxx%][TcpDiscoverySpi]
>>> Client node considered as unreachable and will be dropped from cluster,
>>> because no metrics update messages received in interval:
>>> TcpDiscoverySpi.clientFailureDetectionTimeout() ms. It may be caused by
>>> network problems or long GC pause on client node, try to increase this
>>> parameter. [nodeId=b07182d0-bf70-4318-9fe3-d7d5228bd6ef,
>>> clientFailureDetectionTimeout=30000]
>>> [22:26:21,839][WARNING][tcp-disco-client-message-worker-#12%xxxxxx%][TcpDiscoverySpi]
>>> Client node considered as unreachable and will be dropped from cluster,
>>> because no metrics update messages received in interval:
>>> TcpDiscoverySpi.clientFailureDetectionTimeout() ms. It may be caused by
>>> network problems or long GC pause on client node, try to increase this
>>> parameter. [nodeId=302cff60-b88d-40da-9e12-b955e6bf973d,
>>> clientFailureDetectionTimeout=30000]
>>> [22:26:21,847][SEVERE][ttl-cleanup-worker-#48%xxxxxx%][G] Blocked
>>> system-critical thread has been detected. This can lead to cluster-wide
>>> undefined behaviour [threadName=partition-exchanger, blockedFor=57s]
>>> [22:26:21,850][INFO][tcp-disco-sock-reader-#4%xxxxxx%][TcpDiscoverySpi]
>>> Finished serving remote node connection [rmtAddr=/xxx.xxx.xxx.63:51827,
>>> rmtPort=51827
>>> [22:26:21,933][WARNING][ttl-cleanup-worker-#48%xxxxxx%][G] Thread
>>> [name="exchange-worker-#43%xxxxxx%", id=89, state=TIMED_WAITING,
>>> blockCnt=6, waitCnt=486312]
>>>     Lock
>>> [object=java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject@491d897d,
>>> ownerName=null, ownerId=-1]
>>>
>>> Long GC will cause node to segment.
>>>
>>> Try either decreasing heap size / making sure full GC does not happen,
>>> or increase failureDetectionTimeout (clientFailureDetectionTimeout if set)
>>> on all nodes.
>>>
>>> Regards,
>>> --
>>> Ilya Kasnacheev
>>>
>>>
>>> вт, 29 окт. 2019 г. в 18:02, John Smith <ja...@gmail.com>:
>>>
>>>> Hi, in my original post I had included full logs of the stopped node...
>>>>
>>>> And when you say set to localhost you which property do you mean is it
>>>> the one IgniteConfiguration setLocalhost()?
>>>>
>>>> On Tue, 29 Oct 2019 at 03:48, Ilya Kasnacheev <
>>>> ilya.kasnacheev@gmail.com> wrote:
>>>>
>>>>> Hello!
>>>>>
>>>>> I think it is possible (such as, node tries to connect to wrong
>>>>> address / itself) but more detailed analysis of logs is needed.
>>>>>
>>>>> You can try specifying localHost property in IgniteConfiguration to
>>>>> make sure correct address is used.
>>>>>
>>>>> Regards,
>>>>> --
>>>>> Ilya Kasnacheev
>>>>>
>>>>>
>>>>> пн, 28 окт. 2019 г. в 23:16, John Smith <ja...@gmail.com>:
>>>>>
>>>>>> Ok cool. Thanks. I'm just trying to figure scenarios where it may
>>>>>> happen.
>>>>>>
>>>>>> When I connect with ignite visor and run the "top" topology command.
>>>>>> The client node is bound to multiple addresses about 12...
>>>>>>
>>>>>> Ignite is running on plain regular VMs while the client node is
>>>>>> running in a containerized env but using host network (which is regular VM).
>>>>>>
>>>>>> So wondering if that would also screw it up something, by trying to
>>>>>> connect to one of those multiple addresses?
>>>>>>
>>>>>> On Mon, 28 Oct 2019 at 12:08, Ilya Kasnacheev <
>>>>>> ilya.kasnacheev@gmail.com> wrote:
>>>>>>
>>>>>>> Hello!
>>>>>>>
>>>>>>> Sure, if nodes can't reach each other, eventually they may segment
>>>>>>> and stop.
>>>>>>>
>>>>>>> Regards,
>>>>>>> --
>>>>>>> Ilya Kasnacheev
>>>>>>>
>>>>>>>
>>>>>>> пт, 25 окт. 2019 г. в 00:08, John Smith <ja...@gmail.com>:
>>>>>>>
>>>>>>>> Is it possible this is somehow causing the issue of the node
>>>>>>>> stopping?
>>>>>>>>
>>>>>>>> On Thu, 24 Oct 2019 at 11:24, Ilya Kasnacheev <
>>>>>>>> ilya.kasnacheev@gmail.com> wrote:
>>>>>>>>
>>>>>>>>> Hello!
>>>>>>>>>
>>>>>>>>> This likely means that you have reachability problems in your
>>>>>>>>> cluster, such as, xxx.xxx.xxx.68 can connect to xxx.xxx.xxx.82
>>>>>>>>> (on range 47100-47200) but not the other way around.
>>>>>>>>>
>>>>>>>>> Regards,
>>>>>>>>> --
>>>>>>>>> Ilya Kasnacheev
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> пн, 21 окт. 2019 г. в 19:36, John Smith <ja...@gmail.com>:
>>>>>>>>>
>>>>>>>>>> I also see this printing every few seconds on my client
>>>>>>>>>> application...
>>>>>>>>>> org.apache.ignite.spi.communication.tcp.TcpCommunicationSpi Accepted
>>>>>>>>>> incoming communication connection [locAddr=/xxx.xxx.xxx.68:47101,
>>>>>>>>>> rmtAddr=/xxx.xxx.xxx.82:49816
>>>>>>>>>>
>>>>>>>>>> On Mon, 21 Oct 2019 at 12:04, John Smith <ja...@gmail.com>
>>>>>>>>>> wrote:
>>>>>>>>>>
>>>>>>>>>>> Hi, thanks. I already made sure that each ignite VM runs on a
>>>>>>>>>>> separate host within our cloud. I'm not doing any of that migration stuff.
>>>>>>>>>>>
>>>>>>>>>>> Also recently disabled metrics igniteConfig
>>>>>>>>>>> .setMetricsLogFrequency(0); Just to make sure it doesn't get
>>>>>>>>>>> too chatty. But i doubt this would affect it...
>>>>>>>>>>>
>>>>>>>>>>> Should I maybe set the node timeout a bit higher then 30 seconds
>>>>>>>>>>> maybe put it to 60 seconds? I remember people suggesting this but not
>>>>>>>>>>> sure...
>>>>>>>>>>>
>>>>>>>>>>> On Fri, 18 Oct 2019 at 06:09, Denis Mekhanikov <
>>>>>>>>>>> dmekhanikov@gmail.com> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> The following documentation page has some useful points on
>>>>>>>>>>>> deployment in a virtualised environment:
>>>>>>>>>>>> https://apacheignite.readme.io/docs/vmware-deployment
>>>>>>>>>>>>
>>>>>>>>>>>> Denis
>>>>>>>>>>>> On 17 Oct 2019, 17:41 +0300, John Smith <ja...@gmail.com>,
>>>>>>>>>>>> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>> Ok I have metribeat running on the VM hopefully I will see
>>>>>>>>>>>> something...
>>>>>>>>>>>>
>>>>>>>>>>>> On Thu, 17 Oct 2019 at 05:09, Denis Mekhanikov <
>>>>>>>>>>>> dmekhanikov@gmail.com> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> There are no long pauses in the GC logs, so it must be the
>>>>>>>>>>>>> whole VM pause.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Denis
>>>>>>>>>>>>> On 16 Oct 2019, 23:07 +0300, John Smith <
>>>>>>>>>>>>> java.dev.mtl@gmail.com>, wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>> Sorry here is the gc logs for all 3 machines:
>>>>>>>>>>>>> https://www.dropbox.com/s/chbbxigahd4v9di/gc-logs.zip?dl=0
>>>>>>>>>>>>>
>>>>>>>>>>>>> On Wed, 16 Oct 2019 at 15:49, John Smith <
>>>>>>>>>>>>> java.dev.mtl@gmail.com> wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>>> Hi, so it happened again here is my latest gc.log stats:
>>>>>>>>>>>>>> https://gceasy.io/diamondgc-report.jsp?oTxnId_value=a215d573-d1cf-4d53-acf1-9001432bb28e
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Everything seems ok to me. I also have Elasticsearch
>>>>>>>>>>>>>> Metricbeat running, the CPU usage looked normal at the time.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> On Thu, 10 Oct 2019 at 13:05, Denis Mekhanikov <
>>>>>>>>>>>>>> dmekhanikov@gmail.com> wrote:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Unfortunately, I don’t.
>>>>>>>>>>>>>>> You can ask the VM vendor or the cloud provider (if you use
>>>>>>>>>>>>>>> any) for a proper tooling or logs.
>>>>>>>>>>>>>>> Make sure, that there is no such step in the VM’s lifecycle
>>>>>>>>>>>>>>> that makes it freeze for a minute.
>>>>>>>>>>>>>>> Also make sure that the physical CPU is not overutilized and
>>>>>>>>>>>>>>> no VMs that run on it are starving.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Denis
>>>>>>>>>>>>>>> On 10 Oct 2019, 19:03 +0300, John Smith <
>>>>>>>>>>>>>>> java.dev.mtl@gmail.com>, wrote:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Do you know of any good tools I can use to check the VM?
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> On Thu, 10 Oct 2019 at 11:38, Denis Mekhanikov <
>>>>>>>>>>>>>>> dmekhanikov@gmail.com> wrote:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> > Hi Dennis, so are you saying I should enable GC logs +
>>>>>>>>>>>>>>>> the safe point logs as well?
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Having safepoint statistics in your GC logs may be useful,
>>>>>>>>>>>>>>>> so I recommend enabling them for troubleshooting purposes.
>>>>>>>>>>>>>>>> Check the lifecycle of your virtual machines. There is a
>>>>>>>>>>>>>>>> high chance that the whole machine is frozen, not just the Ignite node.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Denis
>>>>>>>>>>>>>>>> On 10 Oct 2019, 18:25 +0300, John Smith <
>>>>>>>>>>>>>>>> java.dev.mtl@gmail.com>, wrote:
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Hi Dennis, so are you saying I should enable GC logs + the
>>>>>>>>>>>>>>>> safe point logs as well?
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> On Thu, 10 Oct 2019 at 11:22, John Smith <
>>>>>>>>>>>>>>>> java.dev.mtl@gmail.com> wrote:
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> You are correct, it is running in a VM.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> On Thu, 10 Oct 2019 at 10:11, Denis Mekhanikov <
>>>>>>>>>>>>>>>>> dmekhanikov@gmail.com> wrote:
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> Hi!
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> There are the following messages in the logs:
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> [22:26:21,816][WARNING][jvm-pause-detector-worker][IgniteKernal%xxxxxx]
>>>>>>>>>>>>>>>>>> Possible too long JVM pause: *55705 milliseconds*.
>>>>>>>>>>>>>>>>>> ...
>>>>>>>>>>>>>>>>>> [22:26:21,847][SEVERE][ttl-cleanup-worker-#48%xxxxxx%][G]
>>>>>>>>>>>>>>>>>> Blocked system-critical thread has been detected. This can lead to
>>>>>>>>>>>>>>>>>> cluster-wide undefined behaviour [threadName=partition-exchanger,
>>>>>>>>>>>>>>>>>> blockedFor=*57s*]
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> Looks like the JVM was paused for almost a minute. It
>>>>>>>>>>>>>>>>>> doesn’t seem to be caused by a garbage collection, since there is no
>>>>>>>>>>>>>>>>>> evidence of GC pressure in the GC log. Usually such big pauses happen in
>>>>>>>>>>>>>>>>>> virtualised environments when backups are captured from machines or they
>>>>>>>>>>>>>>>>>> just don’t have enough CPU time.
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> Looking at safepoint statistics may also reveal some
>>>>>>>>>>>>>>>>>> interesting details. You can learn about safepoints here:
>>>>>>>>>>>>>>>>>> https://blog.gceasy.io/2016/12/22/total-time-for-which-application-threads-were-stopped/
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> Denis
>>>>>>>>>>>>>>>>>> On 9 Oct 2019, 23:14 +0300, John Smith <
>>>>>>>>>>>>>>>>>> java.dev.mtl@gmail.com>, wrote:
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> So the error sais to set clientFailureDetectionTimeout=
>>>>>>>>>>>>>>>>>> 30000
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> 1- Do I put a higher value than 30000?
>>>>>>>>>>>>>>>>>> 2- Do I do it on the client or the server nodes or all
>>>>>>>>>>>>>>>>>> nodes?
>>>>>>>>>>>>>>>>>> 3- Also if a client is misbehaving why shutoff the server
>>>>>>>>>>>>>>>>>> node?
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> On Thu, 3 Oct 2019 at 21:02, John Smith <
>>>>>>>>>>>>>>>>>> java.dev.mtl@gmail.com> wrote:
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> But if it's the client node that's failing why is the
>>>>>>>>>>>>>>>>>>> server node stopping? I'm pretty sure we do verry simple put and get
>>>>>>>>>>>>>>>>>>> operations. All the client nodes are started as client=true
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> On Thu., Oct. 3, 2019, 4:18 p.m. Denis Magda, <
>>>>>>>>>>>>>>>>>>> dmagda@apache.org> wrote:
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> Hi John,
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> I don't see any GC pressure or STW pauses either. If
>>>>>>>>>>>>>>>>>>>> not GC then it might have been caused by a network glitch or some
>>>>>>>>>>>>>>>>>>>> long-running operation started by the app. These logs statement
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> [22:26:21,827][WARNING][tcp-disco-client-message-worker-#10%xxxxxx%][TcpDiscoverySpi]
>>>>>>>>>>>>>>>>>>>> Client node considered as unreachable and will be dropped from cluster,
>>>>>>>>>>>>>>>>>>>> because no metrics update messages received in interval:
>>>>>>>>>>>>>>>>>>>> TcpDiscoverySpi.clientFailureDetectionTimeout() ms. It may be caused by
>>>>>>>>>>>>>>>>>>>> network problems or long GC pause on client node, try to increase this
>>>>>>>>>>>>>>>>>>>> parameter. [nodeId=b07182d0-bf70-4318-9fe3-d7d5228bd6ef,
>>>>>>>>>>>>>>>>>>>> clientFailureDetectionTimeout=30000]
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> [22:26:21,839][WARNING][tcp-disco-client-message-worker-#12%xxxxxx%][TcpDiscoverySpi]
>>>>>>>>>>>>>>>>>>>> Client node considered as unreachable and will be dropped from cluster,
>>>>>>>>>>>>>>>>>>>> because no metrics update messages received in interval:
>>>>>>>>>>>>>>>>>>>> TcpDiscoverySpi.clientFailureDetectionTimeout() ms. It may be caused by
>>>>>>>>>>>>>>>>>>>> network problems or long GC pause on client node, try to increase this
>>>>>>>>>>>>>>>>>>>> parameter. [nodeId=302cff60-b88d-40da-9e12-b955e6bf973d,
>>>>>>>>>>>>>>>>>>>> clientFailureDetectionTimeout=30000]
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> [22:26:21,847][SEVERE][ttl-cleanup-worker-#48%xxxxxx%][G]
>>>>>>>>>>>>>>>>>>>> Blocked system-critical thread has been detected. This can lead to
>>>>>>>>>>>>>>>>>>>> cluster-wide undefined behaviour [threadName=partition-exchanger,
>>>>>>>>>>>>>>>>>>>> blockedFor=57s]
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> 22:26:21,954][SEVERE][ttl-cleanup-worker-#48%xxxxxx%][]
>>>>>>>>>>>>>>>>>>>> Critical system error detected. Will be handled accordingly to configured
>>>>>>>>>>>>>>>>>>>> handler [hnd=StopNodeOrHaltFailureHandler [tryStop=false, timeout=0,
>>>>>>>>>>>>>>>>>>>> super=AbstractFailureHandler
>>>>>>>>>>>>>>>>>>>> [ignoredFailureTypes=[SYSTEM_WORKER_BLOCKED]]], failureCtx=FailureContext
>>>>>>>>>>>>>>>>>>>> [type=SYSTEM_WORKER_BLOCKED, err=class o.a.i.IgniteException: GridWorker
>>>>>>>>>>>>>>>>>>>> [name=partition-exchanger, igniteInstanceName=xxxxxx, finished=false,
>>>>>>>>>>>>>>>>>>>> heartbeatTs=1568931981805]]]
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> -
>>>>>>>>>>>>>>>>>>>> Denis
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> On Thu, Oct 3, 2019 at 11:50 AM John Smith <
>>>>>>>>>>>>>>>>>>>> java.dev.mtl@gmail.com> wrote:
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> So I have been monitoring my node and the same one
>>>>>>>>>>>>>>>>>>>>> seems to stop once in a while.
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> https://www.dropbox.com/s/7n5qfsl5uyi1obt/ignite-logs.zip?dl=0
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> I have attached the GC logs and the ignite logs. From
>>>>>>>>>>>>>>>>>>>>> what I see from gc.logs I don't see big pauses. I could be wrong.
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> The machine is 16GB and I have the configs here:
>>>>>>>>>>>>>>>>>>>>> https://www.dropbox.com/s/hkv38s3vce5a4sk/ignite-config.xml?dl=0
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> Here are the JVM settings...
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> if [ -z "$JVM_OPTS" ] ; then
>>>>>>>>>>>>>>>>>>>>>     JVM_OPTS="-Xms2g -Xmx2g -server
>>>>>>>>>>>>>>>>>>>>> -XX:MaxMetaspaceSize=256m"
>>>>>>>>>>>>>>>>>>>>> fi
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> JVM_OPTS="$JVM_OPTS -XX:+UseG1GC -verbose:gc
>>>>>>>>>>>>>>>>>>>>> -XX:+PrintGCDetails -Xloggc:/var/log/apache-ignite/gc.log"
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> JVM_OPTS="${JVM_OPTS} -Xss16m"
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>

Re: Node stopped.

Posted by Ilya Kasnacheev <il...@gmail.com>.
Hello!

I think the node has determined that it needs to shut down:
[22:26:31,800][SEVERE][disco-event-worker-#42%xxxxxx%][FailureProcessor]
Ignite node is in invalid state due to a critical failure.

You can tune failure handler to avoid this behavior:
https://apacheignite.readme.io/docs/critical-failures-handling
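
As an illustration only (a sketch against the Ignite 2.x API, not a
recommendation for specific values), tuning the handler could look like this:

import java.util.Collections;
import org.apache.ignite.Ignition;
import org.apache.ignite.configuration.IgniteConfiguration;
import org.apache.ignite.failure.FailureType;
import org.apache.ignite.failure.StopNodeOrHaltFailureHandler;

public class FailureHandlerExample {
    public static void main(String[] args) {
        IgniteConfiguration cfg = new IgniteConfiguration();

        // The same handler that appears in the log above (with tryStop=false it
        // halts the process on a critical failure), but with SYSTEM_WORKER_BLOCKED
        // explicitly ignored so that a temporarily blocked system worker is only
        // reported, not treated as fatal.
        StopNodeOrHaltFailureHandler hnd = new StopNodeOrHaltFailureHandler(false, 0);
        hnd.setIgnoredFailureTypes(Collections.singleton(FailureType.SYSTEM_WORKER_BLOCKED));

        cfg.setFailureHandler(hnd);
        Ignition.start(cfg);
    }
}

Whether ignoring SYSTEM_WORKER_BLOCKED is safe depends on the workload; the
underlying pause still needs to be explained.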

WRT clientFailureDetectionTimeout, it should be set on server nodes.
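
A sketch of where those timeouts would go (the 60-second values are only an
example, mirroring the suggestion earlier in the thread):

import org.apache.ignite.Ignition;
import org.apache.ignite.configuration.IgniteConfiguration;

public class DetectionTimeoutExample {
    public static void main(String[] args) {
        IgniteConfiguration cfg = new IgniteConfiguration();

        // Set on SERVER nodes: how long a client may go without metrics updates
        // before it is dropped from the topology (the log above shows the
        // 30000 ms default).
        cfg.setClientFailureDetectionTimeout(60_000);

        // Server-to-server failure detection tolerance.
        cfg.setFailureDetectionTimeout(60_000);

        Ignition.start(cfg);
    }
}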

Regards,
-- 
Ilya Kasnacheev


вт, 29 окт. 2019 г. в 22:21, John Smith <ja...@gmail.com>:

> Hi, the GC logs where also provided and we determined, there was no GC
> pressure. At least what I understood from the thread above. We also enabled
> some extra thread info for GC and that showed nothing also.
>
> On the client side I also see this: Possible too long JVM pause: 25468
> milliseconds. My SERVER heap size is at -Xmx3g and I even tried -Xmx2g My
> client is at 1G
> 1- So the SERVER node is stopped because the client was misbehaving?
> 2- clientFailureDetectionTimeout should be set on all nodes including the
> client?
>
> On Tue, 29 Oct 2019 at 11:14, Ilya Kasnacheev <il...@gmail.com>
> wrote:
>
>> Hello!
>>
>> Well, this is the relevant part:
>> [22:26:21,816][WARNING][jvm-pause-detector-worker][IgniteKernal%xxxxxx] *Possible
>> too long JVM pause: 55705 milliseconds*.
>> [22:26:21,827][WARNING][tcp-disco-client-message-worker-#10%xxxxxx%][TcpDiscoverySpi]
>> Client node considered as unreachable and will be dropped from cluster,
>> because no metrics update messages received in interval:
>> TcpDiscoverySpi.clientFailureDetectionTimeout() ms. It may be caused by
>> network problems or long GC pause on client node, try to increase this
>> parameter. [nodeId=b07182d0-bf70-4318-9fe3-d7d5228bd6ef,
>> clientFailureDetectionTimeout=30000]
>> [22:26:21,839][WARNING][tcp-disco-client-message-worker-#12%xxxxxx%][TcpDiscoverySpi]
>> Client node considered as unreachable and will be dropped from cluster,
>> because no metrics update messages received in interval:
>> TcpDiscoverySpi.clientFailureDetectionTimeout() ms. It may be caused by
>> network problems or long GC pause on client node, try to increase this
>> parameter. [nodeId=302cff60-b88d-40da-9e12-b955e6bf973d,
>> clientFailureDetectionTimeout=30000]
>> [22:26:21,847][SEVERE][ttl-cleanup-worker-#48%xxxxxx%][G] Blocked
>> system-critical thread has been detected. This can lead to cluster-wide
>> undefined behaviour [threadName=partition-exchanger, blockedFor=57s]
>> [22:26:21,850][INFO][tcp-disco-sock-reader-#4%xxxxxx%][TcpDiscoverySpi]
>> Finished serving remote node connection [rmtAddr=/xxx.xxx.xxx.63:51827,
>> rmtPort=51827
>> [22:26:21,933][WARNING][ttl-cleanup-worker-#48%xxxxxx%][G] Thread
>> [name="exchange-worker-#43%xxxxxx%", id=89, state=TIMED_WAITING,
>> blockCnt=6, waitCnt=486312]
>>     Lock
>> [object=java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject@491d897d,
>> ownerName=null, ownerId=-1]
>>
>> Long GC will cause node to segment.
>>
>> Try either decreasing heap size / making sure full GC does not happen, or
>> increase failureDetectionTimeout (clientFailureDetectionTimeout if set) on
>> all nodes.
>>
>> Regards,
>> --
>> Ilya Kasnacheev
>>
>>
>> вт, 29 окт. 2019 г. в 18:02, John Smith <ja...@gmail.com>:
>>
>>> Hi, in my original post I had included full logs of the stopped node...
>>>
>>> And when you say set to localhost you which property do you mean is it
>>> the one IgniteConfiguration setLocalhost()?
>>>
>>> On Tue, 29 Oct 2019 at 03:48, Ilya Kasnacheev <il...@gmail.com>
>>> wrote:
>>>
>>>> Hello!
>>>>
>>>> I think it is possible (such as, node tries to connect to wrong address
>>>> / itself) but more detailed analysis of logs is needed.
>>>>
>>>> You can try specifying localHost property in IgniteConfiguration to
>>>> make sure correct address is used.
>>>>
>>>> Regards,
>>>> --
>>>> Ilya Kasnacheev
>>>>
>>>>
>>>> пн, 28 окт. 2019 г. в 23:16, John Smith <ja...@gmail.com>:
>>>>
>>>>> Ok cool. Thanks. I'm just trying to figure scenarios where it may
>>>>> happen.
>>>>>
>>>>> When I connect with ignite visor and run the "top" topology command.
>>>>> The client node is bound to multiple addresses about 12...
>>>>>
>>>>> Ignite is running on plain regular VMs while the client node is
>>>>> running in a containerized env but using host network (which is regular VM).
>>>>>
>>>>> So wondering if that would also screw it up something, by trying to
>>>>> connect to one of those multiple addresses?
>>>>>
>>>>> On Mon, 28 Oct 2019 at 12:08, Ilya Kasnacheev <
>>>>> ilya.kasnacheev@gmail.com> wrote:
>>>>>
>>>>>> Hello!
>>>>>>
>>>>>> Sure, if nodes can't reach each other, eventually they may segment
>>>>>> and stop.
>>>>>>
>>>>>> Regards,
>>>>>> --
>>>>>> Ilya Kasnacheev
>>>>>>
>>>>>>
>>>>>> пт, 25 окт. 2019 г. в 00:08, John Smith <ja...@gmail.com>:
>>>>>>
>>>>>>> Is it possible this is somehow causing the issue of the node
>>>>>>> stopping?
>>>>>>>
>>>>>>> On Thu, 24 Oct 2019 at 11:24, Ilya Kasnacheev <
>>>>>>> ilya.kasnacheev@gmail.com> wrote:
>>>>>>>
>>>>>>>> Hello!
>>>>>>>>
>>>>>>>> This likely means that you have reachability problems in your
>>>>>>>> cluster, such as, xxx.xxx.xxx.68 can connect to xxx.xxx.xxx.82 (on
>>>>>>>> range 47100-47200) but not the other way around.
>>>>>>>>
>>>>>>>> Regards,
>>>>>>>> --
>>>>>>>> Ilya Kasnacheev
>>>>>>>>
>>>>>>>>
>>>>>>>> пн, 21 окт. 2019 г. в 19:36, John Smith <ja...@gmail.com>:
>>>>>>>>
>>>>>>>>> I also see this printing every few seconds on my client
>>>>>>>>> application...
>>>>>>>>> org.apache.ignite.spi.communication.tcp.TcpCommunicationSpi Accepted
>>>>>>>>> incoming communication connection [locAddr=/xxx.xxx.xxx.68:47101,
>>>>>>>>> rmtAddr=/xxx.xxx.xxx.82:49816
>>>>>>>>>
>>>>>>>>> On Mon, 21 Oct 2019 at 12:04, John Smith <ja...@gmail.com>
>>>>>>>>> wrote:
>>>>>>>>>
>>>>>>>>>> Hi, thanks. I already made sure that each ignite VM runs on a
>>>>>>>>>> separate host within our cloud. I'm not doing any of that migration stuff.
>>>>>>>>>>
>>>>>>>>>> Also recently disabled metrics igniteConfig
>>>>>>>>>> .setMetricsLogFrequency(0); Just to make sure it doesn't get too
>>>>>>>>>> chatty. But i doubt this would affect it...
>>>>>>>>>>
>>>>>>>>>> Should I maybe set the node timeout a bit higher then 30 seconds
>>>>>>>>>> maybe put it to 60 seconds? I remember people suggesting this but not
>>>>>>>>>> sure...
>>>>>>>>>>
>>>>>>>>>> On Fri, 18 Oct 2019 at 06:09, Denis Mekhanikov <
>>>>>>>>>> dmekhanikov@gmail.com> wrote:
>>>>>>>>>>
>>>>>>>>>>> The following documentation page has some useful points on
>>>>>>>>>>> deployment in a virtualised environment:
>>>>>>>>>>> https://apacheignite.readme.io/docs/vmware-deployment
>>>>>>>>>>>
>>>>>>>>>>> Denis
>>>>>>>>>>> On 17 Oct 2019, 17:41 +0300, John Smith <ja...@gmail.com>,
>>>>>>>>>>> wrote:
>>>>>>>>>>>
>>>>>>>>>>> Ok I have metribeat running on the VM hopefully I will see
>>>>>>>>>>> something...
>>>>>>>>>>>
>>>>>>>>>>> On Thu, 17 Oct 2019 at 05:09, Denis Mekhanikov <
>>>>>>>>>>> dmekhanikov@gmail.com> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> There are no long pauses in the GC logs, so it must be the
>>>>>>>>>>>> whole VM pause.
>>>>>>>>>>>>
>>>>>>>>>>>> Denis
>>>>>>>>>>>> On 16 Oct 2019, 23:07 +0300, John Smith <ja...@gmail.com>,
>>>>>>>>>>>> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>> Sorry here is the gc logs for all 3 machines:
>>>>>>>>>>>> https://www.dropbox.com/s/chbbxigahd4v9di/gc-logs.zip?dl=0
>>>>>>>>>>>>
>>>>>>>>>>>> On Wed, 16 Oct 2019 at 15:49, John Smith <
>>>>>>>>>>>> java.dev.mtl@gmail.com> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> Hi, so it happened again here is my latest gc.log stats:
>>>>>>>>>>>>> https://gceasy.io/diamondgc-report.jsp?oTxnId_value=a215d573-d1cf-4d53-acf1-9001432bb28e
>>>>>>>>>>>>>
>>>>>>>>>>>>> Everything seems ok to me. I also have Elasticsearch
>>>>>>>>>>>>> Metricbeat running, the CPU usage looked normal at the time.
>>>>>>>>>>>>>
>>>>>>>>>>>>> On Thu, 10 Oct 2019 at 13:05, Denis Mekhanikov <
>>>>>>>>>>>>> dmekhanikov@gmail.com> wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>>> Unfortunately, I don’t.
>>>>>>>>>>>>>> You can ask the VM vendor or the cloud provider (if you use
>>>>>>>>>>>>>> any) for a proper tooling or logs.
>>>>>>>>>>>>>> Make sure, that there is no such step in the VM’s lifecycle
>>>>>>>>>>>>>> that makes it freeze for a minute.
>>>>>>>>>>>>>> Also make sure that the physical CPU is not overutilized and
>>>>>>>>>>>>>> no VMs that run on it are starving.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Denis
>>>>>>>>>>>>>> On 10 Oct 2019, 19:03 +0300, John Smith <
>>>>>>>>>>>>>> java.dev.mtl@gmail.com>, wrote:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Do you know of any good tools I can use to check the VM?
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> On Thu, 10 Oct 2019 at 11:38, Denis Mekhanikov <
>>>>>>>>>>>>>> dmekhanikov@gmail.com> wrote:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> > Hi Dennis, so are you saying I should enable GC logs + the
>>>>>>>>>>>>>>> safe point logs as well?
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Having safepoint statistics in your GC logs may be useful,
>>>>>>>>>>>>>>> so I recommend enabling them for troubleshooting purposes.
>>>>>>>>>>>>>>> Check the lifecycle of your virtual machines. There is a
>>>>>>>>>>>>>>> high chance that the whole machine is frozen, not just the Ignite node.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Denis
>>>>>>>>>>>>>>> On 10 Oct 2019, 18:25 +0300, John Smith <
>>>>>>>>>>>>>>> java.dev.mtl@gmail.com>, wrote:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Hi Dennis, so are you saying I should enable GC logs + the
>>>>>>>>>>>>>>> safe point logs as well?
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> On Thu, 10 Oct 2019 at 11:22, John Smith <
>>>>>>>>>>>>>>> java.dev.mtl@gmail.com> wrote:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> You are correct, it is running in a VM.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> On Thu, 10 Oct 2019 at 10:11, Denis Mekhanikov <
>>>>>>>>>>>>>>>> dmekhanikov@gmail.com> wrote:
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Hi!
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> There are the following messages in the logs:
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> [22:26:21,816][WARNING][jvm-pause-detector-worker][IgniteKernal%xxxxxx]
>>>>>>>>>>>>>>>>> Possible too long JVM pause: *55705 milliseconds*.
>>>>>>>>>>>>>>>>> ...
>>>>>>>>>>>>>>>>> [22:26:21,847][SEVERE][ttl-cleanup-worker-#48%xxxxxx%][G]
>>>>>>>>>>>>>>>>> Blocked system-critical thread has been detected. This can lead to
>>>>>>>>>>>>>>>>> cluster-wide undefined behaviour [threadName=partition-exchanger,
>>>>>>>>>>>>>>>>> blockedFor=*57s*]
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Looks like the JVM was paused for almost a minute. It
>>>>>>>>>>>>>>>>> doesn’t seem to be caused by a garbage collection, since there is no
>>>>>>>>>>>>>>>>> evidence of GC pressure in the GC log. Usually such big pauses happen in
>>>>>>>>>>>>>>>>> virtualised environments when backups are captured from machines or they
>>>>>>>>>>>>>>>>> just don’t have enough CPU time.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Looking at safepoint statistics may also reveal some
>>>>>>>>>>>>>>>>> interesting details. You can learn about safepoints here:
>>>>>>>>>>>>>>>>> https://blog.gceasy.io/2016/12/22/total-time-for-which-application-threads-were-stopped/
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Denis
>>>>>>>>>>>>>>>>> On 9 Oct 2019, 23:14 +0300, John Smith <
>>>>>>>>>>>>>>>>> java.dev.mtl@gmail.com>, wrote:
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> So the error sais to set clientFailureDetectionTimeout=
>>>>>>>>>>>>>>>>> 30000
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> 1- Do I put a higher value than 30000?
>>>>>>>>>>>>>>>>> 2- Do I do it on the client or the server nodes or all
>>>>>>>>>>>>>>>>> nodes?
>>>>>>>>>>>>>>>>> 3- Also if a client is misbehaving why shutoff the server
>>>>>>>>>>>>>>>>> node?
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> On Thu, 3 Oct 2019 at 21:02, John Smith <
>>>>>>>>>>>>>>>>> java.dev.mtl@gmail.com> wrote:
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> But if it's the client node that's failing why is the
>>>>>>>>>>>>>>>>>> server node stopping? I'm pretty sure we do verry simple put and get
>>>>>>>>>>>>>>>>>> operations. All the client nodes are started as client=true
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> On Thu., Oct. 3, 2019, 4:18 p.m. Denis Magda, <
>>>>>>>>>>>>>>>>>> dmagda@apache.org> wrote:
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> Hi John,
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> I don't see any GC pressure or STW pauses either. If not
>>>>>>>>>>>>>>>>>>> GC then it might have been caused by a network glitch or some long-running
>>>>>>>>>>>>>>>>>>> operation started by the app. These logs statement
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> [22:26:21,827][WARNING][tcp-disco-client-message-worker-#10%xxxxxx%][TcpDiscoverySpi]
>>>>>>>>>>>>>>>>>>> Client node considered as unreachable and will be dropped from cluster,
>>>>>>>>>>>>>>>>>>> because no metrics update messages received in interval:
>>>>>>>>>>>>>>>>>>> TcpDiscoverySpi.clientFailureDetectionTimeout() ms. It may be caused by
>>>>>>>>>>>>>>>>>>> network problems or long GC pause on client node, try to increase this
>>>>>>>>>>>>>>>>>>> parameter. [nodeId=b07182d0-bf70-4318-9fe3-d7d5228bd6ef,
>>>>>>>>>>>>>>>>>>> clientFailureDetectionTimeout=30000]
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> [22:26:21,839][WARNING][tcp-disco-client-message-worker-#12%xxxxxx%][TcpDiscoverySpi]
>>>>>>>>>>>>>>>>>>> Client node considered as unreachable and will be dropped from cluster,
>>>>>>>>>>>>>>>>>>> because no metrics update messages received in interval:
>>>>>>>>>>>>>>>>>>> TcpDiscoverySpi.clientFailureDetectionTimeout() ms. It may be caused by
>>>>>>>>>>>>>>>>>>> network problems or long GC pause on client node, try to increase this
>>>>>>>>>>>>>>>>>>> parameter. [nodeId=302cff60-b88d-40da-9e12-b955e6bf973d,
>>>>>>>>>>>>>>>>>>> clientFailureDetectionTimeout=30000]
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> [22:26:21,847][SEVERE][ttl-cleanup-worker-#48%xxxxxx%][G]
>>>>>>>>>>>>>>>>>>> Blocked system-critical thread has been detected. This can lead to
>>>>>>>>>>>>>>>>>>> cluster-wide undefined behaviour [threadName=partition-exchanger,
>>>>>>>>>>>>>>>>>>> blockedFor=57s]
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> 22:26:21,954][SEVERE][ttl-cleanup-worker-#48%xxxxxx%][]
>>>>>>>>>>>>>>>>>>> Critical system error detected. Will be handled accordingly to configured
>>>>>>>>>>>>>>>>>>> handler [hnd=StopNodeOrHaltFailureHandler [tryStop=false, timeout=0,
>>>>>>>>>>>>>>>>>>> super=AbstractFailureHandler
>>>>>>>>>>>>>>>>>>> [ignoredFailureTypes=[SYSTEM_WORKER_BLOCKED]]], failureCtx=FailureContext
>>>>>>>>>>>>>>>>>>> [type=SYSTEM_WORKER_BLOCKED, err=class o.a.i.IgniteException: GridWorker
>>>>>>>>>>>>>>>>>>> [name=partition-exchanger, igniteInstanceName=xxxxxx, finished=false,
>>>>>>>>>>>>>>>>>>> heartbeatTs=1568931981805]]]
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> -
>>>>>>>>>>>>>>>>>>> Denis
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> On Thu, Oct 3, 2019 at 11:50 AM John Smith <
>>>>>>>>>>>>>>>>>>> java.dev.mtl@gmail.com> wrote:
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> So I have been monitoring my node and the same one
>>>>>>>>>>>>>>>>>>>> seems to stop once in a while.
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> https://www.dropbox.com/s/7n5qfsl5uyi1obt/ignite-logs.zip?dl=0
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> I have attached the GC logs and the ignite logs. From
>>>>>>>>>>>>>>>>>>>> what I see from gc.logs I don't see big pauses. I could be wrong.
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> The machine is 16GB and I have the configs here:
>>>>>>>>>>>>>>>>>>>> https://www.dropbox.com/s/hkv38s3vce5a4sk/ignite-config.xml?dl=0
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> Here are the JVM settings...
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> if [ -z "$JVM_OPTS" ] ; then
>>>>>>>>>>>>>>>>>>>>     JVM_OPTS="-Xms2g -Xmx2g -server
>>>>>>>>>>>>>>>>>>>> -XX:MaxMetaspaceSize=256m"
>>>>>>>>>>>>>>>>>>>> fi
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> JVM_OPTS="$JVM_OPTS -XX:+UseG1GC -verbose:gc
>>>>>>>>>>>>>>>>>>>> -XX:+PrintGCDetails -Xloggc:/var/log/apache-ignite/gc.log"
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> JVM_OPTS="${JVM_OPTS} -Xss16m"
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>

Re: Node stopped.

Posted by John Smith <ja...@gmail.com>.
Hi, the GC logs were also provided and we determined there was no GC
pressure, at least from what I understood from the thread above. We also
enabled some extra thread info for GC and that showed nothing either.

On the client side I also see this: Possible too long JVM pause: 25468
milliseconds. My SERVER heap size is at -Xmx3g and I even tried -Xmx2g. My
client is at 1G.
1- So the SERVER node is stopped because the client was misbehaving?
2- clientFailureDetectionTimeout should be set on all nodes including the
client?

On Tue, 29 Oct 2019 at 11:14, Ilya Kasnacheev <il...@gmail.com>
wrote:

> Hello!
>
> Well, this is the relevant part:
> [22:26:21,816][WARNING][jvm-pause-detector-worker][IgniteKernal%xxxxxx] *Possible
> too long JVM pause: 55705 milliseconds*.
> [22:26:21,827][WARNING][tcp-disco-client-message-worker-#10%xxxxxx%][TcpDiscoverySpi]
> Client node considered as unreachable and will be dropped from cluster,
> because no metrics update messages received in interval:
> TcpDiscoverySpi.clientFailureDetectionTimeout() ms. It may be caused by
> network problems or long GC pause on client node, try to increase this
> parameter. [nodeId=b07182d0-bf70-4318-9fe3-d7d5228bd6ef,
> clientFailureDetectionTimeout=30000]
> [22:26:21,839][WARNING][tcp-disco-client-message-worker-#12%xxxxxx%][TcpDiscoverySpi]
> Client node considered as unreachable and will be dropped from cluster,
> because no metrics update messages received in interval:
> TcpDiscoverySpi.clientFailureDetectionTimeout() ms. It may be caused by
> network problems or long GC pause on client node, try to increase this
> parameter. [nodeId=302cff60-b88d-40da-9e12-b955e6bf973d,
> clientFailureDetectionTimeout=30000]
> [22:26:21,847][SEVERE][ttl-cleanup-worker-#48%xxxxxx%][G] Blocked
> system-critical thread has been detected. This can lead to cluster-wide
> undefined behaviour [threadName=partition-exchanger, blockedFor=57s]
> [22:26:21,850][INFO][tcp-disco-sock-reader-#4%xxxxxx%][TcpDiscoverySpi]
> Finished serving remote node connection [rmtAddr=/xxx.xxx.xxx.63:51827,
> rmtPort=51827
> [22:26:21,933][WARNING][ttl-cleanup-worker-#48%xxxxxx%][G] Thread
> [name="exchange-worker-#43%xxxxxx%", id=89, state=TIMED_WAITING,
> blockCnt=6, waitCnt=486312]
>     Lock
> [object=java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject@491d897d,
> ownerName=null, ownerId=-1]
>
> Long GC will cause node to segment.
>
> Try either decreasing heap size / making sure full GC does not happen, or
> increase failureDetectionTimeout (clientFailureDetectionTimeout if set) on
> all nodes.
>
> Regards,
> --
> Ilya Kasnacheev
>
>
> вт, 29 окт. 2019 г. в 18:02, John Smith <ja...@gmail.com>:
>
>> Hi, in my original post I had included full logs of the stopped node...
>>
>> And when you say set to localhost you which property do you mean is it
>> the one IgniteConfiguration setLocalhost()?
>>
>> On Tue, 29 Oct 2019 at 03:48, Ilya Kasnacheev <il...@gmail.com>
>> wrote:
>>
>>> Hello!
>>>
>>> I think it is possible (such as, node tries to connect to wrong address
>>> / itself) but more detailed analysis of logs is needed.
>>>
>>> You can try specifying localHost property in IgniteConfiguration to make
>>> sure correct address is used.
>>>
>>> Regards,
>>> --
>>> Ilya Kasnacheev
>>>
>>>
>>> пн, 28 окт. 2019 г. в 23:16, John Smith <ja...@gmail.com>:
>>>
>>>> Ok cool. Thanks. I'm just trying to figure scenarios where it may
>>>> happen.
>>>>
>>>> When I connect with ignite visor and run the "top" topology command.
>>>> The client node is bound to multiple addresses about 12...
>>>>
>>>> Ignite is running on plain regular VMs while the client node is running
>>>> in a containerized env but using host network (which is regular VM).
>>>>
>>>> So wondering if that would also screw it up something, by trying to
>>>> connect to one of those multiple addresses?
>>>>
>>>> On Mon, 28 Oct 2019 at 12:08, Ilya Kasnacheev <
>>>> ilya.kasnacheev@gmail.com> wrote:
>>>>
>>>>> Hello!
>>>>>
>>>>> Sure, if nodes can't reach each other, eventually they may segment and
>>>>> stop.
>>>>>
>>>>> Regards,
>>>>> --
>>>>> Ilya Kasnacheev
>>>>>
>>>>>
>>>>> пт, 25 окт. 2019 г. в 00:08, John Smith <ja...@gmail.com>:
>>>>>
>>>>>> Is it possible this is somehow causing the issue of the node stopping?
>>>>>>
>>>>>> On Thu, 24 Oct 2019 at 11:24, Ilya Kasnacheev <
>>>>>> ilya.kasnacheev@gmail.com> wrote:
>>>>>>
>>>>>>> Hello!
>>>>>>>
>>>>>>> This likely means that you have reachability problems in your
>>>>>>> cluster, such as, xxx.xxx.xxx.68 can connect to xxx.xxx.xxx.82 (on
>>>>>>> range 47100-47200) but not the other way around.
>>>>>>>
>>>>>>> Regards,
>>>>>>> --
>>>>>>> Ilya Kasnacheev
>>>>>>>
>>>>>>>
>>>>>>> пн, 21 окт. 2019 г. в 19:36, John Smith <ja...@gmail.com>:
>>>>>>>
>>>>>>>> I also see this printing every few seconds on my client
>>>>>>>> application...
>>>>>>>> org.apache.ignite.spi.communication.tcp.TcpCommunicationSpi Accepted
>>>>>>>> incoming communication connection [locAddr=/xxx.xxx.xxx.68:47101,
>>>>>>>> rmtAddr=/xxx.xxx.xxx.82:49816
>>>>>>>>
>>>>>>>> On Mon, 21 Oct 2019 at 12:04, John Smith <ja...@gmail.com>
>>>>>>>> wrote:
>>>>>>>>
>>>>>>>>> Hi, thanks. I already made sure that each ignite VM runs on a
>>>>>>>>> separate host within our cloud. I'm not doing any of that migration stuff.
>>>>>>>>>
>>>>>>>>> Also recently disabled metrics igniteConfig
>>>>>>>>> .setMetricsLogFrequency(0); just to make sure it doesn't get too
>>>>>>>>> chatty. But I doubt this would affect it...
>>>>>>>>>
>>>>>>>>> Should I maybe set the node timeout a bit higher than 30 seconds,
>>>>>>>>> maybe put it to 60 seconds? I remember people suggesting this but not
>>>>>>>>> sure...
>>>>>>>>>
>>>>>>>>> On Fri, 18 Oct 2019 at 06:09, Denis Mekhanikov <
>>>>>>>>> dmekhanikov@gmail.com> wrote:
>>>>>>>>>
>>>>>>>>>> The following documentation page has some useful points on
>>>>>>>>>> deployment in a virtualised environment:
>>>>>>>>>> https://apacheignite.readme.io/docs/vmware-deployment
>>>>>>>>>>
>>>>>>>>>> Denis
>>>>>>>>>> On 17 Oct 2019, 17:41 +0300, John Smith <ja...@gmail.com>,
>>>>>>>>>> wrote:
>>>>>>>>>>
>>>>>>>>>> Ok I have Metricbeat running on the VM; hopefully I will see
>>>>>>>>>> something...
>>>>>>>>>>
>>>>>>>>>> On Thu, 17 Oct 2019 at 05:09, Denis Mekhanikov <
>>>>>>>>>> dmekhanikov@gmail.com> wrote:
>>>>>>>>>>
>>>>>>>>>>> There are no long pauses in the GC logs, so it must be the whole
>>>>>>>>>>> VM pause.
>>>>>>>>>>>
>>>>>>>>>>> Denis
>>>>>>>>>>> On 16 Oct 2019, 23:07 +0300, John Smith <ja...@gmail.com>,
>>>>>>>>>>> wrote:
>>>>>>>>>>>
>>>>>>>>>>> Sorry here is the gc logs for all 3 machines:
>>>>>>>>>>> https://www.dropbox.com/s/chbbxigahd4v9di/gc-logs.zip?dl=0
>>>>>>>>>>>
>>>>>>>>>>> On Wed, 16 Oct 2019 at 15:49, John Smith <ja...@gmail.com>
>>>>>>>>>>> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> Hi, so it happened again here is my latest gc.log stats:
>>>>>>>>>>>> https://gceasy.io/diamondgc-report.jsp?oTxnId_value=a215d573-d1cf-4d53-acf1-9001432bb28e
>>>>>>>>>>>>
>>>>>>>>>>>> Everything seems ok to me. I also have Elasticsearch Metricbeat
>>>>>>>>>>>> running, the CPU usage looked normal at the time.
>>>>>>>>>>>>
>>>>>>>>>>>> On Thu, 10 Oct 2019 at 13:05, Denis Mekhanikov <
>>>>>>>>>>>> dmekhanikov@gmail.com> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> Unfortunately, I don’t.
>>>>>>>>>>>>> You can ask the VM vendor or the cloud provider (if you use
>>>>>>>>>>>>> any) for a proper tooling or logs.
>>>>>>>>>>>>> Make sure, that there is no such step in the VM’s lifecycle
>>>>>>>>>>>>> that makes it freeze for a minute.
>>>>>>>>>>>>> Also make sure that the physical CPU is not overutilized and
>>>>>>>>>>>>> no VMs that run on it are starving.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Denis
>>>>>>>>>>>>> On 10 Oct 2019, 19:03 +0300, John Smith <
>>>>>>>>>>>>> java.dev.mtl@gmail.com>, wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>> Do you know of any good tools I can use to check the VM?
>>>>>>>>>>>>>
>>>>>>>>>>>>> On Thu, 10 Oct 2019 at 11:38, Denis Mekhanikov <
>>>>>>>>>>>>> dmekhanikov@gmail.com> wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>>> > Hi Dennis, so are you saying I should enable GC logs + the
>>>>>>>>>>>>>> safe point logs as well?
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Having safepoint statistics in your GC logs may be useful, so
>>>>>>>>>>>>>> I recommend enabling them for troubleshooting purposes.
>>>>>>>>>>>>>> Check the lifecycle of your virtual machines. There is a high
>>>>>>>>>>>>>> chance that the whole machine is frozen, not just the Ignite node.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Denis
>>>>>>>>>>>>>> On 10 Oct 2019, 18:25 +0300, John Smith <
>>>>>>>>>>>>>> java.dev.mtl@gmail.com>, wrote:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Hi Dennis, so are you saying I should enable GC logs + the
>>>>>>>>>>>>>> safe point logs as well?
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> On Thu, 10 Oct 2019 at 11:22, John Smith <
>>>>>>>>>>>>>> java.dev.mtl@gmail.com> wrote:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> You are correct, it is running in a VM.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> On Thu, 10 Oct 2019 at 10:11, Denis Mekhanikov <
>>>>>>>>>>>>>>> dmekhanikov@gmail.com> wrote:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Hi!
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> There are the following messages in the logs:
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> [22:26:21,816][WARNING][jvm-pause-detector-worker][IgniteKernal%xxxxxx]
>>>>>>>>>>>>>>>> Possible too long JVM pause: *55705 milliseconds*.
>>>>>>>>>>>>>>>> ...
>>>>>>>>>>>>>>>> [22:26:21,847][SEVERE][ttl-cleanup-worker-#48%xxxxxx%][G]
>>>>>>>>>>>>>>>> Blocked system-critical thread has been detected. This can lead to
>>>>>>>>>>>>>>>> cluster-wide undefined behaviour [threadName=partition-exchanger,
>>>>>>>>>>>>>>>> blockedFor=*57s*]
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Looks like the JVM was paused for almost a minute. It
>>>>>>>>>>>>>>>> doesn’t seem to be caused by a garbage collection, since there is no
>>>>>>>>>>>>>>>> evidence of GC pressure in the GC log. Usually such big pauses happen in
>>>>>>>>>>>>>>>> virtualised environments when backups are captured from machines or they
>>>>>>>>>>>>>>>> just don’t have enough CPU time.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Looking at safepoint statistics may also reveal some
>>>>>>>>>>>>>>>> interesting details. You can learn about safepoints here:
>>>>>>>>>>>>>>>> https://blog.gceasy.io/2016/12/22/total-time-for-which-application-threads-were-stopped/
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Denis
>>>>>>>>>>>>>>>> On 9 Oct 2019, 23:14 +0300, John Smith <
>>>>>>>>>>>>>>>> java.dev.mtl@gmail.com>, wrote:
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> So the error sais to set clientFailureDetectionTimeout=
>>>>>>>>>>>>>>>> 30000
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> 1- Do I put a higher value than 30000?
>>>>>>>>>>>>>>>> 2- Do I do it on the client or the server nodes or all
>>>>>>>>>>>>>>>> nodes?
>>>>>>>>>>>>>>>> 3- Also if a client is misbehaving why shutoff the server
>>>>>>>>>>>>>>>> node?
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> On Thu, 3 Oct 2019 at 21:02, John Smith <
>>>>>>>>>>>>>>>> java.dev.mtl@gmail.com> wrote:
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> But if it's the client node that's failing why is the
>>>>>>>>>>>>>>>>> server node stopping? I'm pretty sure we do verry simple put and get
>>>>>>>>>>>>>>>>> operations. All the client nodes are started as client=true
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> On Thu., Oct. 3, 2019, 4:18 p.m. Denis Magda, <
>>>>>>>>>>>>>>>>> dmagda@apache.org> wrote:
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> Hi John,
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> I don't see any GC pressure or STW pauses either. If not
>>>>>>>>>>>>>>>>>> GC then it might have been caused by a network glitch or some long-running
>>>>>>>>>>>>>>>>>> operation started by the app. These logs statement
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> [22:26:21,827][WARNING][tcp-disco-client-message-worker-#10%xxxxxx%][TcpDiscoverySpi]
>>>>>>>>>>>>>>>>>> Client node considered as unreachable and will be dropped from cluster,
>>>>>>>>>>>>>>>>>> because no metrics update messages received in interval:
>>>>>>>>>>>>>>>>>> TcpDiscoverySpi.clientFailureDetectionTimeout() ms. It may be caused by
>>>>>>>>>>>>>>>>>> network problems or long GC pause on client node, try to increase this
>>>>>>>>>>>>>>>>>> parameter. [nodeId=b07182d0-bf70-4318-9fe3-d7d5228bd6ef,
>>>>>>>>>>>>>>>>>> clientFailureDetectionTimeout=30000]
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> [22:26:21,839][WARNING][tcp-disco-client-message-worker-#12%xxxxxx%][TcpDiscoverySpi]
>>>>>>>>>>>>>>>>>> Client node considered as unreachable and will be dropped from cluster,
>>>>>>>>>>>>>>>>>> because no metrics update messages received in interval:
>>>>>>>>>>>>>>>>>> TcpDiscoverySpi.clientFailureDetectionTimeout() ms. It may be caused by
>>>>>>>>>>>>>>>>>> network problems or long GC pause on client node, try to increase this
>>>>>>>>>>>>>>>>>> parameter. [nodeId=302cff60-b88d-40da-9e12-b955e6bf973d,
>>>>>>>>>>>>>>>>>> clientFailureDetectionTimeout=30000]
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> [22:26:21,847][SEVERE][ttl-cleanup-worker-#48%xxxxxx%][G]
>>>>>>>>>>>>>>>>>> Blocked system-critical thread has been detected. This can lead to
>>>>>>>>>>>>>>>>>> cluster-wide undefined behaviour [threadName=partition-exchanger,
>>>>>>>>>>>>>>>>>> blockedFor=57s]
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> 22:26:21,954][SEVERE][ttl-cleanup-worker-#48%xxxxxx%][]
>>>>>>>>>>>>>>>>>> Critical system error detected. Will be handled accordingly to configured
>>>>>>>>>>>>>>>>>> handler [hnd=StopNodeOrHaltFailureHandler [tryStop=false, timeout=0,
>>>>>>>>>>>>>>>>>> super=AbstractFailureHandler
>>>>>>>>>>>>>>>>>> [ignoredFailureTypes=[SYSTEM_WORKER_BLOCKED]]], failureCtx=FailureContext
>>>>>>>>>>>>>>>>>> [type=SYSTEM_WORKER_BLOCKED, err=class o.a.i.IgniteException: GridWorker
>>>>>>>>>>>>>>>>>> [name=partition-exchanger, igniteInstanceName=xxxxxx, finished=false,
>>>>>>>>>>>>>>>>>> heartbeatTs=1568931981805]]]
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> -
>>>>>>>>>>>>>>>>>> Denis
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> On Thu, Oct 3, 2019 at 11:50 AM John Smith <
>>>>>>>>>>>>>>>>>> java.dev.mtl@gmail.com> wrote:
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> So I have been monitoring my node and the same one seems
>>>>>>>>>>>>>>>>>>> to stop once in a while.
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> https://www.dropbox.com/s/7n5qfsl5uyi1obt/ignite-logs.zip?dl=0
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> I have attached the GC logs and the ignite logs. From
>>>>>>>>>>>>>>>>>>> what I see from gc.logs I don't see big pauses. I could be wrong.
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> The machine is 16GB and I have the configs here:
>>>>>>>>>>>>>>>>>>> https://www.dropbox.com/s/hkv38s3vce5a4sk/ignite-config.xml?dl=0
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> Here are the JVM settings...
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> if [ -z "$JVM_OPTS" ] ; then
>>>>>>>>>>>>>>>>>>>     JVM_OPTS="-Xms2g -Xmx2g -server
>>>>>>>>>>>>>>>>>>> -XX:MaxMetaspaceSize=256m"
>>>>>>>>>>>>>>>>>>> fi
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> JVM_OPTS="$JVM_OPTS -XX:+UseG1GC -verbose:gc
>>>>>>>>>>>>>>>>>>> -XX:+PrintGCDetails -Xloggc:/var/log/apache-ignite/gc.log"
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> JVM_OPTS="${JVM_OPTS} -Xss16m"
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>

Re: Node stopped.

Posted by Ilya Kasnacheev <il...@gmail.com>.
Hello!

Well, this is the relevant part:
[22:26:21,816][WARNING][jvm-pause-detector-worker][IgniteKernal%xxxxxx]
*Possible
too long JVM pause: 55705 milliseconds*.
[22:26:21,827][WARNING][tcp-disco-client-message-worker-#10%xxxxxx%][TcpDiscoverySpi]
Client node considered as unreachable and will be dropped from cluster,
because no metrics update messages received in interval:
TcpDiscoverySpi.clientFailureDetectionTimeout() ms. It may be caused by
network problems or long GC pause on client node, try to increase this
parameter. [nodeId=b07182d0-bf70-4318-9fe3-d7d5228bd6ef,
clientFailureDetectionTimeout=30000]
[22:26:21,839][WARNING][tcp-disco-client-message-worker-#12%xxxxxx%][TcpDiscoverySpi]
Client node considered as unreachable and will be dropped from cluster,
because no metrics update messages received in interval:
TcpDiscoverySpi.clientFailureDetectionTimeout() ms. It may be caused by
network problems or long GC pause on client node, try to increase this
parameter. [nodeId=302cff60-b88d-40da-9e12-b955e6bf973d,
clientFailureDetectionTimeout=30000]
[22:26:21,847][SEVERE][ttl-cleanup-worker-#48%xxxxxx%][G] Blocked
system-critical thread has been detected. This can lead to cluster-wide
undefined behaviour [threadName=partition-exchanger, blockedFor=57s]
[22:26:21,850][INFO][tcp-disco-sock-reader-#4%xxxxxx%][TcpDiscoverySpi]
Finished serving remote node connection [rmtAddr=/xxx.xxx.xxx.63:51827,
rmtPort=51827
[22:26:21,933][WARNING][ttl-cleanup-worker-#48%xxxxxx%][G] Thread
[name="exchange-worker-#43%xxxxxx%", id=89, state=TIMED_WAITING,
blockCnt=6, waitCnt=486312]
    Lock
[object=java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject@491d897d,
ownerName=null, ownerId=-1]

A long GC pause will cause the node to be segmented.

Try either decreasing the heap size / making sure full GC does not happen, or
increasing failureDetectionTimeout (clientFailureDetectionTimeout if set) on
all nodes.
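
For reference, here is a minimal sketch of what that could look like with the
programmatic Java configuration (assuming Ignite 2.x; the class name and the
60-second values are only placeholders, tune them to your environment):

import org.apache.ignite.Ignite;
import org.apache.ignite.Ignition;
import org.apache.ignite.configuration.IgniteConfiguration;

public class TimeoutConfig {
    public static void main(String[] args) {
        IgniteConfiguration cfg = new IgniteConfiguration();

        // Failure detection between server nodes (default is 10000 ms).
        cfg.setFailureDetectionTimeout(60_000);

        // Used by server nodes to drop unresponsive clients
        // (default is 30000 ms, the value shown in the warnings above).
        cfg.setClientFailureDetectionTimeout(60_000);

        Ignite ignite = Ignition.start(cfg);
    }
}

The same two properties can also be set as bean properties in the XML
configuration.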

Regards,
-- 
Ilya Kasnacheev


Tue, 29 Oct 2019 at 18:02, John Smith <ja...@gmail.com>:

> Hi, in my original post I had included full logs of the stopped node...
>
> And when you say to set localhost, which property do you mean? Is it
> IgniteConfiguration.setLocalHost()?
>
> On Tue, 29 Oct 2019 at 03:48, Ilya Kasnacheev <il...@gmail.com>
> wrote:
>
>> Hello!
>>
>> I think it is possible (for example, a node tries to connect to the wrong
>> address or to itself), but a more detailed analysis of the logs is needed.
>>
>> You can try specifying localHost property in IgniteConfiguration to make
>> sure correct address is used.
>>
>> Regards,
>> --
>> Ilya Kasnacheev
>>
>>
>> пн, 28 окт. 2019 г. в 23:16, John Smith <ja...@gmail.com>:
>>
>>> Ok cool. Thanks. I'm just trying to figure out scenarios where it may happen.
>>>
>>> When I connect with Ignite Visor and run the "top" topology command, the
>>> client node is bound to multiple addresses, about 12 of them...
>>>
>>> Ignite is running on plain regular VMs while the client node is running
>>> in a containerized env but using host network (which is regular VM).
>>>
>>> So I'm wondering if that could also mess something up, by trying to
>>> connect to one of those multiple addresses?
>>>
>>> On Mon, 28 Oct 2019 at 12:08, Ilya Kasnacheev <il...@gmail.com>
>>> wrote:
>>>
>>>> Hello!
>>>>
>>>> Sure, if nodes can't reach each other, eventually they may segment and
>>>> stop.
>>>>
>>>> Regards,
>>>> --
>>>> Ilya Kasnacheev
>>>>
>>>>
>>>> пт, 25 окт. 2019 г. в 00:08, John Smith <ja...@gmail.com>:
>>>>
>>>>> Is it possible this is somehow causing the issue of the node stopping?
>>>>>
>>>>> On Thu, 24 Oct 2019 at 11:24, Ilya Kasnacheev <
>>>>> ilya.kasnacheev@gmail.com> wrote:
>>>>>
>>>>>> Hello!
>>>>>>
>>>>>> This likely means that you have reachability problems in your
>>>>>> cluster, such as, xxx.xxx.xxx.68 can connect to xxx.xxx.xxx.82 (on
>>>>>> range 47100-47200) but not the other way around.
>>>>>>
>>>>>> Regards,
>>>>>> --
>>>>>> Ilya Kasnacheev
>>>>>>
>>>>>>
>>>>>> пн, 21 окт. 2019 г. в 19:36, John Smith <ja...@gmail.com>:
>>>>>>
>>>>>>> I also see this printing every few seconds on my client
>>>>>>> application...
>>>>>>> org.apache.ignite.spi.communication.tcp.TcpCommunicationSpi Accepted
>>>>>>> incoming communication connection [locAddr=/xxx.xxx.xxx.68:47101,
>>>>>>> rmtAddr=/xxx.xxx.xxx.82:49816
>>>>>>>
>>>>>>> On Mon, 21 Oct 2019 at 12:04, John Smith <ja...@gmail.com>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> Hi, thanks. I already made sure that each ignite VM runs on a
>>>>>>>> separate host within our cloud. I'm not doing any of that migration stuff.
>>>>>>>>
>>>>>>>> Also recently disabled metrics igniteConfig.setMetricsLogFrequency(
>>>>>>>> 0); just to make sure it doesn't get too chatty. But I doubt this
>>>>>>>> would affect it...
>>>>>>>>
>>>>>>>> Should I maybe set the node timeout a bit higher than 30 seconds,
>>>>>>>> maybe put it to 60 seconds? I remember people suggesting this but not
>>>>>>>> sure...
>>>>>>>>
>>>>>>>> On Fri, 18 Oct 2019 at 06:09, Denis Mekhanikov <
>>>>>>>> dmekhanikov@gmail.com> wrote:
>>>>>>>>
>>>>>>>>> The following documentation page has some useful points on
>>>>>>>>> deployment in a virtualised environment:
>>>>>>>>> https://apacheignite.readme.io/docs/vmware-deployment
>>>>>>>>>
>>>>>>>>> Denis
>>>>>>>>> On 17 Oct 2019, 17:41 +0300, John Smith <ja...@gmail.com>,
>>>>>>>>> wrote:
>>>>>>>>>
>>>>>>>>> Ok I have Metricbeat running on the VM; hopefully I will see
>>>>>>>>> something...
>>>>>>>>>
>>>>>>>>> On Thu, 17 Oct 2019 at 05:09, Denis Mekhanikov <
>>>>>>>>> dmekhanikov@gmail.com> wrote:
>>>>>>>>>
>>>>>>>>>> There are no long pauses in the GC logs, so it must be the whole
>>>>>>>>>> VM pause.
>>>>>>>>>>
>>>>>>>>>> Denis
>>>>>>>>>> On 16 Oct 2019, 23:07 +0300, John Smith <ja...@gmail.com>,
>>>>>>>>>> wrote:
>>>>>>>>>>
>>>>>>>>>> Sorry here is the gc logs for all 3 machines:
>>>>>>>>>> https://www.dropbox.com/s/chbbxigahd4v9di/gc-logs.zip?dl=0
>>>>>>>>>>
>>>>>>>>>> On Wed, 16 Oct 2019 at 15:49, John Smith <ja...@gmail.com>
>>>>>>>>>> wrote:
>>>>>>>>>>
>>>>>>>>>>> Hi, so it happened again here is my latest gc.log stats:
>>>>>>>>>>> https://gceasy.io/diamondgc-report.jsp?oTxnId_value=a215d573-d1cf-4d53-acf1-9001432bb28e
>>>>>>>>>>>
>>>>>>>>>>> Everything seems ok to me. I also have Elasticsearch Metricbeat
>>>>>>>>>>> running, the CPU usage looked normal at the time.
>>>>>>>>>>>
>>>>>>>>>>> On Thu, 10 Oct 2019 at 13:05, Denis Mekhanikov <
>>>>>>>>>>> dmekhanikov@gmail.com> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> Unfortunately, I don’t.
>>>>>>>>>>>> You can ask the VM vendor or the cloud provider (if you use
>>>>>>>>>>>> any) for a proper tooling or logs.
>>>>>>>>>>>> Make sure, that there is no such step in the VM’s lifecycle
>>>>>>>>>>>> that makes it freeze for a minute.
>>>>>>>>>>>> Also make sure that the physical CPU is not overutilized and no
>>>>>>>>>>>> VMs that run on it are starving.
>>>>>>>>>>>>
>>>>>>>>>>>> Denis
>>>>>>>>>>>> On 10 Oct 2019, 19:03 +0300, John Smith <ja...@gmail.com>,
>>>>>>>>>>>> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>> Do you know of any good tools I can use to check the VM?
>>>>>>>>>>>>
>>>>>>>>>>>> On Thu, 10 Oct 2019 at 11:38, Denis Mekhanikov <
>>>>>>>>>>>> dmekhanikov@gmail.com> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> > Hi Dennis, so are you saying I should enable GC logs + the
>>>>>>>>>>>>> safe point logs as well?
>>>>>>>>>>>>>
>>>>>>>>>>>>> Having safepoint statistics in your GC logs may be useful, so
>>>>>>>>>>>>> I recommend enabling them for troubleshooting purposes.
>>>>>>>>>>>>> Check the lifecycle of your virtual machines. There is a high
>>>>>>>>>>>>> chance that the whole machine is frozen, not just the Ignite node.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Denis
>>>>>>>>>>>>> On 10 Oct 2019, 18:25 +0300, John Smith <
>>>>>>>>>>>>> java.dev.mtl@gmail.com>, wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>> Hi Dennis, so are you saying I should enable GC logs + the
>>>>>>>>>>>>> safe point logs as well?
>>>>>>>>>>>>>
>>>>>>>>>>>>> On Thu, 10 Oct 2019 at 11:22, John Smith <
>>>>>>>>>>>>> java.dev.mtl@gmail.com> wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>>> You are correct, it is running in a VM.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> On Thu, 10 Oct 2019 at 10:11, Denis Mekhanikov <
>>>>>>>>>>>>>> dmekhanikov@gmail.com> wrote:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Hi!
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> There are the following messages in the logs:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> [22:26:21,816][WARNING][jvm-pause-detector-worker][IgniteKernal%xxxxxx]
>>>>>>>>>>>>>>> Possible too long JVM pause: *55705 milliseconds*.
>>>>>>>>>>>>>>> ...
>>>>>>>>>>>>>>> [22:26:21,847][SEVERE][ttl-cleanup-worker-#48%xxxxxx%][G]
>>>>>>>>>>>>>>> Blocked system-critical thread has been detected. This can lead to
>>>>>>>>>>>>>>> cluster-wide undefined behaviour [threadName=partition-exchanger,
>>>>>>>>>>>>>>> blockedFor=*57s*]
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Looks like the JVM was paused for almost a minute. It
>>>>>>>>>>>>>>> doesn’t seem to be caused by a garbage collection, since there is no
>>>>>>>>>>>>>>> evidence of GC pressure in the GC log. Usually such big pauses happen in
>>>>>>>>>>>>>>> virtualised environments when backups are captured from machines or they
>>>>>>>>>>>>>>> just don’t have enough CPU time.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Looking at safepoint statistics may also reveal some
>>>>>>>>>>>>>>> interesting details. You can learn about safepoints here:
>>>>>>>>>>>>>>> https://blog.gceasy.io/2016/12/22/total-time-for-which-application-threads-were-stopped/
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Denis
>>>>>>>>>>>>>>> On 9 Oct 2019, 23:14 +0300, John Smith <
>>>>>>>>>>>>>>> java.dev.mtl@gmail.com>, wrote:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> So the error sais to set clientFailureDetectionTimeout=30000
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> 1- Do I put a higher value than 30000?
>>>>>>>>>>>>>>> 2- Do I do it on the client or the server nodes or all nodes?
>>>>>>>>>>>>>>> 3- Also if a client is misbehaving why shutoff the server
>>>>>>>>>>>>>>> node?
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> On Thu, 3 Oct 2019 at 21:02, John Smith <
>>>>>>>>>>>>>>> java.dev.mtl@gmail.com> wrote:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> But if it's the client node that's failing why is the
>>>>>>>>>>>>>>>> server node stopping? I'm pretty sure we do verry simple put and get
>>>>>>>>>>>>>>>> operations. All the client nodes are started as client=true
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> On Thu., Oct. 3, 2019, 4:18 p.m. Denis Magda, <
>>>>>>>>>>>>>>>> dmagda@apache.org> wrote:
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Hi John,
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> I don't see any GC pressure or STW pauses either. If not
>>>>>>>>>>>>>>>>> GC then it might have been caused by a network glitch or some long-running
>>>>>>>>>>>>>>>>> operation started by the app. These logs statement
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> [22:26:21,827][WARNING][tcp-disco-client-message-worker-#10%xxxxxx%][TcpDiscoverySpi]
>>>>>>>>>>>>>>>>> Client node considered as unreachable and will be dropped from cluster,
>>>>>>>>>>>>>>>>> because no metrics update messages received in interval:
>>>>>>>>>>>>>>>>> TcpDiscoverySpi.clientFailureDetectionTimeout() ms. It may be caused by
>>>>>>>>>>>>>>>>> network problems or long GC pause on client node, try to increase this
>>>>>>>>>>>>>>>>> parameter. [nodeId=b07182d0-bf70-4318-9fe3-d7d5228bd6ef,
>>>>>>>>>>>>>>>>> clientFailureDetectionTimeout=30000]
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> [22:26:21,839][WARNING][tcp-disco-client-message-worker-#12%xxxxxx%][TcpDiscoverySpi]
>>>>>>>>>>>>>>>>> Client node considered as unreachable and will be dropped from cluster,
>>>>>>>>>>>>>>>>> because no metrics update messages received in interval:
>>>>>>>>>>>>>>>>> TcpDiscoverySpi.clientFailureDetectionTimeout() ms. It may be caused by
>>>>>>>>>>>>>>>>> network problems or long GC pause on client node, try to increase this
>>>>>>>>>>>>>>>>> parameter. [nodeId=302cff60-b88d-40da-9e12-b955e6bf973d,
>>>>>>>>>>>>>>>>> clientFailureDetectionTimeout=30000]
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> [22:26:21,847][SEVERE][ttl-cleanup-worker-#48%xxxxxx%][G]
>>>>>>>>>>>>>>>>> Blocked system-critical thread has been detected. This can lead to
>>>>>>>>>>>>>>>>> cluster-wide undefined behaviour [threadName=partition-exchanger,
>>>>>>>>>>>>>>>>> blockedFor=57s]
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> 22:26:21,954][SEVERE][ttl-cleanup-worker-#48%xxxxxx%][]
>>>>>>>>>>>>>>>>> Critical system error detected. Will be handled accordingly to configured
>>>>>>>>>>>>>>>>> handler [hnd=StopNodeOrHaltFailureHandler [tryStop=false, timeout=0,
>>>>>>>>>>>>>>>>> super=AbstractFailureHandler
>>>>>>>>>>>>>>>>> [ignoredFailureTypes=[SYSTEM_WORKER_BLOCKED]]], failureCtx=FailureContext
>>>>>>>>>>>>>>>>> [type=SYSTEM_WORKER_BLOCKED, err=class o.a.i.IgniteException: GridWorker
>>>>>>>>>>>>>>>>> [name=partition-exchanger, igniteInstanceName=xxxxxx, finished=false,
>>>>>>>>>>>>>>>>> heartbeatTs=1568931981805]]]
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> -
>>>>>>>>>>>>>>>>> Denis
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> On Thu, Oct 3, 2019 at 11:50 AM John Smith <
>>>>>>>>>>>>>>>>> java.dev.mtl@gmail.com> wrote:
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> So I have been monitoring my node and the same one seems
>>>>>>>>>>>>>>>>>> to stop once in a while.
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> https://www.dropbox.com/s/7n5qfsl5uyi1obt/ignite-logs.zip?dl=0
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> I have attached the GC logs and the ignite logs. From
>>>>>>>>>>>>>>>>>> what I see from gc.logs I don't see big pauses. I could be wrong.
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> The machine is 16GB and I have the configs here:
>>>>>>>>>>>>>>>>>> https://www.dropbox.com/s/hkv38s3vce5a4sk/ignite-config.xml?dl=0
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> Here are the JVM settings...
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> if [ -z "$JVM_OPTS" ] ; then
>>>>>>>>>>>>>>>>>>     JVM_OPTS="-Xms2g -Xmx2g -server
>>>>>>>>>>>>>>>>>> -XX:MaxMetaspaceSize=256m"
>>>>>>>>>>>>>>>>>> fi
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> JVM_OPTS="$JVM_OPTS -XX:+UseG1GC -verbose:gc
>>>>>>>>>>>>>>>>>> -XX:+PrintGCDetails -Xloggc:/var/log/apache-ignite/gc.log"
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> JVM_OPTS="${JVM_OPTS} -Xss16m"
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>

Re: Node stopped.

Posted by John Smith <ja...@gmail.com>.
Hi, in my original post I had included full logs of the stopped node...

And when you say to set localhost, which property do you mean? Is it
IgniteConfiguration.setLocalHost()?

On Tue, 29 Oct 2019 at 03:48, Ilya Kasnacheev <il...@gmail.com>
wrote:

> Hello!
>
> I think it is possible (for example, a node tries to connect to the wrong
> address or to itself), but a more detailed analysis of the logs is needed.
>
> You can try specifying localHost property in IgniteConfiguration to make
> sure correct address is used.
>
> Regards,
> --
> Ilya Kasnacheev
>
>
> пн, 28 окт. 2019 г. в 23:16, John Smith <ja...@gmail.com>:
>
>> Ok cool. Thanks. I'm just trying to figure out scenarios where it may happen.
>>
>> When I connect with Ignite Visor and run the "top" topology command, the
>> client node is bound to multiple addresses, about 12 of them...
>>
>> Ignite is running on plain regular VMs while the client node is running
>> in a containerized env but using host network (which is regular VM).
>>
>> So I'm wondering if that could also mess something up, by trying to
>> connect to one of those multiple addresses?
>>
>> On Mon, 28 Oct 2019 at 12:08, Ilya Kasnacheev <il...@gmail.com>
>> wrote:
>>
>>> Hello!
>>>
>>> Sure, if nodes can't reach each other, eventually they may segment and
>>> stop.
>>>
>>> Regards,
>>> --
>>> Ilya Kasnacheev
>>>
>>>
>>> пт, 25 окт. 2019 г. в 00:08, John Smith <ja...@gmail.com>:
>>>
>>>> Is it possible this is somehow causing the issue of the node stopping?
>>>>
>>>> On Thu, 24 Oct 2019 at 11:24, Ilya Kasnacheev <
>>>> ilya.kasnacheev@gmail.com> wrote:
>>>>
>>>>> Hello!
>>>>>
>>>>> This likely means that you have reachability problems in your cluster,
>>>>> such as, xxx.xxx.xxx.68 can connect to xxx.xxx.xxx.82 (on range
>>>>> 47100-47200) but not the other way around.
>>>>>
>>>>> Regards,
>>>>> --
>>>>> Ilya Kasnacheev
>>>>>
>>>>>
>>>>> пн, 21 окт. 2019 г. в 19:36, John Smith <ja...@gmail.com>:
>>>>>
>>>>>> I also see this printing every few seconds on my client application...
>>>>>> org.apache.ignite.spi.communication.tcp.TcpCommunicationSpi Accepted
>>>>>> incoming communication connection [locAddr=/xxx.xxx.xxx.68:47101,
>>>>>> rmtAddr=/xxx.xxx.xxx.82:49816
>>>>>>
>>>>>> On Mon, 21 Oct 2019 at 12:04, John Smith <ja...@gmail.com>
>>>>>> wrote:
>>>>>>
>>>>>>> Hi, thanks. I already made sure that each ignite VM runs on a
>>>>>>> separate host within our cloud. I'm not doing any of that migration stuff.
>>>>>>>
>>>>>>> Also recently disabled metrics igniteConfig.setMetricsLogFrequency(0
>>>>>>> 0); just to make sure it doesn't get too chatty. But I doubt this
>>>>>>> would affect it...
>>>>>>>
>>>>>>> Should I maybe set the node timeout a bit higher than 30 seconds,
>>>>>>> maybe put it to 60 seconds? I remember people suggesting this but not
>>>>>>> sure...
>>>>>>>
>>>>>>> On Fri, 18 Oct 2019 at 06:09, Denis Mekhanikov <
>>>>>>> dmekhanikov@gmail.com> wrote:
>>>>>>>
>>>>>>>> The following documentation page has some useful points on
>>>>>>>> deployment in a virtualised environment:
>>>>>>>> https://apacheignite.readme.io/docs/vmware-deployment
>>>>>>>>
>>>>>>>> Denis
>>>>>>>> On 17 Oct 2019, 17:41 +0300, John Smith <ja...@gmail.com>,
>>>>>>>> wrote:
>>>>>>>>
>>>>>>>> Ok I have Metricbeat running on the VM; hopefully I will see
>>>>>>>> something...
>>>>>>>>
>>>>>>>> On Thu, 17 Oct 2019 at 05:09, Denis Mekhanikov <
>>>>>>>> dmekhanikov@gmail.com> wrote:
>>>>>>>>
>>>>>>>>> There are no long pauses in the GC logs, so it must be the whole
>>>>>>>>> VM pause.
>>>>>>>>>
>>>>>>>>> Denis
>>>>>>>>> On 16 Oct 2019, 23:07 +0300, John Smith <ja...@gmail.com>,
>>>>>>>>> wrote:
>>>>>>>>>
>>>>>>>>> Sorry here is the gc logs for all 3 machines:
>>>>>>>>> https://www.dropbox.com/s/chbbxigahd4v9di/gc-logs.zip?dl=0
>>>>>>>>>
>>>>>>>>> On Wed, 16 Oct 2019 at 15:49, John Smith <ja...@gmail.com>
>>>>>>>>> wrote:
>>>>>>>>>
>>>>>>>>>> Hi, so it happened again here is my latest gc.log stats:
>>>>>>>>>> https://gceasy.io/diamondgc-report.jsp?oTxnId_value=a215d573-d1cf-4d53-acf1-9001432bb28e
>>>>>>>>>>
>>>>>>>>>> Everything seems ok to me. I also have Elasticsearch Metricbeat
>>>>>>>>>> running, the CPU usage looked normal at the time.
>>>>>>>>>>
>>>>>>>>>> On Thu, 10 Oct 2019 at 13:05, Denis Mekhanikov <
>>>>>>>>>> dmekhanikov@gmail.com> wrote:
>>>>>>>>>>
>>>>>>>>>>> Unfortunately, I don’t.
>>>>>>>>>>> You can ask the VM vendor or the cloud provider (if you use any)
>>>>>>>>>>> for a proper tooling or logs.
>>>>>>>>>>> Make sure, that there is no such step in the VM’s lifecycle that
>>>>>>>>>>> makes it freeze for a minute.
>>>>>>>>>>> Also make sure that the physical CPU is not overutilized and no
>>>>>>>>>>> VMs that run on it are starving.
>>>>>>>>>>>
>>>>>>>>>>> Denis
>>>>>>>>>>> On 10 Oct 2019, 19:03 +0300, John Smith <ja...@gmail.com>,
>>>>>>>>>>> wrote:
>>>>>>>>>>>
>>>>>>>>>>> Do you know of any good tools I can use to check the VM?
>>>>>>>>>>>
>>>>>>>>>>> On Thu, 10 Oct 2019 at 11:38, Denis Mekhanikov <
>>>>>>>>>>> dmekhanikov@gmail.com> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> > Hi Dennis, so are you saying I should enable GC logs + the
>>>>>>>>>>>> safe point logs as well?
>>>>>>>>>>>>
>>>>>>>>>>>> Having safepoint statistics in your GC logs may be useful, so I
>>>>>>>>>>>> recommend enabling them for troubleshooting purposes.
>>>>>>>>>>>> Check the lifecycle of your virtual machines. There is a high
>>>>>>>>>>>> chance that the whole machine is frozen, not just the Ignite node.
>>>>>>>>>>>>
>>>>>>>>>>>> Denis
>>>>>>>>>>>> On 10 Oct 2019, 18:25 +0300, John Smith <ja...@gmail.com>,
>>>>>>>>>>>> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>> Hi Dennis, so are you saying I should enable GC logs + the safe
>>>>>>>>>>>> point logs as well?
>>>>>>>>>>>>
>>>>>>>>>>>> On Thu, 10 Oct 2019 at 11:22, John Smith <
>>>>>>>>>>>> java.dev.mtl@gmail.com> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> You are correct, it is running in a VM.
>>>>>>>>>>>>>
>>>>>>>>>>>>> On Thu, 10 Oct 2019 at 10:11, Denis Mekhanikov <
>>>>>>>>>>>>> dmekhanikov@gmail.com> wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>>> Hi!
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> There are the following messages in the logs:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> [22:26:21,816][WARNING][jvm-pause-detector-worker][IgniteKernal%xxxxxx]
>>>>>>>>>>>>>> Possible too long JVM pause: *55705 milliseconds*.
>>>>>>>>>>>>>> ...
>>>>>>>>>>>>>> [22:26:21,847][SEVERE][ttl-cleanup-worker-#48%xxxxxx%][G]
>>>>>>>>>>>>>> Blocked system-critical thread has been detected. This can lead to
>>>>>>>>>>>>>> cluster-wide undefined behaviour [threadName=partition-exchanger,
>>>>>>>>>>>>>> blockedFor=*57s*]
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Looks like the JVM was paused for almost a minute. It doesn’t
>>>>>>>>>>>>>> seem to be caused by a garbage collection, since there is no evidence of GC
>>>>>>>>>>>>>> pressure in the GC log. Usually such big pauses happen in virtualised
>>>>>>>>>>>>>> environments when backups are captured from machines or they just don’t
>>>>>>>>>>>>>> have enough CPU time.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Looking at safepoint statistics may also reveal some
>>>>>>>>>>>>>> interesting details. You can learn about safepoints here:
>>>>>>>>>>>>>> https://blog.gceasy.io/2016/12/22/total-time-for-which-application-threads-were-stopped/
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Denis
>>>>>>>>>>>>>> On 9 Oct 2019, 23:14 +0300, John Smith <
>>>>>>>>>>>>>> java.dev.mtl@gmail.com>, wrote:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> So the error sais to set clientFailureDetectionTimeout=30000
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> 1- Do I put a higher value than 30000?
>>>>>>>>>>>>>> 2- Do I do it on the client or the server nodes or all nodes?
>>>>>>>>>>>>>> 3- Also if a client is misbehaving why shutoff the server
>>>>>>>>>>>>>> node?
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> On Thu, 3 Oct 2019 at 21:02, John Smith <
>>>>>>>>>>>>>> java.dev.mtl@gmail.com> wrote:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> But if it's the client node that's failing why is the server
>>>>>>>>>>>>>>> node stopping? I'm pretty sure we do verry simple put and get operations.
>>>>>>>>>>>>>>> All the client nodes are started as client=true
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> On Thu., Oct. 3, 2019, 4:18 p.m. Denis Magda, <
>>>>>>>>>>>>>>> dmagda@apache.org> wrote:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Hi John,
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> I don't see any GC pressure or STW pauses either. If not GC
>>>>>>>>>>>>>>>> then it might have been caused by a network glitch or some long-running
>>>>>>>>>>>>>>>> operation started by the app. These logs statement
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> [22:26:21,827][WARNING][tcp-disco-client-message-worker-#10%xxxxxx%][TcpDiscoverySpi]
>>>>>>>>>>>>>>>> Client node considered as unreachable and will be dropped from cluster,
>>>>>>>>>>>>>>>> because no metrics update messages received in interval:
>>>>>>>>>>>>>>>> TcpDiscoverySpi.clientFailureDetectionTimeout() ms. It may be caused by
>>>>>>>>>>>>>>>> network problems or long GC pause on client node, try to increase this
>>>>>>>>>>>>>>>> parameter. [nodeId=b07182d0-bf70-4318-9fe3-d7d5228bd6ef,
>>>>>>>>>>>>>>>> clientFailureDetectionTimeout=30000]
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> [22:26:21,839][WARNING][tcp-disco-client-message-worker-#12%xxxxxx%][TcpDiscoverySpi]
>>>>>>>>>>>>>>>> Client node considered as unreachable and will be dropped from cluster,
>>>>>>>>>>>>>>>> because no metrics update messages received in interval:
>>>>>>>>>>>>>>>> TcpDiscoverySpi.clientFailureDetectionTimeout() ms. It may be caused by
>>>>>>>>>>>>>>>> network problems or long GC pause on client node, try to increase this
>>>>>>>>>>>>>>>> parameter. [nodeId=302cff60-b88d-40da-9e12-b955e6bf973d,
>>>>>>>>>>>>>>>> clientFailureDetectionTimeout=30000]
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> [22:26:21,847][SEVERE][ttl-cleanup-worker-#48%xxxxxx%][G]
>>>>>>>>>>>>>>>> Blocked system-critical thread has been detected. This can lead to
>>>>>>>>>>>>>>>> cluster-wide undefined behaviour [threadName=partition-exchanger,
>>>>>>>>>>>>>>>> blockedFor=57s]
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> 22:26:21,954][SEVERE][ttl-cleanup-worker-#48%xxxxxx%][]
>>>>>>>>>>>>>>>> Critical system error detected. Will be handled accordingly to configured
>>>>>>>>>>>>>>>> handler [hnd=StopNodeOrHaltFailureHandler [tryStop=false, timeout=0,
>>>>>>>>>>>>>>>> super=AbstractFailureHandler
>>>>>>>>>>>>>>>> [ignoredFailureTypes=[SYSTEM_WORKER_BLOCKED]]], failureCtx=FailureContext
>>>>>>>>>>>>>>>> [type=SYSTEM_WORKER_BLOCKED, err=class o.a.i.IgniteException: GridWorker
>>>>>>>>>>>>>>>> [name=partition-exchanger, igniteInstanceName=xxxxxx, finished=false,
>>>>>>>>>>>>>>>> heartbeatTs=1568931981805]]]
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> -
>>>>>>>>>>>>>>>> Denis
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> On Thu, Oct 3, 2019 at 11:50 AM John Smith <
>>>>>>>>>>>>>>>> java.dev.mtl@gmail.com> wrote:
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> So I have been monitoring my node and the same one seems
>>>>>>>>>>>>>>>>> to stop once in a while.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> https://www.dropbox.com/s/7n5qfsl5uyi1obt/ignite-logs.zip?dl=0
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> I have attached the GC logs and the ignite logs. From what
>>>>>>>>>>>>>>>>> I see from gc.logs I don't see big pauses. I could be wrong.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> The machine is 16GB and I have the configs here:
>>>>>>>>>>>>>>>>> https://www.dropbox.com/s/hkv38s3vce5a4sk/ignite-config.xml?dl=0
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Here are the JVM settings...
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> if [ -z "$JVM_OPTS" ] ; then
>>>>>>>>>>>>>>>>>     JVM_OPTS="-Xms2g -Xmx2g -server
>>>>>>>>>>>>>>>>> -XX:MaxMetaspaceSize=256m"
>>>>>>>>>>>>>>>>> fi
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> JVM_OPTS="$JVM_OPTS -XX:+UseG1GC -verbose:gc
>>>>>>>>>>>>>>>>> -XX:+PrintGCDetails -Xloggc:/var/log/apache-ignite/gc.log"
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> JVM_OPTS="${JVM_OPTS} -Xss16m"
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>

Re: Node stopped.

Posted by Ilya Kasnacheev <il...@gmail.com>.
Hello!

I think it is possible (for example, a node tries to connect to the wrong
address or to itself), but a more detailed analysis of the logs is needed.

You can try specifying localHost property in IgniteConfiguration to make
sure correct address is used.
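
As a rough illustration (Java configuration assumed; the class name and the
address are placeholders for your own), that could look like this:

import org.apache.ignite.Ignite;
import org.apache.ignite.Ignition;
import org.apache.ignite.configuration.IgniteConfiguration;

public class LocalHostConfig {
    public static void main(String[] args) {
        IgniteConfiguration cfg = new IgniteConfiguration();

        // Bind discovery and communication to one known-good address
        // instead of publishing every interface the host has.
        cfg.setLocalHost("10.0.0.5"); // placeholder address

        Ignite ignite = Ignition.start(cfg);
    }
}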

Regards,
-- 
Ilya Kasnacheev


Mon, 28 Oct 2019 at 23:16, John Smith <ja...@gmail.com>:

> Ok cool. Thanks. I'm just trying to figure out scenarios where it may happen.
>
> When I connect with Ignite Visor and run the "top" topology command, the
> client node is bound to multiple addresses, about 12 of them...
>
> Ignite is running on plain regular VMs while the client node is running in
> a containerized env but using host network (which is regular VM).
>
> So I'm wondering if that could also mess something up, by trying to
> connect to one of those multiple addresses?
>
> On Mon, 28 Oct 2019 at 12:08, Ilya Kasnacheev <il...@gmail.com>
> wrote:
>
>> Hello!
>>
>> Sure, if nodes can't reach each other, eventually they may segment and
>> stop.
>>
>> Regards,
>> --
>> Ilya Kasnacheev
>>
>>
>> пт, 25 окт. 2019 г. в 00:08, John Smith <ja...@gmail.com>:
>>
>>> Is it possible this is somehow causing the issue of the node stopping?
>>>
>>> On Thu, 24 Oct 2019 at 11:24, Ilya Kasnacheev <il...@gmail.com>
>>> wrote:
>>>
>>>> Hello!
>>>>
>>>> This likely means that you have reachability problems in your cluster,
>>>> such as, xxx.xxx.xxx.68 can connect to xxx.xxx.xxx.82 (on range
>>>> 47100-47200) but not the other way around.
>>>>
>>>> Regards,
>>>> --
>>>> Ilya Kasnacheev
>>>>
>>>>
>>>> пн, 21 окт. 2019 г. в 19:36, John Smith <ja...@gmail.com>:
>>>>
>>>>> I also see this printing every few seconds on my client application...
>>>>> org.apache.ignite.spi.communication.tcp.TcpCommunicationSpi Accepted
>>>>> incoming communication connection [locAddr=/xxx.xxx.xxx.68:47101,
>>>>> rmtAddr=/xxx.xxx.xxx.82:49816
>>>>>
>>>>> On Mon, 21 Oct 2019 at 12:04, John Smith <ja...@gmail.com>
>>>>> wrote:
>>>>>
>>>>>> Hi, thanks. I already made sure that each ignite VM runs on a
>>>>>> separate host within our cloud. I'm not doing any of that migration stuff.
>>>>>>
>>>>>> Also recently disabled metrics igniteConfig.setMetricsLogFrequency(0);
>>>>>> just to make sure it doesn't get too chatty. But I doubt this would affect
>>>>>> it...
>>>>>>
>>>>>> Should I maybe set the node timeout a bit higher than 30 seconds,
>>>>>> maybe put it to 60 seconds? I remember people suggesting this but not
>>>>>> sure...
>>>>>>
>>>>>> On Fri, 18 Oct 2019 at 06:09, Denis Mekhanikov <dm...@gmail.com>
>>>>>> wrote:
>>>>>>
>>>>>>> The following documentation page has some useful points on
>>>>>>> deployment in a virtualised environment:
>>>>>>> https://apacheignite.readme.io/docs/vmware-deployment
>>>>>>>
>>>>>>> Denis
>>>>>>> On 17 Oct 2019, 17:41 +0300, John Smith <ja...@gmail.com>,
>>>>>>> wrote:
>>>>>>>
>>>>>>> Ok I have Metricbeat running on the VM; hopefully I will see
>>>>>>> something...
>>>>>>>
>>>>>>> On Thu, 17 Oct 2019 at 05:09, Denis Mekhanikov <
>>>>>>> dmekhanikov@gmail.com> wrote:
>>>>>>>
>>>>>>>> There are no long pauses in the GC logs, so it must be the whole VM
>>>>>>>> pause.
>>>>>>>>
>>>>>>>> Denis
>>>>>>>> On 16 Oct 2019, 23:07 +0300, John Smith <ja...@gmail.com>,
>>>>>>>> wrote:
>>>>>>>>
>>>>>>>> Sorry here is the gc logs for all 3 machines:
>>>>>>>> https://www.dropbox.com/s/chbbxigahd4v9di/gc-logs.zip?dl=0
>>>>>>>>
>>>>>>>> On Wed, 16 Oct 2019 at 15:49, John Smith <ja...@gmail.com>
>>>>>>>> wrote:
>>>>>>>>
>>>>>>>>> Hi, so it happened again here is my latest gc.log stats:
>>>>>>>>> https://gceasy.io/diamondgc-report.jsp?oTxnId_value=a215d573-d1cf-4d53-acf1-9001432bb28e
>>>>>>>>>
>>>>>>>>> Everything seems ok to me. I also have Elasticsearch Metricbeat
>>>>>>>>> running, the CPU usage looked normal at the time.
>>>>>>>>>
>>>>>>>>> On Thu, 10 Oct 2019 at 13:05, Denis Mekhanikov <
>>>>>>>>> dmekhanikov@gmail.com> wrote:
>>>>>>>>>
>>>>>>>>>> Unfortunately, I don’t.
>>>>>>>>>> You can ask the VM vendor or the cloud provider (if you use any)
>>>>>>>>>> for a proper tooling or logs.
>>>>>>>>>> Make sure, that there is no such step in the VM’s lifecycle that
>>>>>>>>>> makes it freeze for a minute.
>>>>>>>>>> Also make sure that the physical CPU is not overutilized and no
>>>>>>>>>> VMs that run on it are starving.
>>>>>>>>>>
>>>>>>>>>> Denis
>>>>>>>>>> On 10 Oct 2019, 19:03 +0300, John Smith <ja...@gmail.com>,
>>>>>>>>>> wrote:
>>>>>>>>>>
>>>>>>>>>> Do you know of any good tools I can use to check the VM?
>>>>>>>>>>
>>>>>>>>>> On Thu, 10 Oct 2019 at 11:38, Denis Mekhanikov <
>>>>>>>>>> dmekhanikov@gmail.com> wrote:
>>>>>>>>>>
>>>>>>>>>>> > Hi Dennis, so are you saying I should enable GC logs + the
>>>>>>>>>>> safe point logs as well?
>>>>>>>>>>>
>>>>>>>>>>> Having safepoint statistics in your GC logs may be useful, so I
>>>>>>>>>>> recommend enabling them for troubleshooting purposes.
>>>>>>>>>>> Check the lifecycle of your virtual machines. There is a high
>>>>>>>>>>> chance that the whole machine is frozen, not just the Ignite node.
>>>>>>>>>>>
>>>>>>>>>>> Denis
>>>>>>>>>>> On 10 Oct 2019, 18:25 +0300, John Smith <ja...@gmail.com>,
>>>>>>>>>>> wrote:
>>>>>>>>>>>
>>>>>>>>>>> Hi Dennis, so are you saying I should enable GC logs + the safe
>>>>>>>>>>> point logs as well?
>>>>>>>>>>>
>>>>>>>>>>> On Thu, 10 Oct 2019 at 11:22, John Smith <ja...@gmail.com>
>>>>>>>>>>> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> You are correct, it is running in a VM.
>>>>>>>>>>>>
>>>>>>>>>>>> On Thu, 10 Oct 2019 at 10:11, Denis Mekhanikov <
>>>>>>>>>>>> dmekhanikov@gmail.com> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> Hi!
>>>>>>>>>>>>>
>>>>>>>>>>>>> There are the following messages in the logs:
>>>>>>>>>>>>>
>>>>>>>>>>>>> [22:26:21,816][WARNING][jvm-pause-detector-worker][IgniteKernal%xxxxxx]
>>>>>>>>>>>>> Possible too long JVM pause: *55705 milliseconds*.
>>>>>>>>>>>>> ...
>>>>>>>>>>>>> [22:26:21,847][SEVERE][ttl-cleanup-worker-#48%xxxxxx%][G]
>>>>>>>>>>>>> Blocked system-critical thread has been detected. This can lead to
>>>>>>>>>>>>> cluster-wide undefined behaviour [threadName=partition-exchanger,
>>>>>>>>>>>>> blockedFor=*57s*]
>>>>>>>>>>>>>
>>>>>>>>>>>>> Looks like the JVM was paused for almost a minute. It doesn’t
>>>>>>>>>>>>> seem to be caused by a garbage collection, since there is no evidence of GC
>>>>>>>>>>>>> pressure in the GC log. Usually such big pauses happen in virtualised
>>>>>>>>>>>>> environments when backups are captured from machines or they just don’t
>>>>>>>>>>>>> have enough CPU time.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Looking at safepoint statistics may also reveal some
>>>>>>>>>>>>> interesting details. You can learn about safepoints here:
>>>>>>>>>>>>> https://blog.gceasy.io/2016/12/22/total-time-for-which-application-threads-were-stopped/
>>>>>>>>>>>>>
>>>>>>>>>>>>> Denis
>>>>>>>>>>>>> On 9 Oct 2019, 23:14 +0300, John Smith <ja...@gmail.com>,
>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>> So the error sais to set clientFailureDetectionTimeout=30000
>>>>>>>>>>>>>
>>>>>>>>>>>>> 1- Do I put a higher value than 30000?
>>>>>>>>>>>>> 2- Do I do it on the client or the server nodes or all nodes?
>>>>>>>>>>>>> 3- Also if a client is misbehaving why shutoff the server node?
>>>>>>>>>>>>>
>>>>>>>>>>>>> On Thu, 3 Oct 2019 at 21:02, John Smith <
>>>>>>>>>>>>> java.dev.mtl@gmail.com> wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>>> But if it's the client node that's failing why is the server
>>>>>>>>>>>>>> node stopping? I'm pretty sure we do verry simple put and get operations.
>>>>>>>>>>>>>> All the client nodes are started as client=true
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> On Thu., Oct. 3, 2019, 4:18 p.m. Denis Magda, <
>>>>>>>>>>>>>> dmagda@apache.org> wrote:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Hi John,
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> I don't see any GC pressure or STW pauses either. If not GC
>>>>>>>>>>>>>>> then it might have been caused by a network glitch or some long-running
>>>>>>>>>>>>>>> operation started by the app. These logs statement
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> [22:26:21,827][WARNING][tcp-disco-client-message-worker-#10%xxxxxx%][TcpDiscoverySpi]
>>>>>>>>>>>>>>> Client node considered as unreachable and will be dropped from cluster,
>>>>>>>>>>>>>>> because no metrics update messages received in interval:
>>>>>>>>>>>>>>> TcpDiscoverySpi.clientFailureDetectionTimeout() ms. It may be caused by
>>>>>>>>>>>>>>> network problems or long GC pause on client node, try to increase this
>>>>>>>>>>>>>>> parameter. [nodeId=b07182d0-bf70-4318-9fe3-d7d5228bd6ef,
>>>>>>>>>>>>>>> clientFailureDetectionTimeout=30000]
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> [22:26:21,839][WARNING][tcp-disco-client-message-worker-#12%xxxxxx%][TcpDiscoverySpi]
>>>>>>>>>>>>>>> Client node considered as unreachable and will be dropped from cluster,
>>>>>>>>>>>>>>> because no metrics update messages received in interval:
>>>>>>>>>>>>>>> TcpDiscoverySpi.clientFailureDetectionTimeout() ms. It may be caused by
>>>>>>>>>>>>>>> network problems or long GC pause on client node, try to increase this
>>>>>>>>>>>>>>> parameter. [nodeId=302cff60-b88d-40da-9e12-b955e6bf973d,
>>>>>>>>>>>>>>> clientFailureDetectionTimeout=30000]
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> [22:26:21,847][SEVERE][ttl-cleanup-worker-#48%xxxxxx%][G]
>>>>>>>>>>>>>>> Blocked system-critical thread has been detected. This can lead to
>>>>>>>>>>>>>>> cluster-wide undefined behaviour [threadName=partition-exchanger,
>>>>>>>>>>>>>>> blockedFor=57s]
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> 22:26:21,954][SEVERE][ttl-cleanup-worker-#48%xxxxxx%][]
>>>>>>>>>>>>>>> Critical system error detected. Will be handled accordingly to configured
>>>>>>>>>>>>>>> handler [hnd=StopNodeOrHaltFailureHandler [tryStop=false, timeout=0,
>>>>>>>>>>>>>>> super=AbstractFailureHandler
>>>>>>>>>>>>>>> [ignoredFailureTypes=[SYSTEM_WORKER_BLOCKED]]], failureCtx=FailureContext
>>>>>>>>>>>>>>> [type=SYSTEM_WORKER_BLOCKED, err=class o.a.i.IgniteException: GridWorker
>>>>>>>>>>>>>>> [name=partition-exchanger, igniteInstanceName=xxxxxx, finished=false,
>>>>>>>>>>>>>>> heartbeatTs=1568931981805]]]
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> -
>>>>>>>>>>>>>>> Denis
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> On Thu, Oct 3, 2019 at 11:50 AM John Smith <
>>>>>>>>>>>>>>> java.dev.mtl@gmail.com> wrote:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> So I have been monitoring my node and the same one seems to
>>>>>>>>>>>>>>>> stop once in a while.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> https://www.dropbox.com/s/7n5qfsl5uyi1obt/ignite-logs.zip?dl=0
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> I have attached the GC logs and the ignite logs. From what
>>>>>>>>>>>>>>>> I see from gc.logs I don't see big pauses. I could be wrong.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> The machine is 16GB and I have the configs here:
>>>>>>>>>>>>>>>> https://www.dropbox.com/s/hkv38s3vce5a4sk/ignite-config.xml?dl=0
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Here are the JVM settings...
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> if [ -z "$JVM_OPTS" ] ; then
>>>>>>>>>>>>>>>>     JVM_OPTS="-Xms2g -Xmx2g -server
>>>>>>>>>>>>>>>> -XX:MaxMetaspaceSize=256m"
>>>>>>>>>>>>>>>> fi
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> JVM_OPTS="$JVM_OPTS -XX:+UseG1GC -verbose:gc
>>>>>>>>>>>>>>>> -XX:+PrintGCDetails -Xloggc:/var/log/apache-ignite/gc.log"
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> JVM_OPTS="${JVM_OPTS} -Xss16m"
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>

Re: Node stopped.

Posted by John Smith <ja...@gmail.com>.
Ok cool. Thanks. I'm just trying to figure out scenarios where it may happen.

When I connect with Ignite Visor and run the "top" topology command, the
client node is bound to multiple addresses, about 12 of them...

Ignite is running on plain regular VMs while the client node is running in
a containerized env but using the host network (which is a regular VM).

So I'm wondering if that could also mess something up, by trying to connect
to one of those multiple addresses?

On Mon, 28 Oct 2019 at 12:08, Ilya Kasnacheev <il...@gmail.com>
wrote:

> Hello!
>
> Sure, if nodes can't reach each other, eventually they may segment and
> stop.
>
> Regards,
> --
> Ilya Kasnacheev
>
>
> пт, 25 окт. 2019 г. в 00:08, John Smith <ja...@gmail.com>:
>
>> Is it possible this is somehow causing the issue of the node stopping?
>>
>> On Thu, 24 Oct 2019 at 11:24, Ilya Kasnacheev <il...@gmail.com>
>> wrote:
>>
>>> Hello!
>>>
>>> This likely means that you have reachability problems in your cluster,
>>> such as, xxx.xxx.xxx.68 can connect to xxx.xxx.xxx.82 (on range
>>> 47100-47200) but not the other way around.
>>>
>>> Regards,
>>> --
>>> Ilya Kasnacheev
>>>
>>>
>>> пн, 21 окт. 2019 г. в 19:36, John Smith <ja...@gmail.com>:
>>>
>>>> I also see this printing every few seconds on my client application...
>>>> org.apache.ignite.spi.communication.tcp.TcpCommunicationSpi Accepted
>>>> incoming communication connection [locAddr=/xxx.xxx.xxx.68:47101,
>>>> rmtAddr=/xxx.xxx.xxx.82:49816
>>>>
>>>> On Mon, 21 Oct 2019 at 12:04, John Smith <ja...@gmail.com>
>>>> wrote:
>>>>
>>>>> Hi, thanks. I already made sure that each ignite VM runs on a separate
>>>>> host within our cloud. I'm not doing any of that migration stuff.
>>>>>
>>>>> Also recently disabled metrics igniteConfig.setMetricsLogFrequency(0);
>>>>> just to make sure it doesn't get too chatty. But I doubt this would affect
>>>>> it...
>>>>>
>>>>> Should I maybe set the node timeout a bit higher than 30 seconds, maybe
>>>>> put it to 60 seconds? I remember people suggesting this but not sure...
>>>>>
>>>>> On Fri, 18 Oct 2019 at 06:09, Denis Mekhanikov <dm...@gmail.com>
>>>>> wrote:
>>>>>
>>>>>> The following documentation page has some useful points on deployment
>>>>>> in a virtualised environment:
>>>>>> https://apacheignite.readme.io/docs/vmware-deployment
>>>>>>
>>>>>> Denis
>>>>>> On 17 Oct 2019, 17:41 +0300, John Smith <ja...@gmail.com>,
>>>>>> wrote:
>>>>>>
>>>>>> Ok I have Metricbeat running on the VM; hopefully I will see
>>>>>> something...
>>>>>>
>>>>>> On Thu, 17 Oct 2019 at 05:09, Denis Mekhanikov <dm...@gmail.com>
>>>>>> wrote:
>>>>>>
>>>>>>> There are no long pauses in the GC logs, so it must be the whole VM
>>>>>>> pause.
>>>>>>>
>>>>>>> Denis
>>>>>>> On 16 Oct 2019, 23:07 +0300, John Smith <ja...@gmail.com>,
>>>>>>> wrote:
>>>>>>>
>>>>>>> Sorry, here are the gc logs for all 3 machines:
>>>>>>> https://www.dropbox.com/s/chbbxigahd4v9di/gc-logs.zip?dl=0
>>>>>>>
>>>>>>> On Wed, 16 Oct 2019 at 15:49, John Smith <ja...@gmail.com>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> Hi, so it happened again; here is my latest gc.log stats:
>>>>>>>> https://gceasy.io/diamondgc-report.jsp?oTxnId_value=a215d573-d1cf-4d53-acf1-9001432bb28e
>>>>>>>>
>>>>>>>> Everything seems ok to me. I also have Elasticsearch Metricbeat
>>>>>>>> running, the CPU usage looked normal at the time.
>>>>>>>>
>>>>>>>> On Thu, 10 Oct 2019 at 13:05, Denis Mekhanikov <
>>>>>>>> dmekhanikov@gmail.com> wrote:
>>>>>>>>
>>>>>>>>> Unfortunately, I don’t.
>>>>>>>>> You can ask the VM vendor or the cloud provider (if you use any)
>>>>>>>>> for proper tooling or logs.
>>>>>>>>> Make sure that there is no step in the VM’s lifecycle that
>>>>>>>>> makes it freeze for a minute.
>>>>>>>>> Also make sure that the physical CPU is not overutilized and no
>>>>>>>>> VMs that run on it are starving.
>>>>>>>>>
>>>>>>>>> Denis
>>>>>>>>> On 10 Oct 2019, 19:03 +0300, John Smith <ja...@gmail.com>,
>>>>>>>>> wrote:
>>>>>>>>>
>>>>>>>>> Do you know of any good tools I can use to check the VM?
>>>>>>>>>
>>>>>>>>> On Thu, 10 Oct 2019 at 11:38, Denis Mekhanikov <
>>>>>>>>> dmekhanikov@gmail.com> wrote:
>>>>>>>>>
>>>>>>>>>> > Hi Dennis, so are you saying I should enable GC logs + the safe
>>>>>>>>>> point logs as well?
>>>>>>>>>>
>>>>>>>>>> Having safepoint statistics in your GC logs may be useful, so I
>>>>>>>>>> recommend enabling them for troubleshooting purposes.
>>>>>>>>>> Check the lifecycle of your virtual machines. There is a high
>>>>>>>>>> chance that the whole machine is frozen, not just the Ignite node.
>>>>>>>>>>
>>>>>>>>>> Denis
>>>>>>>>>> On 10 Oct 2019, 18:25 +0300, John Smith <ja...@gmail.com>,
>>>>>>>>>> wrote:
>>>>>>>>>>
>>>>>>>>>> Hi Dennis, so are you saying I should enable GC logs + the safe
>>>>>>>>>> point logs as well?
>>>>>>>>>>
>>>>>>>>>> On Thu, 10 Oct 2019 at 11:22, John Smith <ja...@gmail.com>
>>>>>>>>>> wrote:
>>>>>>>>>>
>>>>>>>>>>> You are correct, it is running in a VM.
>>>>>>>>>>>
>>>>>>>>>>> On Thu, 10 Oct 2019 at 10:11, Denis Mekhanikov <
>>>>>>>>>>> dmekhanikov@gmail.com> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> Hi!
>>>>>>>>>>>>
>>>>>>>>>>>> There are the following messages in the logs:
>>>>>>>>>>>>
>>>>>>>>>>>> [22:26:21,816][WARNING][jvm-pause-detector-worker][IgniteKernal%xxxxxx]
>>>>>>>>>>>> Possible too long JVM pause: *55705 milliseconds*.
>>>>>>>>>>>> ...
>>>>>>>>>>>> [22:26:21,847][SEVERE][ttl-cleanup-worker-#48%xxxxxx%][G]
>>>>>>>>>>>> Blocked system-critical thread has been detected. This can lead to
>>>>>>>>>>>> cluster-wide undefined behaviour [threadName=partition-exchanger,
>>>>>>>>>>>> blockedFor=*57s*]
>>>>>>>>>>>>
>>>>>>>>>>>> Looks like the JVM was paused for almost a minute. It doesn’t
>>>>>>>>>>>> seem to be caused by a garbage collection, since there is no evidence of GC
>>>>>>>>>>>> pressure in the GC log. Usually such big pauses happen in virtualised
>>>>>>>>>>>> environments when backups are captured from machines or they just don’t
>>>>>>>>>>>> have enough CPU time.
>>>>>>>>>>>>
>>>>>>>>>>>> Looking at safepoint statistics may also reveal some
>>>>>>>>>>>> interesting details. You can learn about safepoints here:
>>>>>>>>>>>> https://blog.gceasy.io/2016/12/22/total-time-for-which-application-threads-were-stopped/
>>>>>>>>>>>>
>>>>>>>>>>>> Denis
>>>>>>>>>>>> On 9 Oct 2019, 23:14 +0300, John Smith <ja...@gmail.com>,
>>>>>>>>>>>> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>> So the error says to set clientFailureDetectionTimeout=30000
>>>>>>>>>>>>
>>>>>>>>>>>> 1- Do I put a higher value than 30000?
>>>>>>>>>>>> 2- Do I do it on the client or the server nodes or all nodes?
>>>>>>>>>>>> 3- Also if a client is misbehaving why shut off the server node?
>>>>>>>>>>>>
>>>>>>>>>>>> On Thu, 3 Oct 2019 at 21:02, John Smith <ja...@gmail.com>
>>>>>>>>>>>> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> But if it's the client node that's failing why is the server
>>>>>>>>>>>>> node stopping? I'm pretty sure we do very simple put and get operations.
>>>>>>>>>>>>> All the client nodes are started as client=true
>>>>>>>>>>>>>
>>>>>>>>>>>>> On Thu., Oct. 3, 2019, 4:18 p.m. Denis Magda, <
>>>>>>>>>>>>> dmagda@apache.org> wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>>> Hi John,
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> I don't see any GC pressure or STW pauses either. If not GC
>>>>>>>>>>>>>> then it might have been caused by a network glitch or some long-running
>>>>>>>>>>>>>> operation started by the app. These log statements:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> [22:26:21,827][WARNING][tcp-disco-client-message-worker-#10%xxxxxx%][TcpDiscoverySpi]
>>>>>>>>>>>>>> Client node considered as unreachable and will be dropped from cluster,
>>>>>>>>>>>>>> because no metrics update messages received in interval:
>>>>>>>>>>>>>> TcpDiscoverySpi.clientFailureDetectionTimeout() ms. It may be caused by
>>>>>>>>>>>>>> network problems or long GC pause on client node, try to increase this
>>>>>>>>>>>>>> parameter. [nodeId=b07182d0-bf70-4318-9fe3-d7d5228bd6ef,
>>>>>>>>>>>>>> clientFailureDetectionTimeout=30000]
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> [22:26:21,839][WARNING][tcp-disco-client-message-worker-#12%xxxxxx%][TcpDiscoverySpi]
>>>>>>>>>>>>>> Client node considered as unreachable and will be dropped from cluster,
>>>>>>>>>>>>>> because no metrics update messages received in interval:
>>>>>>>>>>>>>> TcpDiscoverySpi.clientFailureDetectionTimeout() ms. It may be caused by
>>>>>>>>>>>>>> network problems or long GC pause on client node, try to increase this
>>>>>>>>>>>>>> parameter. [nodeId=302cff60-b88d-40da-9e12-b955e6bf973d,
>>>>>>>>>>>>>> clientFailureDetectionTimeout=30000]
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> [22:26:21,847][SEVERE][ttl-cleanup-worker-#48%xxxxxx%][G]
>>>>>>>>>>>>>> Blocked system-critical thread has been detected. This can lead to
>>>>>>>>>>>>>> cluster-wide undefined behaviour [threadName=partition-exchanger,
>>>>>>>>>>>>>> blockedFor=57s]
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> 22:26:21,954][SEVERE][ttl-cleanup-worker-#48%xxxxxx%][]
>>>>>>>>>>>>>> Critical system error detected. Will be handled accordingly to configured
>>>>>>>>>>>>>> handler [hnd=StopNodeOrHaltFailureHandler [tryStop=false, timeout=0,
>>>>>>>>>>>>>> super=AbstractFailureHandler
>>>>>>>>>>>>>> [ignoredFailureTypes=[SYSTEM_WORKER_BLOCKED]]], failureCtx=FailureContext
>>>>>>>>>>>>>> [type=SYSTEM_WORKER_BLOCKED, err=class o.a.i.IgniteException: GridWorker
>>>>>>>>>>>>>> [name=partition-exchanger, igniteInstanceName=xxxxxx, finished=false,
>>>>>>>>>>>>>> heartbeatTs=1568931981805]]]
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> -
>>>>>>>>>>>>>> Denis
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> On Thu, Oct 3, 2019 at 11:50 AM John Smith <
>>>>>>>>>>>>>> java.dev.mtl@gmail.com> wrote:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> So I have been monitoring my node and the same one seems to
>>>>>>>>>>>>>>> stop once in a while.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> https://www.dropbox.com/s/7n5qfsl5uyi1obt/ignite-logs.zip?dl=0
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> I have attached the GC logs and the ignite logs. From what I
>>>>>>>>>>>>>>> see from gc.logs I don't see big pauses. I could be wrong.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> The machine is 16GB and I have the configs here:
>>>>>>>>>>>>>>> https://www.dropbox.com/s/hkv38s3vce5a4sk/ignite-config.xml?dl=0
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Here are the JVM settings...
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> if [ -z "$JVM_OPTS" ] ; then
>>>>>>>>>>>>>>>     JVM_OPTS="-Xms2g -Xmx2g -server
>>>>>>>>>>>>>>> -XX:MaxMetaspaceSize=256m"
>>>>>>>>>>>>>>> fi
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> JVM_OPTS="$JVM_OPTS -XX:+UseG1GC -verbose:gc
>>>>>>>>>>>>>>> -XX:+PrintGCDetails -Xloggc:/var/log/apache-ignite/gc.log"
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> JVM_OPTS="${JVM_OPTS} -Xss16m"
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>

Re: Node stopped.

Posted by Ilya Kasnacheev <il...@gmail.com>.
Hello!

Sure, if nodes can't reach each other, eventually they may segment and stop.
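
A quick way to see exactly when that happens on a node is a local listener for the
segmentation event. This is only an illustrative sketch, not something from the thread:
the config file name is the one shared earlier, and EVT_NODE_SEGMENTED may also have to
be added to IgniteConfiguration#setIncludedEventTypes before the listener is ever called.

import org.apache.ignite.Ignite;
import org.apache.ignite.Ignition;
import org.apache.ignite.events.Event;
import org.apache.ignite.events.EventType;
import org.apache.ignite.lang.IgnitePredicate;

public class SegmentationLogger {
    public static void main(String[] args) {
        // Start a node from the XML config referenced earlier in the thread.
        Ignite ignite = Ignition.start("ignite-config.xml");

        // Called on this node when it decides it has been split off from the rest of the cluster.
        IgnitePredicate<Event> lsnr = evt -> {
            System.err.println("Local node segmented: " + evt);
            return true; // keep listening
        };

        ignite.events().localListen(lsnr, EventType.EVT_NODE_SEGMENTED);
    }
}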

Regards,
-- 
Ilya Kasnacheev


пт, 25 окт. 2019 г. в 00:08, John Smith <ja...@gmail.com>:

> Is it possible this is somehow causing the issue of the node stopping?
>
> On Thu, 24 Oct 2019 at 11:24, Ilya Kasnacheev <il...@gmail.com>
> wrote:
>
>> Hello!
>>
>> This likely means that you have reachability problems in your cluster,
>> such as, xxx.xxx.xxx.68 can connect to xxx.xxx.xxx.82 (on range
>> 47100-47200) but not the other way around.
>>
>> Regards,
>> --
>> Ilya Kasnacheev
>>
>>
>> пн, 21 окт. 2019 г. в 19:36, John Smith <ja...@gmail.com>:
>>
>>> I also see this printing every few seconds on my client application...
>>> org.apache.ignite.spi.communication.tcp.TcpCommunicationSpi Accepted
>>> incoming communication connection [locAddr=/xxx.xxx.xxx.68:47101,
>>> rmtAddr=/xxx.xxx.xxx.82:49816
>>>
>>> On Mon, 21 Oct 2019 at 12:04, John Smith <ja...@gmail.com> wrote:
>>>
>>>> Hi, thanks. I already made sure that each ignite VM runs on a separate
>>>> host within our cloud. I'm not doing any of that migration stuff.
>>>>
>>>> Also recently disabled metrics igniteConfig.setMetricsLogFrequency(0);
>>>> Just to make sure it doesn't get too chatty. But I doubt this would affect
>>>> it...
>>>>
>>>> Should I maybe set the node timeout a bit higher than 30 seconds, maybe
>>>> put it to 60 seconds? I remember people suggesting this but not sure...
>>>>
>>>> On Fri, 18 Oct 2019 at 06:09, Denis Mekhanikov <dm...@gmail.com>
>>>> wrote:
>>>>
>>>>> The following documentation page has some useful points on deployment
>>>>> in a virtualised environment:
>>>>> https://apacheignite.readme.io/docs/vmware-deployment
>>>>>
>>>>> Denis
>>>>> On 17 Oct 2019, 17:41 +0300, John Smith <ja...@gmail.com>,
>>>>> wrote:
>>>>>
>>>>> Ok I have Metricbeat running on the VM hopefully I will see something...
>>>>>
>>>>> On Thu, 17 Oct 2019 at 05:09, Denis Mekhanikov <dm...@gmail.com>
>>>>> wrote:
>>>>>
>>>>>> There are no long pauses in the GC logs, so it must be the whole VM
>>>>>> pause.
>>>>>>
>>>>>> Denis
>>>>>> On 16 Oct 2019, 23:07 +0300, John Smith <ja...@gmail.com>,
>>>>>> wrote:
>>>>>>
>>>>>> Sorry here is the gc logs for all 3 machines:
>>>>>> https://www.dropbox.com/s/chbbxigahd4v9di/gc-logs.zip?dl=0
>>>>>>
>>>>>> On Wed, 16 Oct 2019 at 15:49, John Smith <ja...@gmail.com>
>>>>>> wrote:
>>>>>>
>>>>>>> Hi, so it happened again here is my latest gc.log stats:
>>>>>>> https://gceasy.io/diamondgc-report.jsp?oTxnId_value=a215d573-d1cf-4d53-acf1-9001432bb28e
>>>>>>>
>>>>>>> Everything seems ok to me. I also have Elasticsearch Metricbeat
>>>>>>> running, the CPU usage looked normal at the time.
>>>>>>>
>>>>>>> On Thu, 10 Oct 2019 at 13:05, Denis Mekhanikov <
>>>>>>> dmekhanikov@gmail.com> wrote:
>>>>>>>
>>>>>>>> Unfortunately, I don’t.
>>>>>>>> You can ask the VM vendor or the cloud provider (if you use any)
>>>>>>>> for a proper tooling or logs.
>>>>>>>> Make sure, that there is no such step in the VM’s lifecycle that
>>>>>>>> makes it freeze for a minute.
>>>>>>>> Also make sure that the physical CPU is not overutilized and no VMs
>>>>>>>> that run on it are starving.
>>>>>>>>
>>>>>>>> Denis
>>>>>>>> On 10 Oct 2019, 19:03 +0300, John Smith <ja...@gmail.com>,
>>>>>>>> wrote:
>>>>>>>>
>>>>>>>> Do you know of any good tools I can use to check the VM?
>>>>>>>>
>>>>>>>> On Thu, 10 Oct 2019 at 11:38, Denis Mekhanikov <
>>>>>>>> dmekhanikov@gmail.com> wrote:
>>>>>>>>
>>>>>>>>> > Hi Dennis, so are you saying I should enable GC logs + the safe
>>>>>>>>> point logs as well?
>>>>>>>>>
>>>>>>>>> Having safepoint statistics in your GC logs may be useful, so I
>>>>>>>>> recommend enabling them for troubleshooting purposes.
>>>>>>>>> Check the lifecycle of your virtual machines. There is a high
>>>>>>>>> chance that the whole machine is frozen, not just the Ignite node.
>>>>>>>>>
>>>>>>>>> Denis
>>>>>>>>> On 10 Oct 2019, 18:25 +0300, John Smith <ja...@gmail.com>,
>>>>>>>>> wrote:
>>>>>>>>>
>>>>>>>>> Hi Dennis, so are you saying I should enable GC logs + the safe
>>>>>>>>> point logs as well?
>>>>>>>>>
>>>>>>>>> On Thu, 10 Oct 2019 at 11:22, John Smith <ja...@gmail.com>
>>>>>>>>> wrote:
>>>>>>>>>
>>>>>>>>>> You are correct, it is running in a VM.
>>>>>>>>>>
>>>>>>>>>> On Thu, 10 Oct 2019 at 10:11, Denis Mekhanikov <
>>>>>>>>>> dmekhanikov@gmail.com> wrote:
>>>>>>>>>>
>>>>>>>>>>> Hi!
>>>>>>>>>>>
>>>>>>>>>>> There are the following messages in the logs:
>>>>>>>>>>>
>>>>>>>>>>> [22:26:21,816][WARNING][jvm-pause-detector-worker][IgniteKernal%xxxxxx]
>>>>>>>>>>> Possible too long JVM pause: *55705 milliseconds*.
>>>>>>>>>>> ...
>>>>>>>>>>> [22:26:21,847][SEVERE][ttl-cleanup-worker-#48%xxxxxx%][G]
>>>>>>>>>>> Blocked system-critical thread has been detected. This can lead to
>>>>>>>>>>> cluster-wide undefined behaviour [threadName=partition-exchanger,
>>>>>>>>>>> blockedFor=*57s*]
>>>>>>>>>>>
>>>>>>>>>>> Looks like the JVM was paused for almost a minute. It doesn’t
>>>>>>>>>>> seem to be caused by a garbage collection, since there is no evidence of GC
>>>>>>>>>>> pressure in the GC log. Usually such big pauses happen in virtualised
>>>>>>>>>>> environments when backups are captured from machines or they just don’t
>>>>>>>>>>> have enough CPU time.
>>>>>>>>>>>
>>>>>>>>>>> Looking at safepoint statistics may also reveal some interesting
>>>>>>>>>>> details. You can learn about safepoints here:
>>>>>>>>>>> https://blog.gceasy.io/2016/12/22/total-time-for-which-application-threads-were-stopped/
>>>>>>>>>>>
>>>>>>>>>>> Denis
>>>>>>>>>>> On 9 Oct 2019, 23:14 +0300, John Smith <ja...@gmail.com>,
>>>>>>>>>>> wrote:
>>>>>>>>>>>
>>>>>>>>>>> So the error says to set clientFailureDetectionTimeout=30000
>>>>>>>>>>>
>>>>>>>>>>> 1- Do I put a higher value than 30000?
>>>>>>>>>>> 2- Do I do it on the client or the server nodes or all nodes?
>>>>>>>>>>> 3- Also if a client is misbehaving why shutoff the server node?
>>>>>>>>>>>
>>>>>>>>>>> On Thu, 3 Oct 2019 at 21:02, John Smith <ja...@gmail.com>
>>>>>>>>>>> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> But if it's the client node that's failing why is the server
>>>>>>>>>>>> node stopping? I'm pretty sure we do very simple put and get operations.
>>>>>>>>>>>> All the client nodes are started as client=true
>>>>>>>>>>>>
>>>>>>>>>>>> On Thu., Oct. 3, 2019, 4:18 p.m. Denis Magda, <
>>>>>>>>>>>> dmagda@apache.org> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> Hi John,
>>>>>>>>>>>>>
>>>>>>>>>>>>> I don't see any GC pressure or STW pauses either. If not GC
>>>>>>>>>>>>> then it might have been caused by a network glitch or some long-running
>>>>>>>>>>>>> operation started by the app. These log statements:
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>> [22:26:21,827][WARNING][tcp-disco-client-message-worker-#10%xxxxxx%][TcpDiscoverySpi]
>>>>>>>>>>>>> Client node considered as unreachable and will be dropped from cluster,
>>>>>>>>>>>>> because no metrics update messages received in interval:
>>>>>>>>>>>>> TcpDiscoverySpi.clientFailureDetectionTimeout() ms. It may be caused by
>>>>>>>>>>>>> network problems or long GC pause on client node, try to increase this
>>>>>>>>>>>>> parameter. [nodeId=b07182d0-bf70-4318-9fe3-d7d5228bd6ef,
>>>>>>>>>>>>> clientFailureDetectionTimeout=30000]
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>> [22:26:21,839][WARNING][tcp-disco-client-message-worker-#12%xxxxxx%][TcpDiscoverySpi]
>>>>>>>>>>>>> Client node considered as unreachable and will be dropped from cluster,
>>>>>>>>>>>>> because no metrics update messages received in interval:
>>>>>>>>>>>>> TcpDiscoverySpi.clientFailureDetectionTimeout() ms. It may be caused by
>>>>>>>>>>>>> network problems or long GC pause on client node, try to increase this
>>>>>>>>>>>>> parameter. [nodeId=302cff60-b88d-40da-9e12-b955e6bf973d,
>>>>>>>>>>>>> clientFailureDetectionTimeout=30000]
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>> [22:26:21,847][SEVERE][ttl-cleanup-worker-#48%xxxxxx%][G]
>>>>>>>>>>>>> Blocked system-critical thread has been detected. This can lead to
>>>>>>>>>>>>> cluster-wide undefined behaviour [threadName=partition-exchanger,
>>>>>>>>>>>>> blockedFor=57s]
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>> 22:26:21,954][SEVERE][ttl-cleanup-worker-#48%xxxxxx%][]
>>>>>>>>>>>>> Critical system error detected. Will be handled accordingly to configured
>>>>>>>>>>>>> handler [hnd=StopNodeOrHaltFailureHandler [tryStop=false, timeout=0,
>>>>>>>>>>>>> super=AbstractFailureHandler
>>>>>>>>>>>>> [ignoredFailureTypes=[SYSTEM_WORKER_BLOCKED]]], failureCtx=FailureContext
>>>>>>>>>>>>> [type=SYSTEM_WORKER_BLOCKED, err=class o.a.i.IgniteException: GridWorker
>>>>>>>>>>>>> [name=partition-exchanger, igniteInstanceName=xxxxxx, finished=false,
>>>>>>>>>>>>> heartbeatTs=1568931981805]]]
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>> -
>>>>>>>>>>>>> Denis
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>> On Thu, Oct 3, 2019 at 11:50 AM John Smith <
>>>>>>>>>>>>> java.dev.mtl@gmail.com> wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>>> So I have been monitoring my node and the same one seems to
>>>>>>>>>>>>>> stop once in a while.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> https://www.dropbox.com/s/7n5qfsl5uyi1obt/ignite-logs.zip?dl=0
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> I have attached the GC logs and the ignite logs. From what I
>>>>>>>>>>>>>> see from gc.logs I don't see big pauses. I could be wrong.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> The machine is 16GB and I have the configs here:
>>>>>>>>>>>>>> https://www.dropbox.com/s/hkv38s3vce5a4sk/ignite-config.xml?dl=0
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Here are the JVM settings...
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> if [ -z "$JVM_OPTS" ] ; then
>>>>>>>>>>>>>>     JVM_OPTS="-Xms2g -Xmx2g -server -XX:MaxMetaspaceSize=256m"
>>>>>>>>>>>>>> fi
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> JVM_OPTS="$JVM_OPTS -XX:+UseG1GC -verbose:gc
>>>>>>>>>>>>>> -XX:+PrintGCDetails -Xloggc:/var/log/apache-ignite/gc.log"
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> JVM_OPTS="${JVM_OPTS} -Xss16m"
>>>>>>>>>>>>>>
>>>>>>>>>>>>>

Re: Node stopped.

Posted by John Smith <ja...@gmail.com>.
Is it possible this is somehow causing the issue of the node stopping?

On Thu, 24 Oct 2019 at 11:24, Ilya Kasnacheev <il...@gmail.com>
wrote:

> Hello!
>
> This likely means that you have reachability problems in your cluster,
> such as, xxx.xxx.xxx.68 can connect to xxx.xxx.xxx.82 (on range
> 47100-47200) but not the other way around.
>
> Regards,
> --
> Ilya Kasnacheev
>
>
> пн, 21 окт. 2019 г. в 19:36, John Smith <ja...@gmail.com>:
>
>> I also see this printing every few seconds on my client application...
>> org.apache.ignite.spi.communication.tcp.TcpCommunicationSpi Accepted
>> incoming communication connection [locAddr=/xxx.xxx.xxx.68:47101,
>> rmtAddr=/xxx.xxx.xxx.82:49816
>>
>> On Mon, 21 Oct 2019 at 12:04, John Smith <ja...@gmail.com> wrote:
>>
>>> Hi, thanks. I already made sure that each ignite VM runs on a separate
>>> host within our cloud. I'm not doing any of that migration stuff.
>>>
>>> Also recently disabled metrics igniteConfig.setMetricsLogFrequency(0);
>>> Just to make sure it doesn't get too chatty. But I doubt this would affect
>>> it...
>>>
>>> Should I maybe set the node timeout a bit higher than 30 seconds, maybe
>>> put it to 60 seconds? I remember people suggesting this but not sure...
>>>
>>> On Fri, 18 Oct 2019 at 06:09, Denis Mekhanikov <dm...@gmail.com>
>>> wrote:
>>>
>>>> The following documentation page has some useful points on deployment
>>>> in a virtualised environment:
>>>> https://apacheignite.readme.io/docs/vmware-deployment
>>>>
>>>> Denis
>>>> On 17 Oct 2019, 17:41 +0300, John Smith <ja...@gmail.com>,
>>>> wrote:
>>>>
>>>> Ok I have Metricbeat running on the VM hopefully I will see something...
>>>>
>>>> On Thu, 17 Oct 2019 at 05:09, Denis Mekhanikov <dm...@gmail.com>
>>>> wrote:
>>>>
>>>>> There are no long pauses in the GC logs, so it must be the whole VM
>>>>> pause.
>>>>>
>>>>> Denis
>>>>> On 16 Oct 2019, 23:07 +0300, John Smith <ja...@gmail.com>,
>>>>> wrote:
>>>>>
>>>>> Sorry here is the gc logs for all 3 machines:
>>>>> https://www.dropbox.com/s/chbbxigahd4v9di/gc-logs.zip?dl=0
>>>>>
>>>>> On Wed, 16 Oct 2019 at 15:49, John Smith <ja...@gmail.com>
>>>>> wrote:
>>>>>
>>>>>> Hi, so it happened again here is my latest gc.log stats:
>>>>>> https://gceasy.io/diamondgc-report.jsp?oTxnId_value=a215d573-d1cf-4d53-acf1-9001432bb28e
>>>>>>
>>>>>> Everything seems ok to me. I also have Elasticsearch Metricbeat
>>>>>> running, the CPU usage looked normal at the time.
>>>>>>
>>>>>> On Thu, 10 Oct 2019 at 13:05, Denis Mekhanikov <dm...@gmail.com>
>>>>>> wrote:
>>>>>>
>>>>>>> Unfortunately, I don’t.
>>>>>>> You can ask the VM vendor or the cloud provider (if you use any) for
>>>>>>> a proper tooling or logs.
>>>>>>> Make sure, that there is no such step in the VM’s lifecycle that
>>>>>>> makes it freeze for a minute.
>>>>>>> Also make sure that the physical CPU is not overutilized and no VMs
>>>>>>> that run on it are starving.
>>>>>>>
>>>>>>> Denis
>>>>>>> On 10 Oct 2019, 19:03 +0300, John Smith <ja...@gmail.com>,
>>>>>>> wrote:
>>>>>>>
>>>>>>> Do you know of any good tools I can use to check the VM?
>>>>>>>
>>>>>>> On Thu, 10 Oct 2019 at 11:38, Denis Mekhanikov <
>>>>>>> dmekhanikov@gmail.com> wrote:
>>>>>>>
>>>>>>>> > Hi Dennis, so are you saying I should enable GC logs + the safe
>>>>>>>> point logs as well?
>>>>>>>>
>>>>>>>> Having safepoint statistics in your GC logs may be useful, so I
>>>>>>>> recommend enabling them for troubleshooting purposes.
>>>>>>>> Check the lifecycle of your virtual machines. There is a high
>>>>>>>> chance that the whole machine is frozen, not just the Ignite node.
>>>>>>>>
>>>>>>>> Denis
>>>>>>>> On 10 Oct 2019, 18:25 +0300, John Smith <ja...@gmail.com>,
>>>>>>>> wrote:
>>>>>>>>
>>>>>>>> Hi Dennis, so are you saying I should enable GC logs + the safe
>>>>>>>> point logs as well?
>>>>>>>>
>>>>>>>> On Thu, 10 Oct 2019 at 11:22, John Smith <ja...@gmail.com>
>>>>>>>> wrote:
>>>>>>>>
>>>>>>>>> You are correct, it is running in a VM.
>>>>>>>>>
>>>>>>>>> On Thu, 10 Oct 2019 at 10:11, Denis Mekhanikov <
>>>>>>>>> dmekhanikov@gmail.com> wrote:
>>>>>>>>>
>>>>>>>>>> Hi!
>>>>>>>>>>
>>>>>>>>>> There are the following messages in the logs:
>>>>>>>>>>
>>>>>>>>>> [22:26:21,816][WARNING][jvm-pause-detector-worker][IgniteKernal%xxxxxx]
>>>>>>>>>> Possible too long JVM pause: *55705 milliseconds*.
>>>>>>>>>> ...
>>>>>>>>>> [22:26:21,847][SEVERE][ttl-cleanup-worker-#48%xxxxxx%][G] Blocked
>>>>>>>>>> system-critical thread has been detected. This can lead to cluster-wide
>>>>>>>>>> undefined behaviour [threadName=partition-exchanger, blockedFor=
>>>>>>>>>> *57s*]
>>>>>>>>>>
>>>>>>>>>> Looks like the JVM was paused for almost a minute. It doesn’t
>>>>>>>>>> seem to be caused by a garbage collection, since there is no evidence of GC
>>>>>>>>>> pressure in the GC log. Usually such big pauses happen in virtualised
>>>>>>>>>> environments when backups are captured from machines or they just don’t
>>>>>>>>>> have enough CPU time.
>>>>>>>>>>
>>>>>>>>>> Looking at safepoint statistics may also reveal some interesting
>>>>>>>>>> details. You can learn about safepoints here:
>>>>>>>>>> https://blog.gceasy.io/2016/12/22/total-time-for-which-application-threads-were-stopped/
>>>>>>>>>>
>>>>>>>>>> Denis
>>>>>>>>>> On 9 Oct 2019, 23:14 +0300, John Smith <ja...@gmail.com>,
>>>>>>>>>> wrote:
>>>>>>>>>>
>>>>>>>>>> So the error says to set clientFailureDetectionTimeout=30000
>>>>>>>>>>
>>>>>>>>>> 1- Do I put a higher value than 30000?
>>>>>>>>>> 2- Do I do it on the client or the server nodes or all nodes?
>>>>>>>>>> 3- Also if a client is misbehaving why shutoff the server node?
>>>>>>>>>>
>>>>>>>>>> On Thu, 3 Oct 2019 at 21:02, John Smith <ja...@gmail.com>
>>>>>>>>>> wrote:
>>>>>>>>>>
>>>>>>>>>>> But if it's the client node that's failing why is the server
>>>>>>>>>>> node stopping? I'm pretty sure we do very simple put and get operations.
>>>>>>>>>>> All the client nodes are started as client=true
>>>>>>>>>>>
>>>>>>>>>>> On Thu., Oct. 3, 2019, 4:18 p.m. Denis Magda, <dm...@apache.org>
>>>>>>>>>>> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> Hi John,
>>>>>>>>>>>>
>>>>>>>>>>>> I don't see any GC pressure or STW pauses either. If not GC
>>>>>>>>>>>> then it might have been caused by a network glitch or some long-running
>>>>>>>>>>>> operation started by the app. These log statements:
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> [22:26:21,827][WARNING][tcp-disco-client-message-worker-#10%xxxxxx%][TcpDiscoverySpi]
>>>>>>>>>>>> Client node considered as unreachable and will be dropped from cluster,
>>>>>>>>>>>> because no metrics update messages received in interval:
>>>>>>>>>>>> TcpDiscoverySpi.clientFailureDetectionTimeout() ms. It may be caused by
>>>>>>>>>>>> network problems or long GC pause on client node, try to increase this
>>>>>>>>>>>> parameter. [nodeId=b07182d0-bf70-4318-9fe3-d7d5228bd6ef,
>>>>>>>>>>>> clientFailureDetectionTimeout=30000]
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> [22:26:21,839][WARNING][tcp-disco-client-message-worker-#12%xxxxxx%][TcpDiscoverySpi]
>>>>>>>>>>>> Client node considered as unreachable and will be dropped from cluster,
>>>>>>>>>>>> because no metrics update messages received in interval:
>>>>>>>>>>>> TcpDiscoverySpi.clientFailureDetectionTimeout() ms. It may be caused by
>>>>>>>>>>>> network problems or long GC pause on client node, try to increase this
>>>>>>>>>>>> parameter. [nodeId=302cff60-b88d-40da-9e12-b955e6bf973d,
>>>>>>>>>>>> clientFailureDetectionTimeout=30000]
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> [22:26:21,847][SEVERE][ttl-cleanup-worker-#48%xxxxxx%][G]
>>>>>>>>>>>> Blocked system-critical thread has been detected. This can lead to
>>>>>>>>>>>> cluster-wide undefined behaviour [threadName=partition-exchanger,
>>>>>>>>>>>> blockedFor=57s]
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> 22:26:21,954][SEVERE][ttl-cleanup-worker-#48%xxxxxx%][]
>>>>>>>>>>>> Critical system error detected. Will be handled accordingly to configured
>>>>>>>>>>>> handler [hnd=StopNodeOrHaltFailureHandler [tryStop=false, timeout=0,
>>>>>>>>>>>> super=AbstractFailureHandler
>>>>>>>>>>>> [ignoredFailureTypes=[SYSTEM_WORKER_BLOCKED]]], failureCtx=FailureContext
>>>>>>>>>>>> [type=SYSTEM_WORKER_BLOCKED, err=class o.a.i.IgniteException: GridWorker
>>>>>>>>>>>> [name=partition-exchanger, igniteInstanceName=xxxxxx, finished=false,
>>>>>>>>>>>> heartbeatTs=1568931981805]]]
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> -
>>>>>>>>>>>> Denis
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> On Thu, Oct 3, 2019 at 11:50 AM John Smith <
>>>>>>>>>>>> java.dev.mtl@gmail.com> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> So I have been monitoring my node and the same one seems to
>>>>>>>>>>>>> stop once in a while.
>>>>>>>>>>>>>
>>>>>>>>>>>>> https://www.dropbox.com/s/7n5qfsl5uyi1obt/ignite-logs.zip?dl=0
>>>>>>>>>>>>>
>>>>>>>>>>>>> I have attached the GC logs and the ignite logs. From what I
>>>>>>>>>>>>> see from gc.logs I don't see big pauses. I could be wrong.
>>>>>>>>>>>>>
>>>>>>>>>>>>> The machine is 16GB and I have the configs here:
>>>>>>>>>>>>> https://www.dropbox.com/s/hkv38s3vce5a4sk/ignite-config.xml?dl=0
>>>>>>>>>>>>>
>>>>>>>>>>>>> Here are the JVM settings...
>>>>>>>>>>>>>
>>>>>>>>>>>>> if [ -z "$JVM_OPTS" ] ; then
>>>>>>>>>>>>>     JVM_OPTS="-Xms2g -Xmx2g -server -XX:MaxMetaspaceSize=256m"
>>>>>>>>>>>>> fi
>>>>>>>>>>>>>
>>>>>>>>>>>>> JVM_OPTS="$JVM_OPTS -XX:+UseG1GC -verbose:gc
>>>>>>>>>>>>> -XX:+PrintGCDetails -Xloggc:/var/log/apache-ignite/gc.log"
>>>>>>>>>>>>>
>>>>>>>>>>>>> JVM_OPTS="${JVM_OPTS} -Xss16m"
>>>>>>>>>>>>>
>>>>>>>>>>>>

Re: Node stopped.

Posted by Ilya Kasnacheev <il...@gmail.com>.
Hello!

This likely means that you have reachability problems in your cluster, such
as, xxx.xxx.xxx.68 can connect to xxx.xxx.xxx.82 (on range 47100-47200) but
not the other way around.
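
One way to confirm which direction is broken is to run a small standalone probe once from
each host against the other, so both sides of the 47100-47200 communication range get
checked. This is just a plain TCP connect with a timeout, nothing Ignite-specific, and the
host/port arguments below are only an example:

import java.net.InetSocketAddress;
import java.net.Socket;

public class PortProbe {
    public static void main(String[] args) throws Exception {
        String host = args[0];                  // e.g. xxx.xxx.xxx.82
        int from = Integer.parseInt(args[1]);   // e.g. 47100
        int to = Integer.parseInt(args[2]);     // e.g. 47200

        for (int port = from; port <= to; port++) {
            try (Socket sock = new Socket()) {
                // 3 second connect timeout per port.
                sock.connect(new InetSocketAddress(host, port), 3000);
                System.out.println(port + " reachable");
            }
            catch (Exception e) {
                System.out.println(port + " NOT reachable: " + e.getMessage());
            }
        }
    }
}

A fast "connection refused" on most ports is normal (nothing is listening there); timeouts
across the whole range are what usually point to a firewall or routing problem.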

Regards,
-- 
Ilya Kasnacheev


пн, 21 окт. 2019 г. в 19:36, John Smith <ja...@gmail.com>:

> I also see this printing every few seconds on my client application...
> org.apache.ignite.spi.communication.tcp.TcpCommunicationSpi Accepted
> incoming communication connection [locAddr=/xxx.xxx.xxx.68:47101,
> rmtAddr=/xxx.xxx.xxx.82:49816
>
> On Mon, 21 Oct 2019 at 12:04, John Smith <ja...@gmail.com> wrote:
>
>> Hi, thanks. I already made sure that each ignite VM runs on a separate
>> host within our cloud. I'm not doing any of that migration stuff.
>>
>> Also recently disabled metrics igniteConfig.setMetricsLogFrequency(0);
>> Just to make sure it doesn't get too chatty. But I doubt this would affect
>> it...
>>
>> Should I maybe set the node timeout a bit higher than 30 seconds, maybe
>> put it to 60 seconds? I remember people suggesting this but not sure...
>>
>> On Fri, 18 Oct 2019 at 06:09, Denis Mekhanikov <dm...@gmail.com>
>> wrote:
>>
>>> The following documentation page has some useful points on deployment in
>>> a virtualised environment:
>>> https://apacheignite.readme.io/docs/vmware-deployment
>>>
>>> Denis
>>> On 17 Oct 2019, 17:41 +0300, John Smith <ja...@gmail.com>, wrote:
>>>
>>> Ok I have Metricbeat running on the VM hopefully I will see something...
>>>
>>> On Thu, 17 Oct 2019 at 05:09, Denis Mekhanikov <dm...@gmail.com>
>>> wrote:
>>>
>>>> There are no long pauses in the GC logs, so it must be the whole VM
>>>> pause.
>>>>
>>>> Denis
>>>> On 16 Oct 2019, 23:07 +0300, John Smith <ja...@gmail.com>,
>>>> wrote:
>>>>
>>>> Sorry here is the gc logs for all 3 machines:
>>>> https://www.dropbox.com/s/chbbxigahd4v9di/gc-logs.zip?dl=0
>>>>
>>>> On Wed, 16 Oct 2019 at 15:49, John Smith <ja...@gmail.com>
>>>> wrote:
>>>>
>>>>> Hi, so it happened again here is my latest gc.log stats:
>>>>> https://gceasy.io/diamondgc-report.jsp?oTxnId_value=a215d573-d1cf-4d53-acf1-9001432bb28e
>>>>>
>>>>> Everything seems ok to me. I also have Elasticsearch Metricbeat
>>>>> running, the CPU usage looked normal at the time.
>>>>>
>>>>> On Thu, 10 Oct 2019 at 13:05, Denis Mekhanikov <dm...@gmail.com>
>>>>> wrote:
>>>>>
>>>>>> Unfortunately, I don’t.
>>>>>> You can ask the VM vendor or the cloud provider (if you use any) for
>>>>>> a proper tooling or logs.
>>>>>> Make sure, that there is no such step in the VM’s lifecycle that
>>>>>> makes it freeze for a minute.
>>>>>> Also make sure that the physical CPU is not overutilized and no VMs
>>>>>> that run on it are starving.
>>>>>>
>>>>>> Denis
>>>>>> On 10 Oct 2019, 19:03 +0300, John Smith <ja...@gmail.com>,
>>>>>> wrote:
>>>>>>
>>>>>> Do you know of any good tools I can use to check the VM?
>>>>>>
>>>>>> On Thu, 10 Oct 2019 at 11:38, Denis Mekhanikov <dm...@gmail.com>
>>>>>> wrote:
>>>>>>
>>>>>>> > Hi Dennis, so are you saying I should enable GC logs + the safe
>>>>>>> point logs as well?
>>>>>>>
>>>>>>> Having safepoint statistics in your GC logs may be useful, so I
>>>>>>> recommend enabling them for troubleshooting purposes.
>>>>>>> Check the lifecycle of your virtual machines. There is a high chance
>>>>>>> that the whole machine is frozen, not just the Ignite node.
>>>>>>>
>>>>>>> Denis
>>>>>>> On 10 Oct 2019, 18:25 +0300, John Smith <ja...@gmail.com>,
>>>>>>> wrote:
>>>>>>>
>>>>>>> Hi Dennis, so are you saying I should enable GC logs + the safe
>>>>>>> point logs as well?
>>>>>>>
>>>>>>> On Thu, 10 Oct 2019 at 11:22, John Smith <ja...@gmail.com>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> You are correct, it is running in a VM.
>>>>>>>>
>>>>>>>> On Thu, 10 Oct 2019 at 10:11, Denis Mekhanikov <
>>>>>>>> dmekhanikov@gmail.com> wrote:
>>>>>>>>
>>>>>>>>> Hi!
>>>>>>>>>
>>>>>>>>> There are the following messages in the logs:
>>>>>>>>>
>>>>>>>>> [22:26:21,816][WARNING][jvm-pause-detector-worker][IgniteKernal%xxxxxx]
>>>>>>>>> Possible too long JVM pause: *55705 milliseconds*.
>>>>>>>>> ...
>>>>>>>>> [22:26:21,847][SEVERE][ttl-cleanup-worker-#48%xxxxxx%][G] Blocked
>>>>>>>>> system-critical thread has been detected. This can lead to cluster-wide
>>>>>>>>> undefined behaviour [threadName=partition-exchanger, blockedFor=
>>>>>>>>> *57s*]
>>>>>>>>>
>>>>>>>>> Looks like the JVM was paused for almost a minute. It doesn’t seem
>>>>>>>>> to be caused by a garbage collection, since there is no evidence of GC
>>>>>>>>> pressure in the GC log. Usually such big pauses happen in virtualised
>>>>>>>>> environments when backups are captured from machines or they just don’t
>>>>>>>>> have enough CPU time.
>>>>>>>>>
>>>>>>>>> Looking at safepoint statistics may also reveal some interesting
>>>>>>>>> details. You can learn about safepoints here:
>>>>>>>>> https://blog.gceasy.io/2016/12/22/total-time-for-which-application-threads-were-stopped/
>>>>>>>>>
>>>>>>>>> Denis
>>>>>>>>> On 9 Oct 2019, 23:14 +0300, John Smith <ja...@gmail.com>,
>>>>>>>>> wrote:
>>>>>>>>>
>>>>>>>>> So the error says to set clientFailureDetectionTimeout=30000
>>>>>>>>>
>>>>>>>>> 1- Do I put a higher value than 30000?
>>>>>>>>> 2- Do I do it on the client or the server nodes or all nodes?
>>>>>>>>> 3- Also if a client is misbehaving why shutoff the server node?
>>>>>>>>>
>>>>>>>>> On Thu, 3 Oct 2019 at 21:02, John Smith <ja...@gmail.com>
>>>>>>>>> wrote:
>>>>>>>>>
>>>>>>>>>> But if it's the client node that's failing why is the server node
>>>>>>>>>> stopping? I'm pretty sure we do very simple put and get operations. All
>>>>>>>>>> the client nodes are started as client=true
>>>>>>>>>>
>>>>>>>>>> On Thu., Oct. 3, 2019, 4:18 p.m. Denis Magda, <dm...@apache.org>
>>>>>>>>>> wrote:
>>>>>>>>>>
>>>>>>>>>>> Hi John,
>>>>>>>>>>>
>>>>>>>>>>> I don't see any GC pressure or STW pauses either. If not GC then
>>>>>>>>>>> it might have been caused by a network glitch or some long-running
>>>>>>>>>>> operation started by the app. These log statements:
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> [22:26:21,827][WARNING][tcp-disco-client-message-worker-#10%xxxxxx%][TcpDiscoverySpi]
>>>>>>>>>>> Client node considered as unreachable and will be dropped from cluster,
>>>>>>>>>>> because no metrics update messages received in interval:
>>>>>>>>>>> TcpDiscoverySpi.clientFailureDetectionTimeout() ms. It may be caused by
>>>>>>>>>>> network problems or long GC pause on client node, try to increase this
>>>>>>>>>>> parameter. [nodeId=b07182d0-bf70-4318-9fe3-d7d5228bd6ef,
>>>>>>>>>>> clientFailureDetectionTimeout=30000]
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> [22:26:21,839][WARNING][tcp-disco-client-message-worker-#12%xxxxxx%][TcpDiscoverySpi]
>>>>>>>>>>> Client node considered as unreachable and will be dropped from cluster,
>>>>>>>>>>> because no metrics update messages received in interval:
>>>>>>>>>>> TcpDiscoverySpi.clientFailureDetectionTimeout() ms. It may be caused by
>>>>>>>>>>> network problems or long GC pause on client node, try to increase this
>>>>>>>>>>> parameter. [nodeId=302cff60-b88d-40da-9e12-b955e6bf973d,
>>>>>>>>>>> clientFailureDetectionTimeout=30000]
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> [22:26:21,847][SEVERE][ttl-cleanup-worker-#48%xxxxxx%][G]
>>>>>>>>>>> Blocked system-critical thread has been detected. This can lead to
>>>>>>>>>>> cluster-wide undefined behaviour [threadName=partition-exchanger,
>>>>>>>>>>> blockedFor=57s]
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> 22:26:21,954][SEVERE][ttl-cleanup-worker-#48%xxxxxx%][] Critical
>>>>>>>>>>> system error detected. Will be handled accordingly to configured handler
>>>>>>>>>>> [hnd=StopNodeOrHaltFailureHandler [tryStop=false, timeout=0,
>>>>>>>>>>> super=AbstractFailureHandler
>>>>>>>>>>> [ignoredFailureTypes=[SYSTEM_WORKER_BLOCKED]]], failureCtx=FailureContext
>>>>>>>>>>> [type=SYSTEM_WORKER_BLOCKED, err=class o.a.i.IgniteException: GridWorker
>>>>>>>>>>> [name=partition-exchanger, igniteInstanceName=xxxxxx, finished=false,
>>>>>>>>>>> heartbeatTs=1568931981805]]]
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> -
>>>>>>>>>>> Denis
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> On Thu, Oct 3, 2019 at 11:50 AM John Smith <
>>>>>>>>>>> java.dev.mtl@gmail.com> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> So I have been monitoring my node and the same one seems to
>>>>>>>>>>>> stop once in a while.
>>>>>>>>>>>>
>>>>>>>>>>>> https://www.dropbox.com/s/7n5qfsl5uyi1obt/ignite-logs.zip?dl=0
>>>>>>>>>>>>
>>>>>>>>>>>> I have attached the GC logs and the ignite logs. From what I
>>>>>>>>>>>> see from gc.logs I don't see big pauses. I could be wrong.
>>>>>>>>>>>>
>>>>>>>>>>>> The machine is 16GB and I have the configs here:
>>>>>>>>>>>> https://www.dropbox.com/s/hkv38s3vce5a4sk/ignite-config.xml?dl=0
>>>>>>>>>>>>
>>>>>>>>>>>> Here are the JVM settings...
>>>>>>>>>>>>
>>>>>>>>>>>> if [ -z "$JVM_OPTS" ] ; then
>>>>>>>>>>>>     JVM_OPTS="-Xms2g -Xmx2g -server -XX:MaxMetaspaceSize=256m"
>>>>>>>>>>>> fi
>>>>>>>>>>>>
>>>>>>>>>>>> JVM_OPTS="$JVM_OPTS -XX:+UseG1GC -verbose:gc
>>>>>>>>>>>> -XX:+PrintGCDetails -Xloggc:/var/log/apache-ignite/gc.log"
>>>>>>>>>>>>
>>>>>>>>>>>> JVM_OPTS="${JVM_OPTS} -Xss16m"
>>>>>>>>>>>>
>>>>>>>>>>>

Re: Node stopped.

Posted by John Smith <ja...@gmail.com>.
I also see this printing every few seconds on my client application...
org.apache.ignite.spi.communication.tcp.TcpCommunicationSpi Accepted
incoming communication connection [locAddr=/xxx.xxx.xxx.68:47101,
rmtAddr=/xxx.xxx.xxx.82:49816

On Mon, 21 Oct 2019 at 12:04, John Smith <ja...@gmail.com> wrote:

> Hi, thanks. I already made sure that each ignite VM runs on a separate
> host within our cloud. I'm not doing any of that migration stuff.
>
> Also recently disabled metrics igniteConfig.setMetricsLogFrequency(0);
> Just to make sure it doesn't get too chatty. But I doubt this would affect
> it...
>
> Should I maybe set the node timeout a bit higher than 30 seconds, maybe put
> it to 60 seconds? I remember people suggesting this but not sure...
>
> On Fri, 18 Oct 2019 at 06:09, Denis Mekhanikov <dm...@gmail.com>
> wrote:
>
>> The following documentation page has some useful points on deployment in
>> a virtualised environment:
>> https://apacheignite.readme.io/docs/vmware-deployment
>>
>> Denis
>> On 17 Oct 2019, 17:41 +0300, John Smith <ja...@gmail.com>, wrote:
>>
>> Ok I have Metricbeat running on the VM hopefully I will see something...
>>
>> On Thu, 17 Oct 2019 at 05:09, Denis Mekhanikov <dm...@gmail.com>
>> wrote:
>>
>>> There are no long pauses in the GC logs, so it must be the whole VM
>>> pause.
>>>
>>> Denis
>>> On 16 Oct 2019, 23:07 +0300, John Smith <ja...@gmail.com>, wrote:
>>>
>>> Sorry here is the gc logs for all 3 machines:
>>> https://www.dropbox.com/s/chbbxigahd4v9di/gc-logs.zip?dl=0
>>>
>>> On Wed, 16 Oct 2019 at 15:49, John Smith <ja...@gmail.com> wrote:
>>>
>>>> Hi, so it happened again here is my latest gc.log stats:
>>>> https://gceasy.io/diamondgc-report.jsp?oTxnId_value=a215d573-d1cf-4d53-acf1-9001432bb28e
>>>>
>>>> Everything seems ok to me. I also have Elasticsearch Metricbeat
>>>> running, the CPU usage looked normal at the time.
>>>>
>>>> On Thu, 10 Oct 2019 at 13:05, Denis Mekhanikov <dm...@gmail.com>
>>>> wrote:
>>>>
>>>>> Unfortunately, I don’t.
>>>>> You can ask the VM vendor or the cloud provider (if you use any) for a
>>>>> proper tooling or logs.
>>>>> Make sure, that there is no such step in the VM’s lifecycle that makes
>>>>> it freeze for a minute.
>>>>> Also make sure that the physical CPU is not overutilized and no VMs
>>>>> that run on it are starving.
>>>>>
>>>>> Denis
>>>>> On 10 Oct 2019, 19:03 +0300, John Smith <ja...@gmail.com>,
>>>>> wrote:
>>>>>
>>>>> Do you know of any good tools I can use to check the VM?
>>>>>
>>>>> On Thu, 10 Oct 2019 at 11:38, Denis Mekhanikov <dm...@gmail.com>
>>>>> wrote:
>>>>>
>>>>>> > Hi Dennis, so are you saying I should enable GC logs + the safe
>>>>>> point logs as well?
>>>>>>
>>>>>> Having safepoint statistics in your GC logs may be useful, so I
>>>>>> recommend enabling them for troubleshooting purposes.
>>>>>> Check the lifecycle of your virtual machines. There is a high chance
>>>>>> that the whole machine is frozen, not just the Ignite node.
>>>>>>
>>>>>> Denis
>>>>>> On 10 Oct 2019, 18:25 +0300, John Smith <ja...@gmail.com>,
>>>>>> wrote:
>>>>>>
>>>>>> Hi Dennis, so are you saying I should enable GC logs + the safe point
>>>>>> logs as well?
>>>>>>
>>>>>> On Thu, 10 Oct 2019 at 11:22, John Smith <ja...@gmail.com>
>>>>>> wrote:
>>>>>>
>>>>>>> You are correct, it is running in a VM.
>>>>>>>
>>>>>>> On Thu, 10 Oct 2019 at 10:11, Denis Mekhanikov <
>>>>>>> dmekhanikov@gmail.com> wrote:
>>>>>>>
>>>>>>>> Hi!
>>>>>>>>
>>>>>>>> There are the following messages in the logs:
>>>>>>>>
>>>>>>>> [22:26:21,816][WARNING][jvm-pause-detector-worker][IgniteKernal%xxxxxx]
>>>>>>>> Possible too long JVM pause: *55705 milliseconds*.
>>>>>>>> ...
>>>>>>>> [22:26:21,847][SEVERE][ttl-cleanup-worker-#48%xxxxxx%][G] Blocked
>>>>>>>> system-critical thread has been detected. This can lead to cluster-wide
>>>>>>>> undefined behaviour [threadName=partition-exchanger, blockedFor=
>>>>>>>> *57s*]
>>>>>>>>
>>>>>>>> Looks like the JVM was paused for almost a minute. It doesn’t seem
>>>>>>>> to be caused by a garbage collection, since there is no evidence of GC
>>>>>>>> pressure in the GC log. Usually such big pauses happen in virtualised
>>>>>>>> environments when backups are captured from machines or they just don’t
>>>>>>>> have enough CPU time.
>>>>>>>>
>>>>>>>> Looking at safepoint statistics may also reveal some interesting
>>>>>>>> details. You can learn about safepoints here:
>>>>>>>> https://blog.gceasy.io/2016/12/22/total-time-for-which-application-threads-were-stopped/
>>>>>>>>
>>>>>>>> Denis
>>>>>>>> On 9 Oct 2019, 23:14 +0300, John Smith <ja...@gmail.com>,
>>>>>>>> wrote:
>>>>>>>>
>>>>>>>> So the error says to set clientFailureDetectionTimeout=30000
>>>>>>>>
>>>>>>>> 1- Do I put a higher value than 30000?
>>>>>>>> 2- Do I do it on the client or the server nodes or all nodes?
>>>>>>>> 3- Also if a client is misbehaving why shutoff the server node?
>>>>>>>>
>>>>>>>> On Thu, 3 Oct 2019 at 21:02, John Smith <ja...@gmail.com>
>>>>>>>> wrote:
>>>>>>>>
>>>>>>>>> But if it's the client node that's failing why is the server node
>>>>>>>>> stopping? I'm pretty sure we do very simple put and get operations. All
>>>>>>>>> the client nodes are started as client=true
>>>>>>>>>
>>>>>>>>> On Thu., Oct. 3, 2019, 4:18 p.m. Denis Magda, <dm...@apache.org>
>>>>>>>>> wrote:
>>>>>>>>>
>>>>>>>>>> Hi John,
>>>>>>>>>>
>>>>>>>>>> I don't see any GC pressure or STW pauses either. If not GC then
>>>>>>>>>> it might have been caused by a network glitch or some long-running
>>>>>>>>>> operation started by the app. These log statements:
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> [22:26:21,827][WARNING][tcp-disco-client-message-worker-#10%xxxxxx%][TcpDiscoverySpi]
>>>>>>>>>> Client node considered as unreachable and will be dropped from cluster,
>>>>>>>>>> because no metrics update messages received in interval:
>>>>>>>>>> TcpDiscoverySpi.clientFailureDetectionTimeout() ms. It may be caused by
>>>>>>>>>> network problems or long GC pause on client node, try to increase this
>>>>>>>>>> parameter. [nodeId=b07182d0-bf70-4318-9fe3-d7d5228bd6ef,
>>>>>>>>>> clientFailureDetectionTimeout=30000]
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> [22:26:21,839][WARNING][tcp-disco-client-message-worker-#12%xxxxxx%][TcpDiscoverySpi]
>>>>>>>>>> Client node considered as unreachable and will be dropped from cluster,
>>>>>>>>>> because no metrics update messages received in interval:
>>>>>>>>>> TcpDiscoverySpi.clientFailureDetectionTimeout() ms. It may be caused by
>>>>>>>>>> network problems or long GC pause on client node, try to increase this
>>>>>>>>>> parameter. [nodeId=302cff60-b88d-40da-9e12-b955e6bf973d,
>>>>>>>>>> clientFailureDetectionTimeout=30000]
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> [22:26:21,847][SEVERE][ttl-cleanup-worker-#48%xxxxxx%][G] Blocked
>>>>>>>>>> system-critical thread has been detected. This can lead to cluster-wide
>>>>>>>>>> undefined behaviour [threadName=partition-exchanger, blockedFor=57s]
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> 22:26:21,954][SEVERE][ttl-cleanup-worker-#48%xxxxxx%][] Critical
>>>>>>>>>> system error detected. Will be handled accordingly to configured handler
>>>>>>>>>> [hnd=StopNodeOrHaltFailureHandler [tryStop=false, timeout=0,
>>>>>>>>>> super=AbstractFailureHandler
>>>>>>>>>> [ignoredFailureTypes=[SYSTEM_WORKER_BLOCKED]]], failureCtx=FailureContext
>>>>>>>>>> [type=SYSTEM_WORKER_BLOCKED, err=class o.a.i.IgniteException: GridWorker
>>>>>>>>>> [name=partition-exchanger, igniteInstanceName=xxxxxx, finished=false,
>>>>>>>>>> heartbeatTs=1568931981805]]]
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> -
>>>>>>>>>> Denis
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> On Thu, Oct 3, 2019 at 11:50 AM John Smith <
>>>>>>>>>> java.dev.mtl@gmail.com> wrote:
>>>>>>>>>>
>>>>>>>>>>> So I have been monitoring my node and the same one seems to stop
>>>>>>>>>>> once in a while.
>>>>>>>>>>>
>>>>>>>>>>> https://www.dropbox.com/s/7n5qfsl5uyi1obt/ignite-logs.zip?dl=0
>>>>>>>>>>>
>>>>>>>>>>> I have attached the GC logs and the ignite logs. From what I see
>>>>>>>>>>> from gc.logs I don't see big pauses. I could be wrong.
>>>>>>>>>>>
>>>>>>>>>>> The machine is 16GB and I have the configs here:
>>>>>>>>>>> https://www.dropbox.com/s/hkv38s3vce5a4sk/ignite-config.xml?dl=0
>>>>>>>>>>>
>>>>>>>>>>> Here are the JVM settings...
>>>>>>>>>>>
>>>>>>>>>>> if [ -z "$JVM_OPTS" ] ; then
>>>>>>>>>>>     JVM_OPTS="-Xms2g -Xmx2g -server -XX:MaxMetaspaceSize=256m"
>>>>>>>>>>> fi
>>>>>>>>>>>
>>>>>>>>>>> JVM_OPTS="$JVM_OPTS -XX:+UseG1GC -verbose:gc -XX:+PrintGCDetails
>>>>>>>>>>> -Xloggc:/var/log/apache-ignite/gc.log"
>>>>>>>>>>>
>>>>>>>>>>> JVM_OPTS="${JVM_OPTS} -Xss16m"
>>>>>>>>>>>
>>>>>>>>>>

Re: Node stopped.

Posted by John Smith <ja...@gmail.com>.
Hi, thanks. I already made sure that each ignite VM runs on a separate host
within our cloud. I'm not doing any of that migration stuff.

Also recently disabled metrics igniteConfig.setMetricsLogFrequency(0); Just
to make sure it doesn't get too chatty. But I doubt this would affect it...

Should I maybe set the node timeout a bit higher than 30 seconds, maybe put
it to 60 seconds? I remember people suggesting this but not sure...
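
For reference, a minimal sketch of what raising that timeout could look like with
programmatic configuration on the server side. The class name and the 60000 ms value are
only illustrative (picked up from the discussion above, not a recommendation), and the
same properties can equally be set in the Spring XML config:

import org.apache.ignite.Ignite;
import org.apache.ignite.Ignition;
import org.apache.ignite.configuration.IgniteConfiguration;

public class ServerNodeStartup {
    public static void main(String[] args) {
        IgniteConfiguration cfg = new IgniteConfiguration();

        // Give slow or briefly paused clients more time before they are dropped
        // (the default clientFailureDetectionTimeout is 30000 ms).
        cfg.setClientFailureDetectionTimeout(60_000);

        // Metrics log output disabled, as already done above.
        cfg.setMetricsLogFrequency(0);

        Ignite ignite = Ignition.start(cfg);
        System.out.println("Server node started: " + ignite.cluster().localNode().id());
    }
}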

On Fri, 18 Oct 2019 at 06:09, Denis Mekhanikov <dm...@gmail.com>
wrote:

> The following documentation page has some useful points on deployment in a
> virtualised environment:
> https://apacheignite.readme.io/docs/vmware-deployment
>
> Denis
> On 17 Oct 2019, 17:41 +0300, John Smith <ja...@gmail.com>, wrote:
>
> Ok I have Metricbeat running on the VM hopefully I will see something...
>
> On Thu, 17 Oct 2019 at 05:09, Denis Mekhanikov <dm...@gmail.com>
> wrote:
>
>> There are no long pauses in the GC logs, so it must be the whole VM pause.
>>
>> Denis
>> On 16 Oct 2019, 23:07 +0300, John Smith <ja...@gmail.com>, wrote:
>>
>> Sorry here is the gc logs for all 3 machines:
>> https://www.dropbox.com/s/chbbxigahd4v9di/gc-logs.zip?dl=0
>>
>> On Wed, 16 Oct 2019 at 15:49, John Smith <ja...@gmail.com> wrote:
>>
>>> Hi, so it happened again here is my latest gc.log stats:
>>> https://gceasy.io/diamondgc-report.jsp?oTxnId_value=a215d573-d1cf-4d53-acf1-9001432bb28e
>>>
>>> Everything seems ok to me. I also have Elasticsearch Metricbeat running,
>>> the CPU usage looked normal at the time.
>>>
>>> On Thu, 10 Oct 2019 at 13:05, Denis Mekhanikov <dm...@gmail.com>
>>> wrote:
>>>
>>>> Unfortunately, I don’t.
>>>> You can ask the VM vendor or the cloud provider (if you use any) for a
>>>> proper tooling or logs.
>>>> Make sure, that there is no such step in the VM’s lifecycle that makes
>>>> it freeze for a minute.
>>>> Also make sure that the physical CPU is not overutilized and no VMs
>>>> that run on it are starving.
>>>>
>>>> Denis
>>>> On 10 Oct 2019, 19:03 +0300, John Smith <ja...@gmail.com>,
>>>> wrote:
>>>>
>>>> Do you know of any good tools I can use to check the VM?
>>>>
>>>> On Thu, 10 Oct 2019 at 11:38, Denis Mekhanikov <dm...@gmail.com>
>>>> wrote:
>>>>
>>>>> > Hi Dennis, so are you saying I should enable GC logs + the safe
>>>>> point logs as well?
>>>>>
>>>>> Having safepoint statistics in your GC logs may be useful, so I
>>>>> recommend enabling them for troubleshooting purposes.
>>>>> Check the lifecycle of your virtual machines. There is a high chance
>>>>> that the whole machine is frozen, not just the Ignite node.
>>>>>
>>>>> Denis
>>>>> On 10 Oct 2019, 18:25 +0300, John Smith <ja...@gmail.com>,
>>>>> wrote:
>>>>>
>>>>> Hi Dennis, so are you saying I should enable GC logs + the safe point
>>>>> logs as well?
>>>>>
>>>>> On Thu, 10 Oct 2019 at 11:22, John Smith <ja...@gmail.com>
>>>>> wrote:
>>>>>
>>>>>> You are correct, it is running in a VM.
>>>>>>
>>>>>> On Thu, 10 Oct 2019 at 10:11, Denis Mekhanikov <dm...@gmail.com>
>>>>>> wrote:
>>>>>>
>>>>>>> Hi!
>>>>>>>
>>>>>>> There are the following messages in the logs:
>>>>>>>
>>>>>>> [22:26:21,816][WARNING][jvm-pause-detector-worker][IgniteKernal%xxxxxx]
>>>>>>> Possible too long JVM pause: *55705 milliseconds*.
>>>>>>> ...
>>>>>>> [22:26:21,847][SEVERE][ttl-cleanup-worker-#48%xxxxxx%][G] Blocked
>>>>>>> system-critical thread has been detected. This can lead to cluster-wide
>>>>>>> undefined behaviour [threadName=partition-exchanger, blockedFor=
>>>>>>> *57s*]
>>>>>>>
>>>>>>> Looks like the JVM was paused for almost a minute. It doesn’t seem
>>>>>>> to be caused by a garbage collection, since there is no evidence of GC
>>>>>>> pressure in the GC log. Usually such big pauses happen in virtualised
>>>>>>> environments when backups are captured from machines or they just don’t
>>>>>>> have enough CPU time.
>>>>>>>
>>>>>>> Looking at safepoint statistics may also reveal some interesting
>>>>>>> details. You can learn about safepoints here:
>>>>>>> https://blog.gceasy.io/2016/12/22/total-time-for-which-application-threads-were-stopped/
>>>>>>>
>>>>>>> Denis
>>>>>>> On 9 Oct 2019, 23:14 +0300, John Smith <ja...@gmail.com>,
>>>>>>> wrote:
>>>>>>>
>>>>>>> So the error says to set clientFailureDetectionTimeout=30000
>>>>>>>
>>>>>>> 1- Do I put a higher value than 30000?
>>>>>>> 2- Do I do it on the client or the server nodes or all nodes?
>>>>>>> 3- Also if a client is misbehaving why shutoff the server node?
>>>>>>>
>>>>>>> On Thu, 3 Oct 2019 at 21:02, John Smith <ja...@gmail.com>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> But if it's the client node that's failing why is the server node
>>>>>>>> stopping? I'm pretty sure we do very simple put and get operations. All
>>>>>>>> the client nodes are started as client=true
>>>>>>>>
>>>>>>>> On Thu., Oct. 3, 2019, 4:18 p.m. Denis Magda, <dm...@apache.org>
>>>>>>>> wrote:
>>>>>>>>
>>>>>>>>> Hi John,
>>>>>>>>>
>>>>>>>>> I don't see any GC pressure or STW pauses either. If not GC then
>>>>>>>>> it might have been caused by a network glitch or some long-running
>>>>>>>>> operation started by the app. These log statements:
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> [22:26:21,827][WARNING][tcp-disco-client-message-worker-#10%xxxxxx%][TcpDiscoverySpi]
>>>>>>>>> Client node considered as unreachable and will be dropped from cluster,
>>>>>>>>> because no metrics update messages received in interval:
>>>>>>>>> TcpDiscoverySpi.clientFailureDetectionTimeout() ms. It may be caused by
>>>>>>>>> network problems or long GC pause on client node, try to increase this
>>>>>>>>> parameter. [nodeId=b07182d0-bf70-4318-9fe3-d7d5228bd6ef,
>>>>>>>>> clientFailureDetectionTimeout=30000]
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> [22:26:21,839][WARNING][tcp-disco-client-message-worker-#12%xxxxxx%][TcpDiscoverySpi]
>>>>>>>>> Client node considered as unreachable and will be dropped from cluster,
>>>>>>>>> because no metrics update messages received in interval:
>>>>>>>>> TcpDiscoverySpi.clientFailureDetectionTimeout() ms. It may be caused by
>>>>>>>>> network problems or long GC pause on client node, try to increase this
>>>>>>>>> parameter. [nodeId=302cff60-b88d-40da-9e12-b955e6bf973d,
>>>>>>>>> clientFailureDetectionTimeout=30000]
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> [22:26:21,847][SEVERE][ttl-cleanup-worker-#48%xxxxxx%][G] Blocked
>>>>>>>>> system-critical thread has been detected. This can lead to cluster-wide
>>>>>>>>> undefined behaviour [threadName=partition-exchanger, blockedFor=57s]
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> 22:26:21,954][SEVERE][ttl-cleanup-worker-#48%xxxxxx%][] Critical
>>>>>>>>> system error detected. Will be handled accordingly to configured handler
>>>>>>>>> [hnd=StopNodeOrHaltFailureHandler [tryStop=false, timeout=0,
>>>>>>>>> super=AbstractFailureHandler
>>>>>>>>> [ignoredFailureTypes=[SYSTEM_WORKER_BLOCKED]]], failureCtx=FailureContext
>>>>>>>>> [type=SYSTEM_WORKER_BLOCKED, err=class o.a.i.IgniteException: GridWorker
>>>>>>>>> [name=partition-exchanger, igniteInstanceName=xxxxxx, finished=false,
>>>>>>>>> heartbeatTs=1568931981805]]]
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> -
>>>>>>>>> Denis
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On Thu, Oct 3, 2019 at 11:50 AM John Smith <ja...@gmail.com>
>>>>>>>>> wrote:
>>>>>>>>>
>>>>>>>>>> So I have been monitoring my node and the same one seems to stop
>>>>>>>>>> once in a while.
>>>>>>>>>>
>>>>>>>>>> https://www.dropbox.com/s/7n5qfsl5uyi1obt/ignite-logs.zip?dl=0
>>>>>>>>>>
>>>>>>>>>> I have attached the GC logs and the ignite logs. From what I see
>>>>>>>>>> from gc.logs I don't see big pauses. I could be wrong.
>>>>>>>>>>
>>>>>>>>>> The machine is 16GB and I have the configs here:
>>>>>>>>>> https://www.dropbox.com/s/hkv38s3vce5a4sk/ignite-config.xml?dl=0
>>>>>>>>>>
>>>>>>>>>> Here are the JVM settings...
>>>>>>>>>>
>>>>>>>>>> if [ -z "$JVM_OPTS" ] ; then
>>>>>>>>>>     JVM_OPTS="-Xms2g -Xmx2g -server -XX:MaxMetaspaceSize=256m"
>>>>>>>>>> fi
>>>>>>>>>>
>>>>>>>>>> JVM_OPTS="$JVM_OPTS -XX:+UseG1GC -verbose:gc -XX:+PrintGCDetails
>>>>>>>>>> -Xloggc:/var/log/apache-ignite/gc.log"
>>>>>>>>>>
>>>>>>>>>> JVM_OPTS="${JVM_OPTS} -Xss16m"
>>>>>>>>>>
>>>>>>>>>

Re: Node stopped.

Posted by Denis Mekhanikov <dm...@gmail.com>.
The following documentation page has some useful points on deployment in a virtualised environment: https://apacheignite.readme.io/docs/vmware-deployment

Denis
On 17 Oct 2019, 17:41 +0300, John Smith <ja...@gmail.com>, wrote:
> Ok I have Metricbeat running on the VM hopefully I will see something...
>
> > On Thu, 17 Oct 2019 at 05:09, Denis Mekhanikov <dm...@gmail.com> wrote:
> > > There are no long pauses in the GC logs, so it must be the whole VM pause.
> > >
> > > Denis
> > > On 16 Oct 2019, 23:07 +0300, John Smith <ja...@gmail.com>, wrote:
> > > > Sorry here is the gc logs for all 3 machines: https://www.dropbox.com/s/chbbxigahd4v9di/gc-logs.zip?dl=0
> > > >
> > > > > On Wed, 16 Oct 2019 at 15:49, John Smith <ja...@gmail.com> wrote:
> > > > > > Hi, so it happened again here is my latest gc.log stats: https://gceasy.io/diamondgc-report.jsp?oTxnId_value=a215d573-d1cf-4d53-acf1-9001432bb28e
> > > > > >
> > > > > > Everything seems ok to me. I also have Elasticsearch Metricbeat running, the CPU usage looked normal at the time.
> > > > > >
> > > > > > > On Thu, 10 Oct 2019 at 13:05, Denis Mekhanikov <dm...@gmail.com> wrote:
> > > > > > > > Unfortunately, I don’t.
> > > > > > > > You can ask the VM vendor or the cloud provider (if you use any) for a proper tooling or logs.
> > > > > > > > Make sure, that there is no such step in the VM’s lifecycle that makes it freeze for a minute.
> > > > > > > > Also make sure that the physical CPU is not overutilized and no VMs that run on it are starving.
> > > > > > > >
> > > > > > > > Denis
> > > > > > > > On 10 Oct 2019, 19:03 +0300, John Smith <ja...@gmail.com>, wrote:
> > > > > > > > > Do you know of any good tools I can use to check the VM?
> > > > > > > > >
> > > > > > > > > > On Thu, 10 Oct 2019 at 11:38, Denis Mekhanikov <dm...@gmail.com> wrote:
> > > > > > > > > > > > Hi Dennis, so are you saying I should enable GC logs + the safe point logs as well?
> > > > > > > > > > >
> > > > > > > > > > > Having safepoint statistics in your GC logs may be useful, so I recommend enabling them for troubleshooting purposes.
> > > > > > > > > > > Check the lifecycle of your virtual machines. There is a high chance that the whole machine is frozen, not just the Ignite node.
> > > > > > > > > > >
> > > > > > > > > > > Denis
> > > > > > > > > > > On 10 Oct 2019, 18:25 +0300, John Smith <ja...@gmail.com>, wrote:
> > > > > > > > > > > > Hi Dennis, so are you saying I should enable GC logs + the safe point logs as well?
> > > > > > > > > > > >
> > > > > > > > > > > > > On Thu, 10 Oct 2019 at 11:22, John Smith <ja...@gmail.com> wrote:
> > > > > > > > > > > > > > You are correct, it is running in a VM.
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > > On Thu, 10 Oct 2019 at 10:11, Denis Mekhanikov <dm...@gmail.com> wrote:
> > > > > > > > > > > > > > > > Hi!
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > There are the following messages in the logs:
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > [22:26:21,816][WARNING][jvm-pause-detector-worker][IgniteKernal%xxxxxx] Possible too long JVM pause: 55705 milliseconds.
> > > > > > > > > > > > > > > > ...
> > > > > > > > > > > > > > > > [22:26:21,847][SEVERE][ttl-cleanup-worker-#48%xxxxxx%][G] Blocked system-critical thread has been detected. This can lead to cluster-wide undefined behaviour [threadName=partition-exchanger, blockedFor=57s]
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > Looks like the JVM was paused for almost a minute. It doesn’t seem to be caused by a garbage collection, since there is no evidence of GC pressure in the GC log. Usually such big pauses happen in virtualised environments when backups are captured from machines or they just don’t have enough CPU time.
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > Looking at safepoint statistics may also reveal some interesting details. You can learn about safepoints here: https://blog.gceasy.io/2016/12/22/total-time-for-which-application-threads-were-stopped/
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > Denis
> > > > > > > > > > > > > > > > On 9 Oct 2019, 23:14 +0300, John Smith <ja...@gmail.com>, wrote:
> > > > > > > > > > > > > > > > > So the error says to set clientFailureDetectionTimeout=30000
> > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > 1- Do I put a higher value than 30000?
> > > > > > > > > > > > > > > > > 2- Do I do it on the client or the server nodes or all nodes?
> > > > > > > > > > > > > > > > > 3- Also if a client is misbehaving why shutoff the server node?
> > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > On Thu, 3 Oct 2019 at 21:02, John Smith <ja...@gmail.com> wrote:
> > > > > > > > > > > > > > > > > > > But if it's the client node that's failing why is the server node stopping? I'm pretty sure we do very simple put and get operations. All the client nodes are started as client=true
> > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > > On Thu., Oct. 3, 2019, 4:18 p.m. Denis Magda, <dm...@apache.org> wrote:
> > > > > > > > > > > > > > > > > > > > > Hi John,
> > > > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > > > I don't see any GC pressure or STW pauses either. If not GC then it might have been caused by a network glitch or some long-running operation started by the app. These log statements:
> > > > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > > > [22:26:21,827][WARNING][tcp-disco-client-message-worker-#10%xxxxxx%][TcpDiscoverySpi] Client node considered as unreachable and will be dropped from cluster, because no metrics update messages received in interval: TcpDiscoverySpi.clientFailureDetectionTimeout() ms. It may be caused by network problems or long GC pause on client node, try to increase this parameter. [nodeId=b07182d0-bf70-4318-9fe3-d7d5228bd6ef, clientFailureDetectionTimeout=30000]
> > > > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > > > [22:26:21,839][WARNING][tcp-disco-client-message-worker-#12%xxxxxx%][TcpDiscoverySpi] Client node considered as unreachable and will be dropped from cluster, because no metrics update messages received in interval: TcpDiscoverySpi.clientFailureDetectionTimeout() ms. It may be caused by network problems or long GC pause on client node, try to increase this parameter. [nodeId=302cff60-b88d-40da-9e12-b955e6bf973d, clientFailureDetectionTimeout=30000]
> > > > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > > > [22:26:21,847][SEVERE][ttl-cleanup-worker-#48%xxxxxx%][G] Blocked system-critical thread has been detected. This can lead to cluster-wide undefined behaviour [threadName=partition-exchanger, blockedFor=57s]
> > > > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > > > 22:26:21,954][SEVERE][ttl-cleanup-worker-#48%xxxxxx%][] Critical system error detected. Will be handled accordingly to configured handler [hnd=StopNodeOrHaltFailureHandler [tryStop=false, timeout=0, super=AbstractFailureHandler [ignoredFailureTypes=[SYSTEM_WORKER_BLOCKED]]], failureCtx=FailureContext [type=SYSTEM_WORKER_BLOCKED, err=class o.a.i.IgniteException: GridWorker [name=partition-exchanger, igniteInstanceName=xxxxxx, finished=false, heartbeatTs=1568931981805]]]
> > > > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > > > -
> > > > > > > > > > > > > > > > > > > > > Denis
> > > > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > > > > On Thu, Oct 3, 2019 at 11:50 AM John Smith <ja...@gmail.com> wrote:
> > > > > > > > > > > > > > > > > > > > > > > So I have been monitoring my node and the same one seems to stop once in a while.
> > > > > > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > > > > > https://www.dropbox.com/s/7n5qfsl5uyi1obt/ignite-logs.zip?dl=0
> > > > > > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > > > > > I have attached the GC logs and the ignite logs. From what I see from gc.logs I don't see big pauses. I could be wrong.
> > > > > > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > > > > > The machine is 16GB and I have the configs here: https://www.dropbox.com/s/hkv38s3vce5a4sk/ignite-config.xml?dl=0
> > > > > > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > > > > > Here are the JVM settings...
> > > > > > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > > > > > if [ -z "$JVM_OPTS" ] ; then
> > > > > > > > > > > > > > > > > > > > > > >     JVM_OPTS="-Xms2g -Xmx2g -server -XX:MaxMetaspaceSize=256m"
> > > > > > > > > > > > > > > > > > > > > > > fi
> > > > > > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > > > > > JVM_OPTS="$JVM_OPTS -XX:+UseG1GC -verbose:gc -XX:+PrintGCDetails -Xloggc:/var/log/apache-ignite/gc.log"
> > > > > > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > > > > > JVM_OPTS="${JVM_OPTS} -Xss16m"

Re: Node stopped.

Posted by John Smith <ja...@gmail.com>.
Ok, I have Metricbeat running on the VM, hopefully I will see something...

On Thu, 17 Oct 2019 at 05:09, Denis Mekhanikov <dm...@gmail.com>
wrote:

> There are no long pauses in the GC logs, so it must be the whole VM pause.
>
> Denis
> On 16 Oct 2019, 23:07 +0300, John Smith <ja...@gmail.com>, wrote:
>
> Sorry here is the gc logs for all 3 machines:
> https://www.dropbox.com/s/chbbxigahd4v9di/gc-logs.zip?dl=0
>
> On Wed, 16 Oct 2019 at 15:49, John Smith <ja...@gmail.com> wrote:
>
>> Hi, so it happened again here is my latest gc.log stats:
>> https://gceasy.io/diamondgc-report.jsp?oTxnId_value=a215d573-d1cf-4d53-acf1-9001432bb28e
>>
>> Everything seems ok to me. I also have Elasticsearch Metricbeat running,
>> the CPU usage looked normal at the time.
>>
>> On Thu, 10 Oct 2019 at 13:05, Denis Mekhanikov <dm...@gmail.com>
>> wrote:
>>
>>> Unfortunately, I don’t.
>>> You can ask the VM vendor or the cloud provider (if you use any) for a
>>> proper tooling or logs.
>>> Make sure, that there is no such step in the VM’s lifecycle that makes
>>> it freeze for a minute.
>>> Also make sure that the physical CPU is not overutilized and no VMs that
>>> run on it are starving.
>>>
>>> Denis
>>> On 10 Oct 2019, 19:03 +0300, John Smith <ja...@gmail.com>, wrote:
>>>
>>> Do you know of any good tools I can use to check the VM?
>>>
>>> On Thu, 10 Oct 2019 at 11:38, Denis Mekhanikov <dm...@gmail.com>
>>> wrote:
>>>
>>>> > Hi Dennis, so are you saying I should enable GC logs + the safe point
>>>> logs as well?
>>>>
>>>> Having safepoint statistics in your GC logs may be useful, so I
>>>> recommend enabling them for troubleshooting purposes.
>>>> Check the lifecycle of your virtual machines. There is a high chance
>>>> that the whole machine is frozen, not just the Ignite node.
>>>>
>>>> Denis
>>>> On 10 Oct 2019, 18:25 +0300, John Smith <ja...@gmail.com>,
>>>> wrote:
>>>>
>>>> Hi Dennis, so are you saying I should enable GC logs + the safe point
>>>> logs as well?
>>>>
>>>> On Thu, 10 Oct 2019 at 11:22, John Smith <ja...@gmail.com>
>>>> wrote:
>>>>
>>>>> You are correct, it is running in a VM.
>>>>>
>>>>> On Thu, 10 Oct 2019 at 10:11, Denis Mekhanikov <dm...@gmail.com>
>>>>> wrote:
>>>>>
>>>>>> Hi!
>>>>>>
>>>>>> There are the following messages in the logs:
>>>>>>
>>>>>> [22:26:21,816][WARNING][jvm-pause-detector-worker][IgniteKernal%xxxxxx]
>>>>>> Possible too long JVM pause: *55705 milliseconds*.
>>>>>> ...
>>>>>> [22:26:21,847][SEVERE][ttl-cleanup-worker-#48%xxxxxx%][G] Blocked
>>>>>> system-critical thread has been detected. This can lead to cluster-wide
>>>>>> undefined behaviour [threadName=partition-exchanger, blockedFor=*57s*
>>>>>> ]
>>>>>>
>>>>>> Looks like the JVM was paused for almost a minute. It doesn’t seem to
>>>>>> be caused by a garbage collection, since there is no evidence of GC
>>>>>> pressure in the GC log. Usually such big pauses happen in virtualised
>>>>>> environments when backups are captured from machines or they just don’t
>>>>>> have enough CPU time.
>>>>>>
>>>>>> Looking at safepoint statistics may also reveal some interesting
>>>>>> details. You can learn about safepoints here:
>>>>>> https://blog.gceasy.io/2016/12/22/total-time-for-which-application-threads-were-stopped/
>>>>>>
>>>>>> Denis
>>>>>> On 9 Oct 2019, 23:14 +0300, John Smith <ja...@gmail.com>,
>>>>>> wrote:
>>>>>>
>>>>>> So the error says to set clientFailureDetectionTimeout=30000
>>>>>>
>>>>>> 1- Do I put a higher value than 30000?
>>>>>> 2- Do I do it on the client or the server nodes or all nodes?
>>>>>> 3- Also if a client is misbehaving why shut off the server node?
>>>>>>
>>>>>> On Thu, 3 Oct 2019 at 21:02, John Smith <ja...@gmail.com>
>>>>>> wrote:
>>>>>>
>>>>>>> But if it's the client node that's failing why is the server node
>>>>>>> stopping? I'm pretty sure we do very simple put and get operations. All
>>>>>>> the client nodes are started as client=true
>>>>>>>
>>>>>>> On Thu., Oct. 3, 2019, 4:18 p.m. Denis Magda, <dm...@apache.org>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> Hi John,
>>>>>>>>
>>>>>>>> I don't see any GC pressure or STW pauses either. If not GC then it
>>>>>>>> might have been caused by a network glitch or some long-running operation
>>>>>>>> started by the app. These log statements:
>>>>>>>>
>>>>>>>>
>>>>>>>> [22:26:21,827][WARNING][tcp-disco-client-message-worker-#10%xxxxxx%][TcpDiscoverySpi]
>>>>>>>> Client node considered as unreachable and will be dropped from cluster,
>>>>>>>> because no metrics update messages received in interval:
>>>>>>>> TcpDiscoverySpi.clientFailureDetectionTimeout() ms. It may be caused by
>>>>>>>> network problems or long GC pause on client node, try to increase this
>>>>>>>> parameter. [nodeId=b07182d0-bf70-4318-9fe3-d7d5228bd6ef,
>>>>>>>> clientFailureDetectionTimeout=30000]
>>>>>>>>
>>>>>>>>
>>>>>>>> [22:26:21,839][WARNING][tcp-disco-client-message-worker-#12%xxxxxx%][TcpDiscoverySpi]
>>>>>>>> Client node considered as unreachable and will be dropped from cluster,
>>>>>>>> because no metrics update messages received in interval:
>>>>>>>> TcpDiscoverySpi.clientFailureDetectionTimeout() ms. It may be caused by
>>>>>>>> network problems or long GC pause on client node, try to increase this
>>>>>>>> parameter. [nodeId=302cff60-b88d-40da-9e12-b955e6bf973d,
>>>>>>>> clientFailureDetectionTimeout=30000]
>>>>>>>>
>>>>>>>>
>>>>>>>> [22:26:21,847][SEVERE][ttl-cleanup-worker-#48%xxxxxx%][G] Blocked
>>>>>>>> system-critical thread has been detected. This can lead to cluster-wide
>>>>>>>> undefined behaviour [threadName=partition-exchanger, blockedFor=57s]
>>>>>>>>
>>>>>>>>
>>>>>>>> 22:26:21,954][SEVERE][ttl-cleanup-worker-#48%xxxxxx%][] Critical
>>>>>>>> system error detected. Will be handled accordingly to configured handler
>>>>>>>> [hnd=StopNodeOrHaltFailureHandler [tryStop=false, timeout=0,
>>>>>>>> super=AbstractFailureHandler
>>>>>>>> [ignoredFailureTypes=[SYSTEM_WORKER_BLOCKED]]], failureCtx=FailureContext
>>>>>>>> [type=SYSTEM_WORKER_BLOCKED, err=class o.a.i.IgniteException: GridWorker
>>>>>>>> [name=partition-exchanger, igniteInstanceName=xxxxxx, finished=false,
>>>>>>>> heartbeatTs=1568931981805]]]
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> -
>>>>>>>> Denis
>>>>>>>>
>>>>>>>>
>>>>>>>> On Thu, Oct 3, 2019 at 11:50 AM John Smith <ja...@gmail.com>
>>>>>>>> wrote:
>>>>>>>>
>>>>>>>>> So I have been monitoring my node and the same one seems to stop
>>>>>>>>> once in a while.
>>>>>>>>>
>>>>>>>>> https://www.dropbox.com/s/7n5qfsl5uyi1obt/ignite-logs.zip?dl=0
>>>>>>>>>
>>>>>>>>> I have attached the GC logs and the ignite logs. From what I see
>>>>>>>>> from gc.logs I don't see big pauses. I could be wrong.
>>>>>>>>>
>>>>>>>>> The machine is 16GB and I have the configs here:
>>>>>>>>> https://www.dropbox.com/s/hkv38s3vce5a4sk/ignite-config.xml?dl=0
>>>>>>>>>
>>>>>>>>> Here are the JVM settings...
>>>>>>>>>
>>>>>>>>> if [ -z "$JVM_OPTS" ] ; then
>>>>>>>>>     JVM_OPTS="-Xms2g -Xmx2g -server -XX:MaxMetaspaceSize=256m"
>>>>>>>>> fi
>>>>>>>>>
>>>>>>>>> JVM_OPTS="$JVM_OPTS -XX:+UseG1GC -verbose:gc -XX:+PrintGCDetails
>>>>>>>>> -Xloggc:/var/log/apache-ignite/gc.log"
>>>>>>>>>
>>>>>>>>> JVM_OPTS="${JVM_OPTS} -Xss16m"
>>>>>>>>>
>>>>>>>>

Re: Node stopped.

Posted by Denis Mekhanikov <dm...@gmail.com>.
There are no long pauses in the GC logs, so it must be the whole VM pause.
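
One way to sanity-check that conclusion is to pull the largest wall-clock pause the collector itself recorded. A rough sketch, assuming a JDK 8 log produced with -XX:+PrintGCDetails and the -Xloggc path used earlier in the thread:

# Print the ten largest real (wall-clock) GC pause times, in seconds.
# Every collection in a -XX:+PrintGCDetails log ends with a
# "[Times: user=... sys=... real=N.NN secs]" line.
grep -o 'real=[0-9.]*' /var/log/apache-ignite/gc.log | cut -d= -f2 | sort -rn | head -10

If the largest value printed is far below the roughly 55 second pause reported by the jvm-pause-detector-worker, the stall did not happen inside a GC cycle.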

Denis
On 16 Oct 2019, 23:07 +0300, John Smith <ja...@gmail.com>, wrote:
> Sorry here is the gc logs for all 3 machines: https://www.dropbox.com/s/chbbxigahd4v9di/gc-logs.zip?dl=0
>
> > On Wed, 16 Oct 2019 at 15:49, John Smith <ja...@gmail.com> wrote:
> > > Hi, so it happened again here is my latest gc.log stats: https://gceasy.io/diamondgc-report.jsp?oTxnId_value=a215d573-d1cf-4d53-acf1-9001432bb28e
> > >
> > > Everything seems ok to me. I also have Elasticsearch Metricbeat running, the CPU usage looked normal at the time.
> > >
> > > > On Thu, 10 Oct 2019 at 13:05, Denis Mekhanikov <dm...@gmail.com> wrote:
> > > > > Unfortunately, I don’t.
> > > > > You can ask the VM vendor or the cloud provider (if you use any) for a proper tooling or logs.
> > > > > Make sure, that there is no such step in the VM’s lifecycle that makes it freeze for a minute.
> > > > > Also make sure that the physical CPU is not overutilized and no VMs that run on it are starving.
> > > > >
> > > > > Denis
> > > > > On 10 Oct 2019, 19:03 +0300, John Smith <ja...@gmail.com>, wrote:
> > > > > > Do you know of any good tools I can use to check the VM?
> > > > > >
> > > > > > > On Thu, 10 Oct 2019 at 11:38, Denis Mekhanikov <dm...@gmail.com> wrote:
> > > > > > > > > Hi Dennis, so are you saying I should enable GC logs + the safe point logs as well?
> > > > > > > >
> > > > > > > > Having safepoint statistics in your GC logs may be useful, so I recommend enabling them for troubleshooting purposes.
> > > > > > > > Check the lifecycle of your virtual machines. There is a high chance that the whole machine is frozen, not just the Ignite node.
> > > > > > > >
> > > > > > > > Denis
> > > > > > > > On 10 Oct 2019, 18:25 +0300, John Smith <ja...@gmail.com>, wrote:
> > > > > > > > > Hi Dennis, so are you saying I should enable GC logs + the safe point logs as well?
> > > > > > > > >
> > > > > > > > > > On Thu, 10 Oct 2019 at 11:22, John Smith <ja...@gmail.com> wrote:
> > > > > > > > > > > You are correct, it is running in a VM.
> > > > > > > > > > >
> > > > > > > > > > > > On Thu, 10 Oct 2019 at 10:11, Denis Mekhanikov <dm...@gmail.com> wrote:
> > > > > > > > > > > > > Hi!
> > > > > > > > > > > > >
> > > > > > > > > > > > > There are the following messages in the logs:
> > > > > > > > > > > > >
> > > > > > > > > > > > > [22:26:21,816][WARNING][jvm-pause-detector-worker][IgniteKernal%xxxxxx] Possible too long JVM pause: 55705 milliseconds.
> > > > > > > > > > > > > ...
> > > > > > > > > > > > > [22:26:21,847][SEVERE][ttl-cleanup-worker-#48%xxxxxx%][G] Blocked system-critical thread has been detected. This can lead to cluster-wide undefined behaviour [threadName=partition-exchanger, blockedFor=57s]
> > > > > > > > > > > > >
> > > > > > > > > > > > > Looks like the JVM was paused for almost a minute. It doesn’t seem to be caused by a garbage collection, since there is no evidence of GC pressure in the GC log. Usually such big pauses happen in virtualised environments when backups are captured from machines or they just don’t have enough CPU time.
> > > > > > > > > > > > >
> > > > > > > > > > > > > Looking at safepoint statistics may also reveal some interesting details. You can learn about safepoints here: https://blog.gceasy.io/2016/12/22/total-time-for-which-application-threads-were-stopped/
> > > > > > > > > > > > >
> > > > > > > > > > > > > Denis
> > > > > > > > > > > > > On 9 Oct 2019, 23:14 +0300, John Smith <ja...@gmail.com>, wrote:
> > > > > > > > > > > > > > So the error says to set clientFailureDetectionTimeout=30000
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > 1- Do I put a higher value than 30000?
> > > > > > > > > > > > > > 2- Do I do it on the client or the server nodes or all nodes?
> > > > > > > > > > > > > > 3- Also if a client is misbehaving why shut off the server node?
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > > On Thu, 3 Oct 2019 at 21:02, John Smith <ja...@gmail.com> wrote:
> > > > > > > > > > > > > > > > But if it's the client node that's failing why is the server node stopping? I'm pretty sure we do very simple put and get operations. All the client nodes are started as client=true
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > On Thu., Oct. 3, 2019, 4:18 p.m. Denis Magda, <dm...@apache.org> wrote:
> > > > > > > > > > > > > > > > > > Hi John,
> > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > I don't see any GC pressure or STW pauses either. If not GC then it might have been caused by a network glitch or some long-running operation started by the app. These log statements:
> > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > [22:26:21,827][WARNING][tcp-disco-client-message-worker-#10%xxxxxx%][TcpDiscoverySpi] Client node considered as unreachable and will be dropped from cluster, because no metrics update messages received in interval: TcpDiscoverySpi.clientFailureDetectionTimeout() ms. It may be caused by network problems or long GC pause on client node, try to increase this parameter. [nodeId=b07182d0-bf70-4318-9fe3-d7d5228bd6ef, clientFailureDetectionTimeout=30000]
> > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > [22:26:21,839][WARNING][tcp-disco-client-message-worker-#12%xxxxxx%][TcpDiscoverySpi] Client node considered as unreachable and will be dropped from cluster, because no metrics update messages received in interval: TcpDiscoverySpi.clientFailureDetectionTimeout() ms. It may be caused by network problems or long GC pause on client node, try to increase this parameter. [nodeId=302cff60-b88d-40da-9e12-b955e6bf973d, clientFailureDetectionTimeout=30000]
> > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > [22:26:21,847][SEVERE][ttl-cleanup-worker-#48%xxxxxx%][G] Blocked system-critical thread has been detected. This can lead to cluster-wide undefined behaviour [threadName=partition-exchanger, blockedFor=57s]
> > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > 22:26:21,954][SEVERE][ttl-cleanup-worker-#48%xxxxxx%][] Critical system error detected. Will be handled accordingly to configured handler [hnd=StopNodeOrHaltFailureHandler [tryStop=false, timeout=0, super=AbstractFailureHandler [ignoredFailureTypes=[SYSTEM_WORKER_BLOCKED]]], failureCtx=FailureContext [type=SYSTEM_WORKER_BLOCKED, err=class o.a.i.IgniteException: GridWorker [name=partition-exchanger, igniteInstanceName=xxxxxx, finished=false, heartbeatTs=1568931981805]]]
> > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > -
> > > > > > > > > > > > > > > > > > Denis
> > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > On Thu, Oct 3, 2019 at 11:50 AM John Smith <ja...@gmail.com> wrote:
> > > > > > > > > > > > > > > > > > > > So I have been monitoring my node and the same one seems to stop once in a while.
> > > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > > https://www.dropbox.com/s/7n5qfsl5uyi1obt/ignite-logs.zip?dl=0
> > > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > > I have attached the GC logs and the ignite logs. From what I see from gc.logs I don't see big pauses. I could be wrong.
> > > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > > The machine is 16GB and I have the configs here: https://www.dropbox.com/s/hkv38s3vce5a4sk/ignite-config.xml?dl=0
> > > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > > Here are the JVM settings...
> > > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > > if [ -z "$JVM_OPTS" ] ; then
> > > > > > > > > > > > > > > > > > > >     JVM_OPTS="-Xms2g -Xmx2g -server -XX:MaxMetaspaceSize=256m"
> > > > > > > > > > > > > > > > > > > > fi
> > > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > > JVM_OPTS="$JVM_OPTS -XX:+UseG1GC -verbose:gc -XX:+PrintGCDetails -Xloggc:/var/log/apache-ignite/gc.log"
> > > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > > JVM_OPTS="${JVM_OPTS} -Xss16m"

Re: Node stopped.

Posted by John Smith <ja...@gmail.com>.
Sorry, here are the GC logs for all 3 machines:
https://www.dropbox.com/s/chbbxigahd4v9di/gc-logs.zip?dl=0

On Wed, 16 Oct 2019 at 15:49, John Smith <ja...@gmail.com> wrote:

> Hi, so it happened again here is my latest gc.log stats:
> https://gceasy.io/diamondgc-report.jsp?oTxnId_value=a215d573-d1cf-4d53-acf1-9001432bb28e
>
> Everything seems ok to me. I also have Elasticsearch Metricbeat running,
> the CPU usage looked normal at the time.
>
> On Thu, 10 Oct 2019 at 13:05, Denis Mekhanikov <dm...@gmail.com>
> wrote:
>
>> Unfortunately, I don’t.
>> You can ask the VM vendor or the cloud provider (if you use any) for a
>> proper tooling or logs.
>> Make sure, that there is no such step in the VM’s lifecycle that makes it
>> freeze for a minute.
>> Also make sure that the physical CPU is not overutilized and no VMs that
>> run on it are starving.
>>
>> Denis
>> On 10 Oct 2019, 19:03 +0300, John Smith <ja...@gmail.com>, wrote:
>>
>> Do you know of any good tools I can use to check the VM?
>>
>> On Thu, 10 Oct 2019 at 11:38, Denis Mekhanikov <dm...@gmail.com>
>> wrote:
>>
>>> > Hi Dennis, so are you saying I should enable GC logs + the safe point
>>> logs as well?
>>>
>>> Having safepoint statistics in your GC logs may be useful, so I
>>> recommend enabling them for troubleshooting purposes.
>>> Check the lifecycle of your virtual machines. There is a high chance
>>> that the whole machine is frozen, not just the Ignite node.
>>>
>>> Denis
>>> On 10 Oct 2019, 18:25 +0300, John Smith <ja...@gmail.com>, wrote:
>>>
>>> Hi Dennis, so are you saying I should enable GC logs + the safe point
>>> logs as well?
>>>
>>> On Thu, 10 Oct 2019 at 11:22, John Smith <ja...@gmail.com> wrote:
>>>
>>>> You are correct, it is running in a VM.
>>>>
>>>> On Thu, 10 Oct 2019 at 10:11, Denis Mekhanikov <dm...@gmail.com>
>>>> wrote:
>>>>
>>>>> Hi!
>>>>>
>>>>> There are the following messages in the logs:
>>>>>
>>>>> [22:26:21,816][WARNING][jvm-pause-detector-worker][IgniteKernal%xxxxxx]
>>>>> Possible too long JVM pause: *55705 milliseconds*.
>>>>> ...
>>>>> [22:26:21,847][SEVERE][ttl-cleanup-worker-#48%xxxxxx%][G] Blocked
>>>>> system-critical thread has been detected. This can lead to cluster-wide
>>>>> undefined behaviour [threadName=partition-exchanger, blockedFor=*57s*]
>>>>>
>>>>> Looks like the JVM was paused for almost a minute. It doesn’t seem to
>>>>> be caused by a garbage collection, since there is no evidence of GC
>>>>> pressure in the GC log. Usually such big pauses happen in virtualised
>>>>> environments when backups are captured from machines or they just don’t
>>>>> have enough CPU time.
>>>>>
>>>>> Looking at safepoint statistics may also reveal some interesting
>>>>> details. You can learn about safepoints here:
>>>>> https://blog.gceasy.io/2016/12/22/total-time-for-which-application-threads-were-stopped/
>>>>>
>>>>> Denis
>>>>> On 9 Oct 2019, 23:14 +0300, John Smith <ja...@gmail.com>,
>>>>> wrote:
>>>>>
>>>>> So the error says to set clientFailureDetectionTimeout=30000
>>>>>
>>>>> 1- Do I put a higher value than 30000?
>>>>> 2- Do I do it on the client or the server nodes or all nodes?
>>>>> 3- Also if a client is misbehaving why shut off the server node?
>>>>>
>>>>> On Thu, 3 Oct 2019 at 21:02, John Smith <ja...@gmail.com>
>>>>> wrote:
>>>>>
>>>>>> But if it's the client node that's failing why is the server node
>>>>>> stopping? I'm pretty sure we do very simple put and get operations. All
>>>>>> the client nodes are started as client=true
>>>>>>
>>>>>> On Thu., Oct. 3, 2019, 4:18 p.m. Denis Magda, <dm...@apache.org>
>>>>>> wrote:
>>>>>>
>>>>>>> Hi John,
>>>>>>>
>>>>>>> I don't see any GC pressure or STW pauses either. If not GC then it
>>>>>>> might have been caused by a network glitch or some long-running operation
>>>>>>> started by the app. These log statements:
>>>>>>>
>>>>>>>
>>>>>>> [22:26:21,827][WARNING][tcp-disco-client-message-worker-#10%xxxxxx%][TcpDiscoverySpi]
>>>>>>> Client node considered as unreachable and will be dropped from cluster,
>>>>>>> because no metrics update messages received in interval:
>>>>>>> TcpDiscoverySpi.clientFailureDetectionTimeout() ms. It may be caused by
>>>>>>> network problems or long GC pause on client node, try to increase this
>>>>>>> parameter. [nodeId=b07182d0-bf70-4318-9fe3-d7d5228bd6ef,
>>>>>>> clientFailureDetectionTimeout=30000]
>>>>>>>
>>>>>>>
>>>>>>> [22:26:21,839][WARNING][tcp-disco-client-message-worker-#12%xxxxxx%][TcpDiscoverySpi]
>>>>>>> Client node considered as unreachable and will be dropped from cluster,
>>>>>>> because no metrics update messages received in interval:
>>>>>>> TcpDiscoverySpi.clientFailureDetectionTimeout() ms. It may be caused by
>>>>>>> network problems or long GC pause on client node, try to increase this
>>>>>>> parameter. [nodeId=302cff60-b88d-40da-9e12-b955e6bf973d,
>>>>>>> clientFailureDetectionTimeout=30000]
>>>>>>>
>>>>>>>
>>>>>>> [22:26:21,847][SEVERE][ttl-cleanup-worker-#48%xxxxxx%][G] Blocked
>>>>>>> system-critical thread has been detected. This can lead to cluster-wide
>>>>>>> undefined behaviour [threadName=partition-exchanger, blockedFor=57s]
>>>>>>>
>>>>>>>
>>>>>>> 22:26:21,954][SEVERE][ttl-cleanup-worker-#48%xxxxxx%][] Critical
>>>>>>> system error detected. Will be handled accordingly to configured handler
>>>>>>> [hnd=StopNodeOrHaltFailureHandler [tryStop=false, timeout=0,
>>>>>>> super=AbstractFailureHandler
>>>>>>> [ignoredFailureTypes=[SYSTEM_WORKER_BLOCKED]]], failureCtx=FailureContext
>>>>>>> [type=SYSTEM_WORKER_BLOCKED, err=class o.a.i.IgniteException: GridWorker
>>>>>>> [name=partition-exchanger, igniteInstanceName=xxxxxx, finished=false,
>>>>>>> heartbeatTs=1568931981805]]]
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> -
>>>>>>> Denis
>>>>>>>
>>>>>>>
>>>>>>> On Thu, Oct 3, 2019 at 11:50 AM John Smith <ja...@gmail.com>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> So I have been monitoring my node and the same one seems to stop
>>>>>>>> once in a while.
>>>>>>>>
>>>>>>>> https://www.dropbox.com/s/7n5qfsl5uyi1obt/ignite-logs.zip?dl=0
>>>>>>>>
>>>>>>>> I have attached the GC logs and the ignite logs. From what I see
>>>>>>>> from gc.logs I don't see big pauses. I could be wrong.
>>>>>>>>
>>>>>>>> The machine is 16GB and I have the configs here:
>>>>>>>> https://www.dropbox.com/s/hkv38s3vce5a4sk/ignite-config.xml?dl=0
>>>>>>>>
>>>>>>>> Here are the JVM settings...
>>>>>>>>
>>>>>>>> if [ -z "$JVM_OPTS" ] ; then
>>>>>>>>     JVM_OPTS="-Xms2g -Xmx2g -server -XX:MaxMetaspaceSize=256m"
>>>>>>>> fi
>>>>>>>>
>>>>>>>> JVM_OPTS="$JVM_OPTS -XX:+UseG1GC -verbose:gc -XX:+PrintGCDetails
>>>>>>>> -Xloggc:/var/log/apache-ignite/gc.log"
>>>>>>>>
>>>>>>>> JVM_OPTS="${JVM_OPTS} -Xss16m"
>>>>>>>>
>>>>>>>

Re: Node stopped.

Posted by John Smith <ja...@gmail.com>.
Hi, so it happened again. Here are my latest gc.log stats:
https://gceasy.io/diamondgc-report.jsp?oTxnId_value=a215d573-d1cf-4d53-acf1-9001432bb28e

Everything seems ok to me. I also have Elasticsearch Metricbeat running, and
the CPU usage looked normal at the time.

On Thu, 10 Oct 2019 at 13:05, Denis Mekhanikov <dm...@gmail.com>
wrote:

> Unfortunately, I don’t.
> You can ask the VM vendor or the cloud provider (if you use any) for a
> proper tooling or logs.
> Make sure, that there is no such step in the VM’s lifecycle that makes it
> freeze for a minute.
> Also make sure that the physical CPU is not overutilized and no VMs that
> run on it are starving.
>
> Denis
> On 10 Oct 2019, 19:03 +0300, John Smith <ja...@gmail.com>, wrote:
>
> Do you know of any good tools I can use to check the VM?
>
> On Thu, 10 Oct 2019 at 11:38, Denis Mekhanikov <dm...@gmail.com>
> wrote:
>
>> > Hi Dennis, so are you saying I should enable GC logs + the safe point
>> logs as well?
>>
>> Having safepoint statistics in your GC logs may be useful, so I recommend
>> enabling them for troubleshooting purposes.
>> Check the lifecycle of your virtual machines. There is a high chance that
>> the whole machine is frozen, not just the Ignite node.
>>
>> Denis
>> On 10 Oct 2019, 18:25 +0300, John Smith <ja...@gmail.com>, wrote:
>>
>> Hi Dennis, so are you saying I should enable GC logs + the safe point
>> logs as well?
>>
>> On Thu, 10 Oct 2019 at 11:22, John Smith <ja...@gmail.com> wrote:
>>
>>> You are correct, it is running in a VM.
>>>
>>> On Thu, 10 Oct 2019 at 10:11, Denis Mekhanikov <dm...@gmail.com>
>>> wrote:
>>>
>>>> Hi!
>>>>
>>>> There are the following messages in the logs:
>>>>
>>>> [22:26:21,816][WARNING][jvm-pause-detector-worker][IgniteKernal%xxxxxx]
>>>> Possible too long JVM pause: *55705 milliseconds*.
>>>> ...
>>>> [22:26:21,847][SEVERE][ttl-cleanup-worker-#48%xxxxxx%][G] Blocked
>>>> system-critical thread has been detected. This can lead to cluster-wide
>>>> undefined behaviour [threadName=partition-exchanger, blockedFor=*57s*]
>>>>
>>>> Looks like the JVM was paused for almost a minute. It doesn’t seem to
>>>> be caused by a garbage collection, since there is no evidence of GC
>>>> pressure in the GC log. Usually such big pauses happen in virtualised
>>>> environments when backups are captured from machines or they just don’t
>>>> have enough CPU time.
>>>>
>>>> Looking at safepoint statistics may also reveal some interesting
>>>> details. You can learn about safepoints here:
>>>> https://blog.gceasy.io/2016/12/22/total-time-for-which-application-threads-were-stopped/
>>>>
>>>> Denis
>>>> On 9 Oct 2019, 23:14 +0300, John Smith <ja...@gmail.com>, wrote:
>>>>
>>>> So the error says to set clientFailureDetectionTimeout=30000
>>>>
>>>> 1- Do I put a higher value than 30000?
>>>> 2- Do I do it on the client or the server nodes or all nodes?
>>>> 3- Also if a client is misbehaving why shut off the server node?
>>>>
>>>> On Thu, 3 Oct 2019 at 21:02, John Smith <ja...@gmail.com> wrote:
>>>>
>>>>> But if it's the client node that's failing why is the server node
>>>>> stopping? I'm pretty sure we do very simple put and get operations. All
>>>>> the client nodes are started as client=true
>>>>>
>>>>> On Thu., Oct. 3, 2019, 4:18 p.m. Denis Magda, <dm...@apache.org>
>>>>> wrote:
>>>>>
>>>>>> Hi John,
>>>>>>
>>>>>> I don't see any GC pressure or STW pauses either. If not GC then it
>>>>>> might have been caused by a network glitch or some long-running operation
>>>>>> started by the app. These log statements:
>>>>>>
>>>>>>
>>>>>> [22:26:21,827][WARNING][tcp-disco-client-message-worker-#10%xxxxxx%][TcpDiscoverySpi]
>>>>>> Client node considered as unreachable and will be dropped from cluster,
>>>>>> because no metrics update messages received in interval:
>>>>>> TcpDiscoverySpi.clientFailureDetectionTimeout() ms. It may be caused by
>>>>>> network problems or long GC pause on client node, try to increase this
>>>>>> parameter. [nodeId=b07182d0-bf70-4318-9fe3-d7d5228bd6ef,
>>>>>> clientFailureDetectionTimeout=30000]
>>>>>>
>>>>>>
>>>>>> [22:26:21,839][WARNING][tcp-disco-client-message-worker-#12%xxxxxx%][TcpDiscoverySpi]
>>>>>> Client node considered as unreachable and will be dropped from cluster,
>>>>>> because no metrics update messages received in interval:
>>>>>> TcpDiscoverySpi.clientFailureDetectionTimeout() ms. It may be caused by
>>>>>> network problems or long GC pause on client node, try to increase this
>>>>>> parameter. [nodeId=302cff60-b88d-40da-9e12-b955e6bf973d,
>>>>>> clientFailureDetectionTimeout=30000]
>>>>>>
>>>>>>
>>>>>> [22:26:21,847][SEVERE][ttl-cleanup-worker-#48%xxxxxx%][G] Blocked
>>>>>> system-critical thread has been detected. This can lead to cluster-wide
>>>>>> undefined behaviour [threadName=partition-exchanger, blockedFor=57s]
>>>>>>
>>>>>>
>>>>>> 22:26:21,954][SEVERE][ttl-cleanup-worker-#48%xxxxxx%][] Critical
>>>>>> system error detected. Will be handled accordingly to configured handler
>>>>>> [hnd=StopNodeOrHaltFailureHandler [tryStop=false, timeout=0,
>>>>>> super=AbstractFailureHandler
>>>>>> [ignoredFailureTypes=[SYSTEM_WORKER_BLOCKED]]], failureCtx=FailureContext
>>>>>> [type=SYSTEM_WORKER_BLOCKED, err=class o.a.i.IgniteException: GridWorker
>>>>>> [name=partition-exchanger, igniteInstanceName=xxxxxx, finished=false,
>>>>>> heartbeatTs=1568931981805]]]
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>> -
>>>>>> Denis
>>>>>>
>>>>>>
>>>>>> On Thu, Oct 3, 2019 at 11:50 AM John Smith <ja...@gmail.com>
>>>>>> wrote:
>>>>>>
>>>>>>> So I have been monitoring my node and the same one seems to stop
>>>>>>> once in a while.
>>>>>>>
>>>>>>> https://www.dropbox.com/s/7n5qfsl5uyi1obt/ignite-logs.zip?dl=0
>>>>>>>
>>>>>>> I have attached the GC logs and the ignite logs. From what I see
>>>>>>> from gc.logs I don't see big pauses. I could be wrong.
>>>>>>>
>>>>>>> The machine is 16GB and I have the configs here:
>>>>>>> https://www.dropbox.com/s/hkv38s3vce5a4sk/ignite-config.xml?dl=0
>>>>>>>
>>>>>>> Here are the JVM settings...
>>>>>>>
>>>>>>> if [ -z "$JVM_OPTS" ] ; then
>>>>>>>     JVM_OPTS="-Xms2g -Xmx2g -server -XX:MaxMetaspaceSize=256m"
>>>>>>> fi
>>>>>>>
>>>>>>> JVM_OPTS="$JVM_OPTS -XX:+UseG1GC -verbose:gc -XX:+PrintGCDetails
>>>>>>> -Xloggc:/var/log/apache-ignite/gc.log"
>>>>>>>
>>>>>>> JVM_OPTS="${JVM_OPTS} -Xss16m"
>>>>>>>
>>>>>>

Re: Node stopped.

Posted by Denis Mekhanikov <dm...@gmail.com>.
Unfortunately, I don’t.
You can ask the VM vendor or the cloud provider (if you use any) for proper tooling or logs.
Make sure that there is no step in the VM’s lifecycle that makes it freeze for a minute.
Also make sure that the physical CPU is not overutilized and no VMs that run on it are starving.
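
A quick, non-authoritative way to look for that kind of starvation from inside the guest is to watch the CPU steal time the kernel reports:

# "st" in vmstat (and "%st" in top) is the share of time this VM was runnable
# but the hypervisor was running something else; sustained non-zero values
# suggest an overcommitted host.
vmstat 5 12            # one sample every 5 seconds for a minute
top -b -n 1 | head -5  # the Cpu(s) summary line includes %st

Steal time only covers CPU contention, so a pause caused by a snapshot, backup or live migration would still need to be confirmed with the provider.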

Denis
On 10 Oct 2019, 19:03 +0300, John Smith <ja...@gmail.com>, wrote:
> Do you know of any good tools I can use to check the VM?
>
> > On Thu, 10 Oct 2019 at 11:38, Denis Mekhanikov <dm...@gmail.com> wrote:
> > > > Hi Dennis, so are you saying I should enable GC logs + the safe point logs as well?
> > >
> > > Having safepoint statistics in your GC logs may be useful, so I recommend enabling them for troubleshooting purposes.
> > > Check the lifecycle of your virtual machines. There is a high chance that the whole machine is frozen, not just the Ignite node.
> > >
> > > Denis
> > > On 10 Oct 2019, 18:25 +0300, John Smith <ja...@gmail.com>, wrote:
> > > > Hi Dennis, so are you saying I should enable GC logs + the safe point logs as well?
> > > >
> > > > > On Thu, 10 Oct 2019 at 11:22, John Smith <ja...@gmail.com> wrote:
> > > > > > You are correct, it is running in a VM.
> > > > > >
> > > > > > > On Thu, 10 Oct 2019 at 10:11, Denis Mekhanikov <dm...@gmail.com> wrote:
> > > > > > > > Hi!
> > > > > > > >
> > > > > > > > There are the following messages in the logs:
> > > > > > > >
> > > > > > > > [22:26:21,816][WARNING][jvm-pause-detector-worker][IgniteKernal%xxxxxx] Possible too long JVM pause: 55705 milliseconds.
> > > > > > > > ...
> > > > > > > > [22:26:21,847][SEVERE][ttl-cleanup-worker-#48%xxxxxx%][G] Blocked system-critical thread has been detected. This can lead to cluster-wide undefined behaviour [threadName=partition-exchanger, blockedFor=57s]
> > > > > > > >
> > > > > > > > Looks like the JVM was paused for almost a minute. It doesn’t seem to be caused by a garbage collection, since there is no evidence of GC pressure in the GC log. Usually such big pauses happen in virtualised environments when backups are captured from machines or they just don’t have enough CPU time.
> > > > > > > >
> > > > > > > > Looking at safepoint statistics may also reveal some interesting details. You can learn about safepoints here: https://blog.gceasy.io/2016/12/22/total-time-for-which-application-threads-were-stopped/
> > > > > > > >
> > > > > > > > Denis
> > > > > > > > On 9 Oct 2019, 23:14 +0300, John Smith <ja...@gmail.com>, wrote:
> > > > > > > > > So the error says to set clientFailureDetectionTimeout=30000
> > > > > > > > >
> > > > > > > > > 1- Do I put a higher value than 30000?
> > > > > > > > > 2- Do I do it on the client or the server nodes or all nodes?
> > > > > > > > > 3- Also if a client is misbehaving why shut off the server node?
> > > > > > > > >
> > > > > > > > > > On Thu, 3 Oct 2019 at 21:02, John Smith <ja...@gmail.com> wrote:
> > > > > > > > > > > But if it's the client node that's failing why is the server node stopping? I'm pretty sure we do very simple put and get operations. All the client nodes are started as client=true
> > > > > > > > > > >
> > > > > > > > > > > > On Thu., Oct. 3, 2019, 4:18 p.m. Denis Magda, <dm...@apache.org> wrote:
> > > > > > > > > > > > > Hi John,
> > > > > > > > > > > > >
> > > > > > > > > > > > > I don't see any GC pressure or STW pauses either. If not GC then it might have been caused by a network glitch or some long-running operation started by the app. These log statements:
> > > > > > > > > > > > >
> > > > > > > > > > > > >
> > > > > > > > > > > > > [22:26:21,827][WARNING][tcp-disco-client-message-worker-#10%xxxxxx%][TcpDiscoverySpi] Client node considered as unreachable and will be dropped from cluster, because no metrics update messages received in interval: TcpDiscoverySpi.clientFailureDetectionTimeout() ms. It may be caused by network problems or long GC pause on client node, try to increase this parameter. [nodeId=b07182d0-bf70-4318-9fe3-d7d5228bd6ef, clientFailureDetectionTimeout=30000]
> > > > > > > > > > > > >
> > > > > > > > > > > > > [22:26:21,839][WARNING][tcp-disco-client-message-worker-#12%xxxxxx%][TcpDiscoverySpi] Client node considered as unreachable and will be dropped from cluster, because no metrics update messages received in interval: TcpDiscoverySpi.clientFailureDetectionTimeout() ms. It may be caused by network problems or long GC pause on client node, try to increase this parameter. [nodeId=302cff60-b88d-40da-9e12-b955e6bf973d, clientFailureDetectionTimeout=30000]
> > > > > > > > > > > > >
> > > > > > > > > > > > > [22:26:21,847][SEVERE][ttl-cleanup-worker-#48%xxxxxx%][G] Blocked system-critical thread has been detected. This can lead to cluster-wide undefined behaviour [threadName=partition-exchanger, blockedFor=57s]
> > > > > > > > > > > > >
> > > > > > > > > > > > > 22:26:21,954][SEVERE][ttl-cleanup-worker-#48%xxxxxx%][] Critical system error detected. Will be handled accordingly to configured handler [hnd=StopNodeOrHaltFailureHandler [tryStop=false, timeout=0, super=AbstractFailureHandler [ignoredFailureTypes=[SYSTEM_WORKER_BLOCKED]]], failureCtx=FailureContext [type=SYSTEM_WORKER_BLOCKED, err=class o.a.i.IgniteException: GridWorker [name=partition-exchanger, igniteInstanceName=xxxxxx, finished=false, heartbeatTs=1568931981805]]]
> > > > > > > > > > > > >
> > > > > > > > > > > > >
> > > > > > > > > > > > >
> > > > > > > > > > > > >
> > > > > > > > > > > > > -
> > > > > > > > > > > > > Denis
> > > > > > > > > > > > >
> > > > > > > > > > > > >
> > > > > > > > > > > > > > On Thu, Oct 3, 2019 at 11:50 AM John Smith <ja...@gmail.com> wrote:
> > > > > > > > > > > > > > > So I have been monitoring my node and the same one seems to stop once in a while.
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > https://www.dropbox.com/s/7n5qfsl5uyi1obt/ignite-logs.zip?dl=0
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > I have attached the GC logs and the ignite logs. From what I see from gc.logs I don't see big pauses. I could be wrong.
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > The machine is 16GB and I have the configs here: https://www.dropbox.com/s/hkv38s3vce5a4sk/ignite-config.xml?dl=0
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > Here are the JVM settings...
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > if [ -z "$JVM_OPTS" ] ; then
> > > > > > > > > > > > > > >     JVM_OPTS="-Xms2g -Xmx2g -server -XX:MaxMetaspaceSize=256m"
> > > > > > > > > > > > > > > fi
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > JVM_OPTS="$JVM_OPTS -XX:+UseG1GC -verbose:gc -XX:+PrintGCDetails -Xloggc:/var/log/apache-ignite/gc.log"
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > JVM_OPTS="${JVM_OPTS} -Xss16m"

Re: Node stopped.

Posted by John Smith <ja...@gmail.com>.
Do you know of any good tools I can use to check the VM?

On Thu, 10 Oct 2019 at 11:38, Denis Mekhanikov <dm...@gmail.com>
wrote:

> > Hi Dennis, so are you saying I should enable GC logs + the safe point
> logs as well?
>
> Having safepoint statistics in your GC logs may be useful, so I recommend
> enabling them for troubleshooting purposes.
> Check the lifecycle of your virtual machines. There is a high chance that
> the whole machine is frozen, not just the Ignite node.
>
> Denis
> On 10 Oct 2019, 18:25 +0300, John Smith <ja...@gmail.com>, wrote:
>
> Hi Dennis, so are you saying I should enable GC logs + the safe point logs
> as well?
>
> On Thu, 10 Oct 2019 at 11:22, John Smith <ja...@gmail.com> wrote:
>
>> You are correct, it is running in a VM.
>>
>> On Thu, 10 Oct 2019 at 10:11, Denis Mekhanikov <dm...@gmail.com>
>> wrote:
>>
>>> Hi!
>>>
>>> There are the following messages in the logs:
>>>
>>> [22:26:21,816][WARNING][jvm-pause-detector-worker][IgniteKernal%xxxxxx]
>>> Possible too long JVM pause: *55705 milliseconds*.
>>> ...
>>> [22:26:21,847][SEVERE][ttl-cleanup-worker-#48%xxxxxx%][G] Blocked
>>> system-critical thread has been detected. This can lead to cluster-wide
>>> undefined behaviour [threadName=partition-exchanger, blockedFor=*57s*]
>>>
>>> Looks like the JVM was paused for almost a minute. It doesn’t seem to be
>>> caused by a garbage collection, since there is no evidence of GC pressure
>>> in the GC log. Usually such big pauses happen in virtualised environments
>>> when backups are captured from machines or they just don’t have enough CPU
>>> time.
>>>
>>> Looking at safepoint statistics may also reveal some interesting
>>> details. You can learn about safepoints here:
>>> https://blog.gceasy.io/2016/12/22/total-time-for-which-application-threads-were-stopped/
>>>
>>> Denis
>>> On 9 Oct 2019, 23:14 +0300, John Smith <ja...@gmail.com>, wrote:
>>>
>>> So the error says to set clientFailureDetectionTimeout=30000
>>>
>>> 1- Do I put a higher value than 30000?
>>> 2- Do I do it on the client or the server nodes or all nodes?
>>> 3- Also if a client is misbehaving why shut off the server node?
>>>
>>> On Thu, 3 Oct 2019 at 21:02, John Smith <ja...@gmail.com> wrote:
>>>
>>>> But if it's the client node that's failing why is the server node
>>>> stopping? I'm pretty sure we do very simple put and get operations. All
>>>> the client nodes are started as client=true
>>>>
>>>> On Thu., Oct. 3, 2019, 4:18 p.m. Denis Magda, <dm...@apache.org>
>>>> wrote:
>>>>
>>>>> Hi John,
>>>>>
>>>>> I don't see any GC pressure or STW pauses either. If not GC then it
>>>>> might have been caused by a network glitch or some long-running operation
>>>>> started by the app. These log statements:
>>>>>
>>>>>
>>>>> [22:26:21,827][WARNING][tcp-disco-client-message-worker-#10%xxxxxx%][TcpDiscoverySpi]
>>>>> Client node considered as unreachable and will be dropped from cluster,
>>>>> because no metrics update messages received in interval:
>>>>> TcpDiscoverySpi.clientFailureDetectionTimeout() ms. It may be caused by
>>>>> network problems or long GC pause on client node, try to increase this
>>>>> parameter. [nodeId=b07182d0-bf70-4318-9fe3-d7d5228bd6ef,
>>>>> clientFailureDetectionTimeout=30000]
>>>>>
>>>>>
>>>>> [22:26:21,839][WARNING][tcp-disco-client-message-worker-#12%xxxxxx%][TcpDiscoverySpi]
>>>>> Client node considered as unreachable and will be dropped from cluster,
>>>>> because no metrics update messages received in interval:
>>>>> TcpDiscoverySpi.clientFailureDetectionTimeout() ms. It may be caused by
>>>>> network problems or long GC pause on client node, try to increase this
>>>>> parameter. [nodeId=302cff60-b88d-40da-9e12-b955e6bf973d,
>>>>> clientFailureDetectionTimeout=30000]
>>>>>
>>>>>
>>>>> [22:26:21,847][SEVERE][ttl-cleanup-worker-#48%xxxxxx%][G] Blocked
>>>>> system-critical thread has been detected. This can lead to cluster-wide
>>>>> undefined behaviour [threadName=partition-exchanger, blockedFor=57s]
>>>>>
>>>>>
>>>>> 22:26:21,954][SEVERE][ttl-cleanup-worker-#48%xxxxxx%][] Critical
>>>>> system error detected. Will be handled accordingly to configured handler
>>>>> [hnd=StopNodeOrHaltFailureHandler [tryStop=false, timeout=0,
>>>>> super=AbstractFailureHandler
>>>>> [ignoredFailureTypes=[SYSTEM_WORKER_BLOCKED]]], failureCtx=FailureContext
>>>>> [type=SYSTEM_WORKER_BLOCKED, err=class o.a.i.IgniteException: GridWorker
>>>>> [name=partition-exchanger, igniteInstanceName=xxxxxx, finished=false,
>>>>> heartbeatTs=1568931981805]]]
>>>>>
>>>>>
>>>>>
>>>>>
>>>>> -
>>>>> Denis
>>>>>
>>>>>
>>>>> On Thu, Oct 3, 2019 at 11:50 AM John Smith <ja...@gmail.com>
>>>>> wrote:
>>>>>
>>>>>> So I have been monitoring my node and the same one seems to stop once
>>>>>> in a while.
>>>>>>
>>>>>> https://www.dropbox.com/s/7n5qfsl5uyi1obt/ignite-logs.zip?dl=0
>>>>>>
>>>>>> I have attached the GC logs and the ignite logs. From what I see from
>>>>>> gc.logs I don't see big pauses. I could be wrong.
>>>>>>
>>>>>> The machine is 16GB and I have the configs here:
>>>>>> https://www.dropbox.com/s/hkv38s3vce5a4sk/ignite-config.xml?dl=0
>>>>>>
>>>>>> Here are the JVM settings...
>>>>>>
>>>>>> if [ -z "$JVM_OPTS" ] ; then
>>>>>>     JVM_OPTS="-Xms2g -Xmx2g -server -XX:MaxMetaspaceSize=256m"
>>>>>> fi
>>>>>>
>>>>>> JVM_OPTS="$JVM_OPTS -XX:+UseG1GC -verbose:gc -XX:+PrintGCDetails
>>>>>> -Xloggc:/var/log/apache-ignite/gc.log"
>>>>>>
>>>>>> JVM_OPTS="${JVM_OPTS} -Xss16m"
>>>>>>
>>>>>

Re: Node stopped.

Posted by Denis Mekhanikov <dm...@gmail.com>.
> Hi Dennis, so are you saying I should enable GC logs + the safe point logs as well?

Having safepoint statistics in your GC logs may be useful, so I recommend enabling them for troubleshooting purposes.
Check the lifecycle of your virtual machines. There is a high chance that the whole machine is frozen, not just the Ignite node.
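
A minimal sketch of how that could be added to the startup script shown earlier in the thread, assuming the same JDK 8 style -XX:+Print* options that are already in use:

# PrintGCApplicationStoppedTime adds "Total time for which application threads
# were stopped" lines to the GC log; the safepoint statistics themselves are
# written to stdout, so capture the node's stdout if you want to keep them.
JVM_OPTS="${JVM_OPTS} -XX:+PrintGCApplicationStoppedTime -XX:+PrintGCApplicationConcurrentTime"
JVM_OPTS="${JVM_OPTS} -XX:+PrintSafepointStatistics -XX:PrintSafepointStatisticsCount=1"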

Denis
On 10 Oct 2019, 18:25 +0300, John Smith <ja...@gmail.com>, wrote:
> Hi Dennis, so are you saying I should enable GC logs + the safe point logs as well?
>
> > On Thu, 10 Oct 2019 at 11:22, John Smith <ja...@gmail.com> wrote:
> > > You are correct, it is running in a VM.
> > >
> > > > On Thu, 10 Oct 2019 at 10:11, Denis Mekhanikov <dm...@gmail.com> wrote:
> > > > > Hi!
> > > > >
> > > > > There are the following messages in the logs:
> > > > >
> > > > > [22:26:21,816][WARNING][jvm-pause-detector-worker][IgniteKernal%xxxxxx] Possible too long JVM pause: 55705 milliseconds.
> > > > > ...
> > > > > [22:26:21,847][SEVERE][ttl-cleanup-worker-#48%xxxxxx%][G] Blocked system-critical thread has been detected. This can lead to cluster-wide undefined behaviour [threadName=partition-exchanger, blockedFor=57s]
> > > > >
> > > > > Looks like the JVM was paused for almost a minute. It doesn’t seem to be caused by a garbage collection, since there is no evidence of GC pressure in the GC log. Usually such big pauses happen in virtualised environments when backups are captured from machines or they just don’t have enough CPU time.
> > > > >
> > > > > Looking at safepoint statistics may also reveal some interesting details. You can learn about safepoints here: https://blog.gceasy.io/2016/12/22/total-time-for-which-application-threads-were-stopped/
> > > > >
> > > > > Denis
> > > > > On 9 Oct 2019, 23:14 +0300, John Smith <ja...@gmail.com>, wrote:
> > > > > > So the error says to set clientFailureDetectionTimeout=30000
> > > > > >
> > > > > > 1- Do I put a higher value than 30000?
> > > > > > 2- Do I do it on the client or the server nodes or all nodes?
> > > > > > 3- Also if a client is misbehaving why shut off the server node?
> > > > > >
> > > > > > > On Thu, 3 Oct 2019 at 21:02, John Smith <ja...@gmail.com> wrote:
> > > > > > > > But if it's the client node that's failing why is the server node stopping? I'm pretty sure we do very simple put and get operations. All the client nodes are started as client=true
> > > > > > > >
> > > > > > > > > On Thu., Oct. 3, 2019, 4:18 p.m. Denis Magda, <dm...@apache.org> wrote:
> > > > > > > > > > Hi John,
> > > > > > > > > >
> > > > > > > > > > I don't see any GC pressure or STW pauses either. If not GC then it might have been caused by a network glitch or some long-running operation started by the app. These log statements:
> > > > > > > > > >
> > > > > > > > > >
> > > > > > > > > > [22:26:21,827][WARNING][tcp-disco-client-message-worker-#10%xxxxxx%][TcpDiscoverySpi] Client node considered as unreachable and will be dropped from cluster, because no metrics update messages received in interval: TcpDiscoverySpi.clientFailureDetectionTimeout() ms. It may be caused by network problems or long GC pause on client node, try to increase this parameter. [nodeId=b07182d0-bf70-4318-9fe3-d7d5228bd6ef, clientFailureDetectionTimeout=30000]
> > > > > > > > > >
> > > > > > > > > > [22:26:21,839][WARNING][tcp-disco-client-message-worker-#12%xxxxxx%][TcpDiscoverySpi] Client node considered as unreachable and will be dropped from cluster, because no metrics update messages received in interval: TcpDiscoverySpi.clientFailureDetectionTimeout() ms. It may be caused by network problems or long GC pause on client node, try to increase this parameter. [nodeId=302cff60-b88d-40da-9e12-b955e6bf973d, clientFailureDetectionTimeout=30000]
> > > > > > > > > >
> > > > > > > > > > [22:26:21,847][SEVERE][ttl-cleanup-worker-#48%xxxxxx%][G] Blocked system-critical thread has been detected. This can lead to cluster-wide undefined behaviour [threadName=partition-exchanger, blockedFor=57s]
> > > > > > > > > >
> > > > > > > > > > 22:26:21,954][SEVERE][ttl-cleanup-worker-#48%xxxxxx%][] Critical system error detected. Will be handled accordingly to configured handler [hnd=StopNodeOrHaltFailureHandler [tryStop=false, timeout=0, super=AbstractFailureHandler [ignoredFailureTypes=[SYSTEM_WORKER_BLOCKED]]], failureCtx=FailureContext [type=SYSTEM_WORKER_BLOCKED, err=class o.a.i.IgniteException: GridWorker [name=partition-exchanger, igniteInstanceName=xxxxxx, finished=false, heartbeatTs=1568931981805]]]
> > > > > > > > > >
> > > > > > > > > >
> > > > > > > > > >
> > > > > > > > > >
> > > > > > > > > > -
> > > > > > > > > > Denis
> > > > > > > > > >
> > > > > > > > > >
> > > > > > > > > > > On Thu, Oct 3, 2019 at 11:50 AM John Smith <ja...@gmail.com> wrote:
> > > > > > > > > > > > So I have been monitoring my node and the same one seems to stop once in a while.
> > > > > > > > > > > >
> > > > > > > > > > > > https://www.dropbox.com/s/7n5qfsl5uyi1obt/ignite-logs.zip?dl=0
> > > > > > > > > > > >
> > > > > > > > > > > > I have attached the GC logs and the ignite logs. From what I see from gc.logs I don't see big pauses. I could be wrong.
> > > > > > > > > > > >
> > > > > > > > > > > > The machine is 16GB and I have the configs here: https://www.dropbox.com/s/hkv38s3vce5a4sk/ignite-config.xml?dl=0
> > > > > > > > > > > >
> > > > > > > > > > > > Here are the JVM settings...
> > > > > > > > > > > >
> > > > > > > > > > > > if [ -z "$JVM_OPTS" ] ; then
> > > > > > > > > > > >     JVM_OPTS="-Xms2g -Xmx2g -server -XX:MaxMetaspaceSize=256m"
> > > > > > > > > > > > fi
> > > > > > > > > > > >
> > > > > > > > > > > > JVM_OPTS="$JVM_OPTS -XX:+UseG1GC -verbose:gc -XX:+PrintGCDetails -Xloggc:/var/log/apache-ignite/gc.log"
> > > > > > > > > > > >
> > > > > > > > > > > > JVM_OPTS="${JVM_OPTS} -Xss16m"

Re: Node stopped.

Posted by John Smith <ja...@gmail.com>.
Hi Denis, so are you saying I should enable GC logs + the safepoint logs
as well?

On Thu, 10 Oct 2019 at 11:22, John Smith <ja...@gmail.com> wrote:

> You are correct, it is running in a VM.
>
> On Thu, 10 Oct 2019 at 10:11, Denis Mekhanikov <dm...@gmail.com>
> wrote:
>
>> Hi!
>>
>> There are the following messages in the logs:
>>
>> [22:26:21,816][WARNING][jvm-pause-detector-worker][IgniteKernal%xxxxxx]
>> Possible too long JVM pause: *55705 milliseconds*.
>> ...
>> [22:26:21,847][SEVERE][ttl-cleanup-worker-#48%xxxxxx%][G] Blocked
>> system-critical thread has been detected. This can lead to cluster-wide
>> undefined behaviour [threadName=partition-exchanger, blockedFor=*57s*]
>>
>> Looks like the JVM was paused for almost a minute. It doesn’t seem to be
>> caused by a garbage collection, since there is no evidence of GC pressure
>> in the GC log. Usually such big pauses happen in virtualised environments
>> when backups are captured from machines or they just don’t have enough CPU
>> time.
>>
>> Looking at safepoint statistics may also reveal some interesting details.
>> You can learn about safepoints here:
>> https://blog.gceasy.io/2016/12/22/total-time-for-which-application-threads-were-stopped/
>>
>> Denis
>> On 9 Oct 2019, 23:14 +0300, John Smith <ja...@gmail.com>, wrote:
>>
>> So the error says to set clientFailureDetectionTimeout=30000
>>
>> 1- Do I put a higher value than 30000?
>> 2- Do I do it on the client or the server nodes or all nodes?
>> 3- Also if a client is misbehaving why shut off the server node?
>>
>> On Thu, 3 Oct 2019 at 21:02, John Smith <ja...@gmail.com> wrote:
>>
>>> But if it's the client node that's failing why is the server node
>>> stopping? I'm pretty sure we do very simple put and get operations. All
>>> the client nodes are started as client=true
>>>
>>> On Thu., Oct. 3, 2019, 4:18 p.m. Denis Magda, <dm...@apache.org> wrote:
>>>
>>>> Hi John,
>>>>
>>>> I don't see any GC pressure or STW pauses either. If not GC then it
>>>> might have been caused by a network glitch or some long-running operation
>>>> started by the app. These log statements:
>>>>
>>>>
>>>> [22:26:21,827][WARNING][tcp-disco-client-message-worker-#10%xxxxxx%][TcpDiscoverySpi]
>>>> Client node considered as unreachable and will be dropped from cluster,
>>>> because no metrics update messages received in interval:
>>>> TcpDiscoverySpi.clientFailureDetectionTimeout() ms. It may be caused by
>>>> network problems or long GC pause on client node, try to increase this
>>>> parameter. [nodeId=b07182d0-bf70-4318-9fe3-d7d5228bd6ef,
>>>> clientFailureDetectionTimeout=30000]
>>>>
>>>>
>>>> [22:26:21,839][WARNING][tcp-disco-client-message-worker-#12%xxxxxx%][TcpDiscoverySpi]
>>>> Client node considered as unreachable and will be dropped from cluster,
>>>> because no metrics update messages received in interval:
>>>> TcpDiscoverySpi.clientFailureDetectionTimeout() ms. It may be caused by
>>>> network problems or long GC pause on client node, try to increase this
>>>> parameter. [nodeId=302cff60-b88d-40da-9e12-b955e6bf973d,
>>>> clientFailureDetectionTimeout=30000]
>>>>
>>>>
>>>> [22:26:21,847][SEVERE][ttl-cleanup-worker-#48%xxxxxx%][G] Blocked
>>>> system-critical thread has been detected. This can lead to cluster-wide
>>>> undefined behaviour [threadName=partition-exchanger, blockedFor=57s]
>>>>
>>>>
>>>> 22:26:21,954][SEVERE][ttl-cleanup-worker-#48%xxxxxx%][] Critical system
>>>> error detected. Will be handled accordingly to configured handler
>>>> [hnd=StopNodeOrHaltFailureHandler [tryStop=false, timeout=0,
>>>> super=AbstractFailureHandler
>>>> [ignoredFailureTypes=[SYSTEM_WORKER_BLOCKED]]], failureCtx=FailureContext
>>>> [type=SYSTEM_WORKER_BLOCKED, err=class o.a.i.IgniteException: GridWorker
>>>> [name=partition-exchanger, igniteInstanceName=xxxxxx, finished=false,
>>>> heartbeatTs=1568931981805]]]
>>>>
>>>>
>>>>
>>>>
>>>> -
>>>> Denis
>>>>
>>>>
>>>> On Thu, Oct 3, 2019 at 11:50 AM John Smith <ja...@gmail.com>
>>>> wrote:
>>>>
>>>>> So I have been monitoring my node and the same one seems to stop once
>>>>> in a while.
>>>>>
>>>>> https://www.dropbox.com/s/7n5qfsl5uyi1obt/ignite-logs.zip?dl=0
>>>>>
>>>>> I have attached the GC logs and the ignite logs. From what I see from
>>>>> gc.logs I don't see big pauses. I could be wrong.
>>>>>
>>>>> The machine is 16GB and I have the configs here:
>>>>> https://www.dropbox.com/s/hkv38s3vce5a4sk/ignite-config.xml?dl=0
>>>>>
>>>>> Here are the JVM settings...
>>>>>
>>>>> if [ -z "$JVM_OPTS" ] ; then
>>>>>     JVM_OPTS="-Xms2g -Xmx2g -server -XX:MaxMetaspaceSize=256m"
>>>>> fi
>>>>>
>>>>> JVM_OPTS="$JVM_OPTS -XX:+UseG1GC -verbose:gc -XX:+PrintGCDetails
>>>>> -Xloggc:/var/log/apache-ignite/gc.log"
>>>>>
>>>>> JVM_OPTS="${JVM_OPTS} -Xss16m"
>>>>>
>>>>

Re: Node stopped.

Posted by John Smith <ja...@gmail.com>.
You are correct, it is running in a VM.

On Thu, 10 Oct 2019 at 10:11, Denis Mekhanikov <dm...@gmail.com>
wrote:

> Hi!
>
> There are the following messages in the logs:
>
> [22:26:21,816][WARNING][jvm-pause-detector-worker][IgniteKernal%xxxxxx]
> Possible too long JVM pause: *55705 milliseconds*.
> ...
> [22:26:21,847][SEVERE][ttl-cleanup-worker-#48%xxxxxx%][G] Blocked
> system-critical thread has been detected. This can lead to cluster-wide
> undefined behaviour [threadName=partition-exchanger, blockedFor=*57s*]
>
> Looks like the JVM was paused for almost a minute. It doesn’t seem to be
> caused by a garbage collection, since there is no evidence of GC pressure
> in the GC log. Usually such big pauses happen in virtualised environments
> when backups are captured from machines or they just don’t have enough CPU
> time.
>
> Looking at safepoint statistics may also reveal some interesting details.
> You can learn about safepoints here:
> https://blog.gceasy.io/2016/12/22/total-time-for-which-application-threads-were-stopped/
>
> Denis
> On 9 Oct 2019, 23:14 +0300, John Smith <ja...@gmail.com>, wrote:
>
> So the error says to set clientFailureDetectionTimeout=30000
>
> 1- Do I put a higher value than 30000?
> 2- Do I do it on the client or the server nodes or all nodes?
> 3- Also if a client is misbehaving why shut off the server node?
>
> On Thu, 3 Oct 2019 at 21:02, John Smith <ja...@gmail.com> wrote:
>
>> But if it's the client node that's failing why is the server node
>> stopping? I'm pretty sure we do very simple put and get operations. All
>> the client nodes are started as client=true
>>
>> On Thu., Oct. 3, 2019, 4:18 p.m. Denis Magda, <dm...@apache.org> wrote:
>>
>>> Hi John,
>>>
>>> I don't see any GC pressure or STW pauses either. If not GC then it
>>> might have been caused by a network glitch or some long-running operation
>>> started by the app. These log statements:
>>>
>>>
>>> [22:26:21,827][WARNING][tcp-disco-client-message-worker-#10%xxxxxx%][TcpDiscoverySpi]
>>> Client node considered as unreachable and will be dropped from cluster,
>>> because no metrics update messages received in interval:
>>> TcpDiscoverySpi.clientFailureDetectionTimeout() ms. It may be caused by
>>> network problems or long GC pause on client node, try to increase this
>>> parameter. [nodeId=b07182d0-bf70-4318-9fe3-d7d5228bd6ef,
>>> clientFailureDetectionTimeout=30000]
>>>
>>>
>>> [22:26:21,839][WARNING][tcp-disco-client-message-worker-#12%xxxxxx%][TcpDiscoverySpi]
>>> Client node considered as unreachable and will be dropped from cluster,
>>> because no metrics update messages received in interval:
>>> TcpDiscoverySpi.clientFailureDetectionTimeout() ms. It may be caused by
>>> network problems or long GC pause on client node, try to increase this
>>> parameter. [nodeId=302cff60-b88d-40da-9e12-b955e6bf973d,
>>> clientFailureDetectionTimeout=30000]
>>>
>>>
>>> [22:26:21,847][SEVERE][ttl-cleanup-worker-#48%xxxxxx%][G] Blocked
>>> system-critical thread has been detected. This can lead to cluster-wide
>>> undefined behaviour [threadName=partition-exchanger, blockedFor=57s]
>>>
>>>
>>> 22:26:21,954][SEVERE][ttl-cleanup-worker-#48%xxxxxx%][] Critical system
>>> error detected. Will be handled accordingly to configured handler
>>> [hnd=StopNodeOrHaltFailureHandler [tryStop=false, timeout=0,
>>> super=AbstractFailureHandler
>>> [ignoredFailureTypes=[SYSTEM_WORKER_BLOCKED]]], failureCtx=FailureContext
>>> [type=SYSTEM_WORKER_BLOCKED, err=class o.a.i.IgniteException: GridWorker
>>> [name=partition-exchanger, igniteInstanceName=xxxxxx, finished=false,
>>> heartbeatTs=1568931981805]]]
>>>
>>>
>>>
>>>
>>> -
>>> Denis
>>>
>>>
>>> On Thu, Oct 3, 2019 at 11:50 AM John Smith <ja...@gmail.com>
>>> wrote:
>>>
>>>> So I have been monitoring my node and the same one seems to stop once
>>>> in a while.
>>>>
>>>> https://www.dropbox.com/s/7n5qfsl5uyi1obt/ignite-logs.zip?dl=0
>>>>
>>>> I have attached the GC logs and the ignite logs. From what I see from
>>>> gc.logs I don't see big pauses. I could be wrong.
>>>>
>>>> The machine is 16GB and I have the configs here:
>>>> https://www.dropbox.com/s/hkv38s3vce5a4sk/ignite-config.xml?dl=0
>>>>
>>>> Here are the JVM settings...
>>>>
>>>> if [ -z "$JVM_OPTS" ] ; then
>>>>     JVM_OPTS="-Xms2g -Xmx2g -server -XX:MaxMetaspaceSize=256m"
>>>> fi
>>>>
>>>> JVM_OPTS="$JVM_OPTS -XX:+UseG1GC -verbose:gc -XX:+PrintGCDetails
>>>> -Xloggc:/var/log/apache-ignite/gc.log"
>>>>
>>>> JVM_OPTS="${JVM_OPTS} -Xss16m"
>>>>
>>>

Re: Node stopped.

Posted by Denis Mekhanikov <dm...@gmail.com>.
Hi!

There are the following messages in the logs:

[22:26:21,816][WARNING][jvm-pause-detector-worker][IgniteKernal%xxxxxx] Possible too long JVM pause: 55705 milliseconds.
...
[22:26:21,847][SEVERE][ttl-cleanup-worker-#48%xxxxxx%][G] Blocked system-critical thread has been detected. This can lead to cluster-wide undefined behaviour [threadName=partition-exchanger, blockedFor=57s]

Looks like the JVM was paused for almost a minute. It doesn't seem to have been caused by garbage collection, since there is no evidence of GC pressure in the GC log. Such big pauses usually happen in virtualised environments, when backups or snapshots are captured from the machines, or when the VMs simply don't get enough CPU time.

Looking at safepoint statistics may also reveal some interesting details. You can learn about safepoints here: https://blog.gceasy.io/2016/12/22/total-time-for-which-application-threads-were-stopped/

Denis
On 9 Oct 2019, 23:14 +0300, John Smith <ja...@gmail.com>, wrote:
> So the error says to set clientFailureDetectionTimeout=30000
>
> 1- Do I put a higher value than 30000?
> 2- Do I do it on the client or the server nodes or all nodes?
> 3- Also, if a client is misbehaving, why shut off the server node?
>
> > On Thu, 3 Oct 2019 at 21:02, John Smith <ja...@gmail.com> wrote:
> > > But if it's the client node that's failing, why is the server node stopping? I'm pretty sure we do very simple put and get operations. All the client nodes are started as client=true.
> > >
> > > > On Thu., Oct. 3, 2019, 4:18 p.m. Denis Magda, <dm...@apache.org> wrote:
> > > > > Hi John,
> > > > >
> > > > > I don't see any GC pressure or STW pauses either. If not GC, then it might have been caused by a network glitch or some long-running operation started by the app. These log statements show the details:
> > > > >
> > > > >
> > > > > [22:26:21,827][WARNING][tcp-disco-client-message-worker-#10%xxxxxx%][TcpDiscoverySpi] Client node considered as unreachable and will be dropped from cluster, because no metrics update messages received in interval: TcpDiscoverySpi.clientFailureDetectionTimeout() ms. It may be caused by network problems or long GC pause on client node, try to increase this parameter. [nodeId=b07182d0-bf70-4318-9fe3-d7d5228bd6ef, clientFailureDetectionTimeout=30000]
> > > > >
> > > > > [22:26:21,839][WARNING][tcp-disco-client-message-worker-#12%xxxxxx%][TcpDiscoverySpi] Client node considered as unreachable and will be dropped from cluster, because no metrics update messages received in interval: TcpDiscoverySpi.clientFailureDetectionTimeout() ms. It may be caused by network problems or long GC pause on client node, try to increase this parameter. [nodeId=302cff60-b88d-40da-9e12-b955e6bf973d, clientFailureDetectionTimeout=30000]
> > > > >
> > > > > [22:26:21,847][SEVERE][ttl-cleanup-worker-#48%xxxxxx%][G] Blocked system-critical thread has been detected. This can lead to cluster-wide undefined behaviour [threadName=partition-exchanger, blockedFor=57s]
> > > > >
> > > > > 22:26:21,954][SEVERE][ttl-cleanup-worker-#48%xxxxxx%][] Critical system error detected. Will be handled accordingly to configured handler [hnd=StopNodeOrHaltFailureHandler [tryStop=false, timeout=0, super=AbstractFailureHandler [ignoredFailureTypes=[SYSTEM_WORKER_BLOCKED]]], failureCtx=FailureContext [type=SYSTEM_WORKER_BLOCKED, err=class o.a.i.IgniteException: GridWorker [name=partition-exchanger, igniteInstanceName=xxxxxx, finished=false, heartbeatTs=1568931981805]]]
> > > > >
> > > > >
> > > > >
> > > > >
> > > > > -
> > > > > Denis
> > > > >
> > > > >
> > > > > > On Thu, Oct 3, 2019 at 11:50 AM John Smith <ja...@gmail.com> wrote:
> > > > > > > So I have been monitoring my node and the same one seems to stop once in a while.
> > > > > > >
> > > > > > > https://www.dropbox.com/s/7n5qfsl5uyi1obt/ignite-logs.zip?dl=0
> > > > > > >
> > > > > > > I have attached the GC logs and the ignite logs. From what I see from gc.logs I don't see big pauses. I could be wrong.
> > > > > > >
> > > > > > > The machine is 16GB and I have the configs here: https://www.dropbox.com/s/hkv38s3vce5a4sk/ignite-config.xml?dl=0
> > > > > > >
> > > > > > > Here are the JVM settings...
> > > > > > >
> > > > > > > if [ -z "$JVM_OPTS" ] ; then
> > > > > > >     JVM_OPTS="-Xms2g -Xmx2g -server -XX:MaxMetaspaceSize=256m"
> > > > > > > fi
> > > > > > >
> > > > > > > JVM_OPTS="$JVM_OPTS -XX:+UseG1GC -verbose:gc -XX:+PrintGCDetails -Xloggc:/var/log/apache-ignite/gc.log"
> > > > > > >
> > > > > > > JVM_OPTS="${JVM_OPTS} -Xss16m"

Re: Node stopped.

Posted by John Smith <ja...@gmail.com>.
So the error says to set clientFailureDetectionTimeout=30000

1- Do I put a higher value than 30000?
2- Do I do it on the client or the server nodes or all nodes?
3- Also, if a client is misbehaving, why shut off the server node?

On Thu, 3 Oct 2019 at 21:02, John Smith <ja...@gmail.com> wrote:

> But if it's the client node that's failing, why is the server node
> stopping? I'm pretty sure we do very simple put and get operations. All
> the client nodes are started as client=true.
>
> On Thu., Oct. 3, 2019, 4:18 p.m. Denis Magda, <dm...@apache.org> wrote:
>
>> Hi John,
>>
>> I don't see any GC pressure or STW pauses either. If not GC, then it might
>> have been caused by a network glitch or some long-running operation started
>> by the app. These log statements show the details:
>>
>>
>> [22:26:21,827][WARNING][tcp-disco-client-message-worker-#10%xxxxxx%][TcpDiscoverySpi]
>> Client node considered as unreachable and will be dropped from cluster,
>> because no metrics update messages received in interval:
>> TcpDiscoverySpi.clientFailureDetectionTimeout() ms. It may be caused by
>> network problems or long GC pause on client node, try to increase this
>> parameter. [nodeId=b07182d0-bf70-4318-9fe3-d7d5228bd6ef,
>> clientFailureDetectionTimeout=30000]
>>
>>
>> [22:26:21,839][WARNING][tcp-disco-client-message-worker-#12%xxxxxx%][TcpDiscoverySpi]
>> Client node considered as unreachable and will be dropped from cluster,
>> because no metrics update messages received in interval:
>> TcpDiscoverySpi.clientFailureDetectionTimeout() ms. It may be caused by
>> network problems or long GC pause on client node, try to increase this
>> parameter. [nodeId=302cff60-b88d-40da-9e12-b955e6bf973d,
>> clientFailureDetectionTimeout=30000]
>>
>>
>> [22:26:21,847][SEVERE][ttl-cleanup-worker-#48%xxxxxx%][G] Blocked
>> system-critical thread has been detected. This can lead to cluster-wide
>> undefined behaviour [threadName=partition-exchanger, blockedFor=57s]
>>
>>
>> 22:26:21,954][SEVERE][ttl-cleanup-worker-#48%xxxxxx%][] Critical system
>> error detected. Will be handled accordingly to configured handler
>> [hnd=StopNodeOrHaltFailureHandler [tryStop=false, timeout=0,
>> super=AbstractFailureHandler
>> [ignoredFailureTypes=[SYSTEM_WORKER_BLOCKED]]], failureCtx=FailureContext
>> [type=SYSTEM_WORKER_BLOCKED, err=class o.a.i.IgniteException: GridWorker
>> [name=partition-exchanger, igniteInstanceName=xxxxxx, finished=false,
>> heartbeatTs=1568931981805]]]
>>
>>
>>
>>
>> -
>> Denis
>>
>>
>> On Thu, Oct 3, 2019 at 11:50 AM John Smith <ja...@gmail.com>
>> wrote:
>>
>>> So I have been monitoring my node and the same one seems to stop once in
>>> a while.
>>>
>>> https://www.dropbox.com/s/7n5qfsl5uyi1obt/ignite-logs.zip?dl=0
>>>
>>> I have attached the GC logs and the ignite logs. From what I see from
>>> gc.logs I don't see big pauses. I could be wrong.
>>>
>>> The machine is 16GB and I have the configs here:
>>> https://www.dropbox.com/s/hkv38s3vce5a4sk/ignite-config.xml?dl=0
>>>
>>> Here are the JVM settings...
>>>
>>> if [ -z "$JVM_OPTS" ] ; then
>>>     JVM_OPTS="-Xms2g -Xmx2g -server -XX:MaxMetaspaceSize=256m"
>>> fi
>>>
>>> JVM_OPTS="$JVM_OPTS -XX:+UseG1GC -verbose:gc -XX:+PrintGCDetails
>>> -Xloggc:/var/log/apache-ignite/gc.log"
>>>
>>> JVM_OPTS="${JVM_OPTS} -Xss16m"
>>>
>>

Re: Node stopped.

Posted by John Smith <ja...@gmail.com>.
But if it's the client node that's failing, why is the server node stopping?
I'm pretty sure we do very simple put and get operations. All the client
nodes are started as client=true.
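
The application code itself isn't in this thread, so purely for context,
here is a minimal sketch of what "simple put and get operations" from a node
started with client=true typically look like (the cache name and key/value
types below are made up for illustration, not taken from our actual setup):

import org.apache.ignite.Ignite;
import org.apache.ignite.IgniteCache;
import org.apache.ignite.Ignition;
import org.apache.ignite.configuration.IgniteConfiguration;

public class ClientPutGet {
    public static void main(String[] args) {
        IgniteConfiguration cfg = new IgniteConfiguration();
        cfg.setClientMode(true); // the "client=true" mentioned above

        try (Ignite client = Ignition.start(cfg)) {
            // "someCache" is a placeholder name, not an actual cache from this setup.
            IgniteCache<Integer, String> cache = client.getOrCreateCache("someCache");
            cache.put(1, "value");
            String value = cache.get(1);
            System.out.println(value);
        }
    }
}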

On Thu., Oct. 3, 2019, 4:18 p.m. Denis Magda, <dm...@apache.org> wrote:

> Hi John,
>
> I don't see any GC pressure or STW pauses either. If not GC, then it might
> have been caused by a network glitch or some long-running operation started
> by the app. These log statements show the details:
>
>
> [22:26:21,827][WARNING][tcp-disco-client-message-worker-#10%xxxxxx%][TcpDiscoverySpi]
> Client node considered as unreachable and will be dropped from cluster,
> because no metrics update messages received in interval:
> TcpDiscoverySpi.clientFailureDetectionTimeout() ms. It may be caused by
> network problems or long GC pause on client node, try to increase this
> parameter. [nodeId=b07182d0-bf70-4318-9fe3-d7d5228bd6ef,
> clientFailureDetectionTimeout=30000]
>
>
> [22:26:21,839][WARNING][tcp-disco-client-message-worker-#12%xxxxxx%][TcpDiscoverySpi]
> Client node considered as unreachable and will be dropped from cluster,
> because no metrics update messages received in interval:
> TcpDiscoverySpi.clientFailureDetectionTimeout() ms. It may be caused by
> network problems or long GC pause on client node, try to increase this
> parameter. [nodeId=302cff60-b88d-40da-9e12-b955e6bf973d,
> clientFailureDetectionTimeout=30000]
>
>
> [22:26:21,847][SEVERE][ttl-cleanup-worker-#48%xxxxxx%][G] Blocked
> system-critical thread has been detected. This can lead to cluster-wide
> undefined behaviour [threadName=partition-exchanger, blockedFor=57s]
>
>
> 22:26:21,954][SEVERE][ttl-cleanup-worker-#48%xxxxxx%][] Critical system
> error detected. Will be handled accordingly to configured handler
> [hnd=StopNodeOrHaltFailureHandler [tryStop=false, timeout=0,
> super=AbstractFailureHandler
> [ignoredFailureTypes=[SYSTEM_WORKER_BLOCKED]]], failureCtx=FailureContext
> [type=SYSTEM_WORKER_BLOCKED, err=class o.a.i.IgniteException: GridWorker
> [name=partition-exchanger, igniteInstanceName=xxxxxx, finished=false,
> heartbeatTs=1568931981805]]]
>
>
>
>
> -
> Denis
>
>
> On Thu, Oct 3, 2019 at 11:50 AM John Smith <ja...@gmail.com> wrote:
>
>> So I have been monitoring my node and the same one seems to stop once in
>> a while.
>>
>> https://www.dropbox.com/s/7n5qfsl5uyi1obt/ignite-logs.zip?dl=0
>>
>> I have attached the GC logs and the ignite logs. From what I see from
>> gc.logs I don't see big pauses. I could be wrong.
>>
>> The machine is 16GB and I have the configs here:
>> https://www.dropbox.com/s/hkv38s3vce5a4sk/ignite-config.xml?dl=0
>>
>> Here are the JVM settings...
>>
>> if [ -z "$JVM_OPTS" ] ; then
>>     JVM_OPTS="-Xms2g -Xmx2g -server -XX:MaxMetaspaceSize=256m"
>> fi
>>
>> JVM_OPTS="$JVM_OPTS -XX:+UseG1GC -verbose:gc -XX:+PrintGCDetails
>> -Xloggc:/var/log/apache-ignite/gc.log"
>>
>> JVM_OPTS="${JVM_OPTS} -Xss16m"
>>
>

Re: Node stopped.

Posted by Denis Magda <dm...@apache.org>.
Hi John,

I don't see any GC pressure or STW pauses either. If not GC, then it might
have been caused by a network glitch or some long-running operation started
by the app. These log statements show the details:


[22:26:21,827][WARNING][tcp-disco-client-message-worker-#10%xxxxxx%][TcpDiscoverySpi]
Client node considered as unreachable and will be dropped from cluster,
because no metrics update messages received in interval:
TcpDiscoverySpi.clientFailureDetectionTimeout() ms. It may be caused by
network problems or long GC pause on client node, try to increase this
parameter. [nodeId=b07182d0-bf70-4318-9fe3-d7d5228bd6ef,
clientFailureDetectionTimeout=30000]


[22:26:21,839][WARNING][tcp-disco-client-message-worker-#12%xxxxxx%][TcpDiscoverySpi]
Client node considered as unreachable and will be dropped from cluster,
because no metrics update messages received in interval:
TcpDiscoverySpi.clientFailureDetectionTimeout() ms. It may be caused by
network problems or long GC pause on client node, try to increase this
parameter. [nodeId=302cff60-b88d-40da-9e12-b955e6bf973d,
clientFailureDetectionTimeout=30000]


[22:26:21,847][SEVERE][ttl-cleanup-worker-#48%xxxxxx%][G] Blocked
system-critical thread has been detected. This can lead to cluster-wide
undefined behaviour [threadName=partition-exchanger, blockedFor=57s]


22:26:21,954][SEVERE][ttl-cleanup-worker-#48%xxxxxx%][] Critical system
error detected. Will be handled accordingly to configured handler
[hnd=StopNodeOrHaltFailureHandler [tryStop=false, timeout=0,
super=AbstractFailureHandler
[ignoredFailureTypes=[SYSTEM_WORKER_BLOCKED]]], failureCtx=FailureContext
[type=SYSTEM_WORKER_BLOCKED, err=class o.a.i.IgniteException: GridWorker
[name=partition-exchanger, igniteInstanceName=xxxxxx, finished=false,
heartbeatTs=1568931981805]]]
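
Both knobs mentioned in these logs are ordinary IgniteConfiguration
properties on the server nodes: clientFailureDetectionTimeout (it is the
servers that watch the clients' metrics updates) and the failure handler
(the default StopNodeOrHaltFailureHandler is the one named in the
critical-error message above). A minimal sketch in Java follows; since the
configuration here is kept in an XML file, treat the 60-second value and
the handler choice as illustrative assumptions rather than a recommendation
for this exact setup:

import org.apache.ignite.Ignition;
import org.apache.ignite.configuration.IgniteConfiguration;
import org.apache.ignite.failure.StopNodeFailureHandler;

public class ServerNodeStartup {
    public static void main(String[] args) {
        IgniteConfiguration cfg = new IgniteConfiguration();

        // Give client nodes more headroom before the servers drop them.
        // 60000 ms is an arbitrary example; the default shown in the warning is 30000 ms.
        cfg.setClientFailureDetectionTimeout(60000);

        // Optionally replace the default StopNodeOrHaltFailureHandler with a
        // handler that only stops the local node instead of halting the JVM.
        cfg.setFailureHandler(new StopNodeFailureHandler());

        Ignition.start(cfg);
    }
}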




-
Denis


On Thu, Oct 3, 2019 at 11:50 AM John Smith <ja...@gmail.com> wrote:

> So I have been monitoring my node and the same one seems to stop once in a
> while.
>
> https://www.dropbox.com/s/7n5qfsl5uyi1obt/ignite-logs.zip?dl=0
>
> I have attached the GC logs and the ignite logs. From what I see from
> gc.logs I don't see big pauses. I could be wrong.
>
> The machine is 16GB and I have the configs here:
> https://www.dropbox.com/s/hkv38s3vce5a4sk/ignite-config.xml?dl=0
>
> Here are the JVM settings...
>
> if [ -z "$JVM_OPTS" ] ; then
>     JVM_OPTS="-Xms2g -Xmx2g -server -XX:MaxMetaspaceSize=256m"
> fi
>
> JVM_OPTS="$JVM_OPTS -XX:+UseG1GC -verbose:gc -XX:+PrintGCDetails
> -Xloggc:/var/log/apache-ignite/gc.log"
>
> JVM_OPTS="${JVM_OPTS} -Xss16m"
>