Posted to user@ignite.apache.org by Mahesh Renduchintala <ma...@aline-consulting.com> on 2019/07/05 04:37:25 UTC

ignite cluster lock up

Hi,


we have 10 clients (thick) connected to an Ignite cluster (2 nodes, 16 threads each, plenty of RAM).

These clients are expected to stay connected indefinitely.

New clients (thick) keep coming in, do a few queries and then they go out.

All of this works fine for some time - a few hours.


Then what we notice is that Ignite suddenly gets into a lockup state.

New clients cannot connect, and the old clients (the 10 mentioned above) cannot fetch data, etc.


The only way to get out of this lockup is to reboot those 10 clients one after the other.

When some client in that list of 10 is rebooted, the lockup goes away and everything works fine.


Attached are the logs.









Re: ignite cluster lock up

Posted by Ilya Kasnacheev <il...@gmail.com>.
Hello!

When full GC is running, all threads are effectively blocked. This is why
it's named 'GC pause'.
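
One way to confirm that the lockup windows really line up with GC is to look
at the collector counters, either in the GC logs or via the standard
java.lang.management API. A minimal, Ignite-agnostic sketch:

    import java.lang.management.GarbageCollectorMXBean;
    import java.lang.management.ManagementFactory;

    public class GcPauseCheck {
        public static void main(String[] args) {
            // Cumulative per-collector stats for this JVM since startup.
            for (GarbageCollectorMXBean gc :
                    ManagementFactory.getGarbageCollectorMXBeans()) {
                System.out.printf("%s: collections=%d, total time=%d ms%n",
                    gc.getName(), gc.getCollectionCount(), gc.getCollectionTime());
            }
        }
    }

Sample these numbers (or poll the same beans over JMX) before and after a
lockup; a large jump in the total collection time points at a long full GC.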

Regards,
-- 
Ilya Kasnacheev


Sat, 6 Jul 2019 at 12:33, Mahesh Renduchintala <
mahesh.renduchintala@aline-consulting.com>:

> We are now testing by increasing failureDetectionTimeout values
>
>
> Even if a full GC is running, why are Ignite system threads blocked?
>
> Why aren't Ignite system threads free to accept new connections?
>
> Why exactly would rebooting a few of the previously connected nodes reset
> everything?
>
>
> There could be something else as well.
>
>

Re: ignite cluster lock up

Posted by Mahesh Renduchintala <ma...@aline-consulting.com>.
We are now testing by increasing failureDetectionTimeout values


Even if a full GC is running, why are Ignite system threads blocked?

Why aren't Ignite system threads free to accept new connections?

Why exactly would rebooting a few of the previously connected nodes reset everything?


There could be something else as well.

Re: ignite cluster lock up

Posted by Ilya Kasnacheev <il...@gmail.com>.
Hello!

64G is a massive amount of heap. You should definitely increase all timeouts
if you have more than, let's say, 16G.

Full GCs have to happen sometime, and they will be long.
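
For example, a minimal sketch of the server-side change in Java (Ignite 2.x
assumed; the 60-second values are only illustrative, pick something longer
than your worst observed pause):

    import org.apache.ignite.Ignition;
    import org.apache.ignite.configuration.IgniteConfiguration;

    public class ServerNode {
        public static void main(String[] args) {
            IgniteConfiguration cfg = new IgniteConfiguration();

            // The defaults (10 s for servers, 30 s for clients) are easily
            // exceeded by a full GC on a large heap, after which the node is
            // dropped from the topology.
            cfg.setFailureDetectionTimeout(60_000);
            cfg.setClientFailureDetectionTimeout(60_000);

            Ignition.start(cfg);
        }
    }

The same two properties can also be set on the IgniteConfiguration bean in
the Spring XML config files you attached.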

Regards,
-- 
Ilya Kasnacheev


Fri, 5 Jul 2019 at 16:16, Mahesh Renduchintala <
mahesh.renduchintala@aline-consulting.com>:

> The long JVM pauses are probably due to the long time taken by GC...
>
> The -Xmx parameter is 64GB for me.
>
> Should I be using more aggressive GC parameters to free up the heap more
> quickly on the server node?
>
>
> I am using the recommended JVM options from the Ignite website.
>
> https://apacheignite.readme.io/docs/jvm-and-system-tuning#garbage-collection-tuning
>
>
>
>

Re: ignite cluster lock up

Posted by Mahesh Renduchintala <ma...@aline-consulting.com>.
The long JVM pauses are probably due to the long time taken by GC...

The -Xmx parameter is 64GB for me.

Should I be using more aggressive GC parameters to free up the heap more quickly on the server node?


I am using the recommended JVM options from the Ignite website.

https://apacheignite.readme.io/docs/jvm-and-system-tuning#garbage-collection-tuning
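
For reference, the G1 options I understand that page to recommend look
roughly like this (a rough sketch, not copied verbatim; the heap size just
mirrors the -Xmx above):

    -server
    -Xms64g
    -Xmx64g
    -XX:+AlwaysPreTouch
    -XX:+UseG1GC
    -XX:+ScavengeBeforeFullGC
    -XX:+DisableExplicitGC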




Re: ignite cluster lock up

Posted by Ilya Kasnacheev <il...@gmail.com>.
Hello!

It looks to me like the node dropped out due to a long GC pause or something
similar. Try increasing failureDetectionTimeout on the server nodes if you
expect long pauses.

Regards,
-- 
Ilya Kasnacheev


Fri, 5 Jul 2019 at 07:43, Mahesh Renduchintala <
mahesh.renduchintala@aline-consulting.com>:

> Attached are the config files of the server and the client.
>
>
> ------------------------------
> *From:* Mahesh Renduchintala
> *Sent:* Friday, July 5, 2019 12:37 AM
> *To:* user@ignite.apache.org
> *Subject:* ignite cluster lock up
>
>
> Hi,
>
>
> we have 10 clients (thick) connected to an Ignite cluster (2 nodes, 16
> threads each, plenty of RAM).
>
> These clients are expected to stay connected indefinitely.
>
> New clients (thick) keep coming in, do a few queries and then they go out.
>
> All of this works fine for some time - a few hours.
>
>
> Then what we notice is that Ignite suddenly gets into a lockup state.
>
> New clients cannot connect, and the old clients (the 10 mentioned above)
> cannot fetch data, etc.
>
>
> The only way to get out of this lockup is to reboot those 10 clients one
> after the other.
>
> When some client in that list of 10 is rebooted, the lockup goes away and
> everything works fine.
>
>
> Attached are the logs.
>
>
>
>
>
>
>
>
>
>

Re: ignite cluster lock up

Posted by Mahesh Renduchintala <ma...@aline-consulting.com>.
Attached are the config files of the server and the client.


________________________________
From: Mahesh Renduchintala
Sent: Friday, July 5, 2019 12:37 AM
To: user@ignite.apache.org
Subject: ignite cluster lock up


Hi,


we have 10 clients (thick) connected to an Ignite cluster (2 nodes, 16 threads each, plenty of RAM).

These clients are expected to stay connected indefinitely.

New clients (thick) keep coming in, do a few queries and then they go out.

All of this works fine for some time - a few hours.


Then what we notice is that Ignite suddenly gets into a lockup state.

New clients cannot connect, and the old clients (the 10 mentioned above) cannot fetch data, etc.


The only way to get out of this lockup is to reboot those 10 clients one after the other.

When some client in that list of 10 is rebooted, the lockup goes away and everything works fine.


Attached are the logs.