Posted to user@ignite.apache.org by "Lo, Marcus " <ma...@citi.com> on 2022/01/28 02:51:48 UTC
Ignite node crash
Hi Ignite team,
We are using Ignite 2.10.0 and have a 5-node Ignite cluster with persistence enabled. The nodes have the following node IDs and consistent IDs:
* 01p - node id=ee035a96, consistent id=lrdeqprmap01p
* 02p - node id=81d7df57, consistent id=lrdeqprmap02p
* 03p - node id=3a275472, consistent id=lrdeqprmap03p
* 03c - node id=e8c54e6d, consistent id=lcgeqprmap03c
* 04c - node id=de3959cf, consistent id=lcgeqprmap04c
One of the nodes, 03c, crashed one day. We would like to figure out the root cause of the crash. I checked the logs and found the following:
* From the 03c log: 03c tried to connect to 04c multiple times starting at 18:49:56, but every attempt failed. Eventually the node decided it was segmented and killed itself due to a critical system error.
* From the 04c log: 04c rejected all connections from 03c from 18:49:56 onward, because 04c considered 03c failed and treated it as an unknown node.
* In 04c, there were many "Possible starvation in striped pool" warnings from 18:35:15 onward.
* In 04c, many TCP clients were created, trying to connect to 02p, from 18:33:51 onward. At the same time, 02p logged many "Received incoming connection when already connected to this node, rejecting" messages for 04p.
* I can confirm that there was no network outage between the nodes.
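For anyone who wants to line up findings like these across nodes, here is a minimal sketch that records when each message first appears in each node's log and sorts the results into one timeline. It assumes every log line starts with a timestamp such as "2022-01-25 18:32:53.858+0000"; the helper names (first_seen, timeline) and the labels are made up for illustration and are not part of Ignite.

```python
import re
from datetime import datetime

# Assumption: each log line starts with "YYYY-MM-DD HH:MM:SS.mmm" (zone suffix ignored).
TS = re.compile(r"^(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}\.\d{3})")

# Message fragments taken from the findings above. The first is a prefix match
# so that it tolerates spelling variants of the pool name.
PATTERNS = {
    "starvation": "Possible starvation in strip",
    "rejected": "Received incoming connection when already connected to this node",
}

def first_seen(lines, patterns=PATTERNS):
    """Return {label: datetime of the first line containing that fragment}."""
    seen = {}
    for line in lines:
        m = TS.match(line)
        if not m:
            continue  # continuation line (e.g. stack trace) without a timestamp
        ts = datetime.strptime(m.group(1), "%Y-%m-%d %H:%M:%S.%f")
        for label, fragment in patterns.items():
            if label not in seen and fragment in line:
                seen[label] = ts
    return seen

def timeline(logs):
    """logs maps node name -> iterable of lines; returns sorted (ts, node, label)."""
    events = []
    for node, lines in logs.items():
        for label, ts in first_seen(lines).items():
            events.append((ts, node, label))
    return sorted(events)
```

It could be called as timeline({"04c": open("04c.log"), "02p": open("02p.log")}), where the file names are placeholders for wherever the node logs live. Seeing which warning appears first on which node may help separate cause from symptom.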
I have attached the logs and our Ignite XML config for your information. Could you please help investigate? Thanks.
Regards,
Marcus
Re: Ignite node crash
Posted by Zhenya Stanilovsky <ar...@mail.ru>.
Hi, at first glance you really do have network problems; check 04c.log:
2022-01-25 18:32:53.858+0000 WARN [grid-nio-worker-tcp-comm-2-#25%TcpCommunicationSpi%] o.a.i.s.c.t.TcpCommunicationSpi : Communication SPI session write timed out (consider increasing 'socketWriteTimeout' configuration property) [remoteAddr=/169.182.110.132:36364, writeTimeout=2000]
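To see whether warnings like this one cluster around a particular peer, a small sketch along the following lines could count them per remote address. The regular expression is modelled only on the single line above, so it assumes other occurrences have the same shape; the function name write_timeouts_by_host is made up for illustration.

```python
import re
from collections import Counter

# Modelled on the one warning line shown above; other log layouts are not handled.
WRITE_TIMEOUT = re.compile(
    r"Communication SPI session write timed out.*?"
    r"\[remoteAddr=/(?P<host>[\d.]+):(?P<port>\d+), writeTimeout=(?P<timeout>\d+)\]"
)

def write_timeouts_by_host(lines):
    """Count write-timeout warnings per remote host (the port is ignored)."""
    counts = Counter()
    for line in lines:
        m = WRITE_TIMEOUT.search(line)
        if m:
            counts[m.group("host")] += 1
    return counts
```

Running it over 04c.log, for example write_timeouts_by_host(open("04c.log")), would show which remote addresses the timed-out sessions belonged to. A count concentrated on one address would point at that link or peer rather than at the cluster as a whole.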
FW: Ignite node crash
Posted by "Lo, Marcus " <ma...@citi.com>.