Posted to user@ignite.apache.org by "Lo, Marcus " <ma...@citi.com> on 2022/01/28 02:51:48 UTC

Ignite node crash

Hi Ignite team,

We are using Ignite 2.10.0 and have a 5-node Ignite cluster with persistence enabled. The nodes have the following node IDs and consistent IDs:

  *   01p - node id=ee035a96, consistent id=lrdeqprmap01p
  *   02p - node id=81d7df57, consistent id=lrdeqprmap02p
  *   03p - node id=3a275472, consistent id=lrdeqprmap03p
  *   03c - node id=e8c54e6d, consistent id=lcgeqprmap03c
  *   04c - node id=de3959cf, consistent id=lcgeqprmap04c

One of the nodes, 03c, crashed one day, and we would like to figure out the root cause. I checked the logs and found the following:


  *   From the 03c log, 03c tried to connect to 04c multiple times starting at 18:49:56, but every attempt failed. Eventually the node decided it was segmented and killed itself due to a critical system error.
  *   From the 04c log, 04c had been rejecting all connections from 03c since 18:49:56, because 04c considered 03c failed and treated it as an unknown node.
  *   In 04c, there were many "Possible starvation in striped pool" warnings from 18:35:15 onward.
  *   In 04c, many TCP clients were created trying to connect to 02p from 18:33:51 onward. At the same time, 02p logged many "Received incoming connection when already connected to this node, rejecting" messages for 04c.
  *   I can confirm that there was no network outage between the nodes.

I have attached the logs and our Ignite XML config for reference. Could you please help investigate? Thanks.

Regards,
Marcus

Re: Ignite node crash

Posted by Zhenya Stanilovsky <ar...@mail.ru>.

Hi, at first glance you really do have network problems; check 04c.log:
2022-01-25 18:32:53.858+0000 WARN [grid-nio-worker-tcp-comm-2-#25%TcpCommunicationSpi%] o.a.i.s.c.t.TcpCommunicationSpi          : Communication SPI session write timed out (consider increasing 'socketWriteTimeout' configuration property) [remoteAddr=/169.182.110.132:36364, writeTimeout=2000]
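
To see how this warning lines up in time with the other messages mentioned in the thread, the logs from several nodes can be merged into one ordered timeline. The following is only a rough sketch: it assumes every log uses the same line layout as the single 04c.log line quoted above, and the names `parse_line`, `timeline`, `PATTERNS` and `SAMPLE` are made up for illustration, not part of Ignite.

```python
import re
from datetime import datetime

# Layout assumed from the one 04c.log line quoted in this thread:
# "<date> <time><zone> <LEVEL> [<thread>] <logger> : <message>"
LINE_RE = re.compile(
    r"^(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}\.\d{3}[+-]\d{4})\s+"
    r"(?P<level>[A-Z]+)\s+"
    r"\[(?P<thread>[^\]]*)\]\s+"
    r"(?P<logger>\S+)\s+:\s+"
    r"(?P<msg>.*)$"
)

# Message fragments taken from the messages quoted in this thread.
PATTERNS = (
    "session write timed out",
    "Possible starvation in",
    "Received incoming connection when already connected",
)


def parse_line(line):
    """Return (timestamp, level, message), or None if the line does not match."""
    m = LINE_RE.match(line.rstrip("\n"))
    if m is None:
        return None
    ts = datetime.strptime(m.group("ts"), "%Y-%m-%d %H:%M:%S.%f%z")
    return ts, m.group("level"), m.group("msg")


def timeline(named_logs):
    """Merge matching events from several logs, ordered by timestamp.

    named_logs maps a node label (e.g. "04c") to an iterable of log lines.
    """
    events = []
    for node, lines in named_logs.items():
        for line in lines:
            parsed = parse_line(line)
            if parsed is None:
                continue  # stack traces and other continuation lines
            ts, level, msg = parsed
            if any(p in msg for p in PATTERNS):
                events.append((ts, node, level, msg))
    events.sort(key=lambda e: e[0])
    return events


# The 04c.log line quoted above, used as sample input.
SAMPLE = (
    "2022-01-25 18:32:53.858+0000 WARN "
    "[grid-nio-worker-tcp-comm-2-#25%TcpCommunicationSpi%] "
    "o.a.i.s.c.t.TcpCommunicationSpi          : "
    "Communication SPI session write timed out (consider increasing "
    "'socketWriteTimeout' configuration property) "
    "[remoteAddr=/169.182.110.132:36364, writeTimeout=2000]"
)

if __name__ == "__main__":
    for ts, node, level, msg in timeline({"04c": [SAMPLE]}):
        print(ts.isoformat(), node, level, msg[:60])
```

In practice each value passed to `timeline` would be an open file handle for one node's log rather than a list holding a single sample line.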
 
