Posted to user@ignite.apache.org by Akash Shinde <ak...@gmail.com> on 2019/11/27 11:29:40 UTC

Local node terminated after segmentation

Hi,

I have started four server nodes. One of the nodes was terminated
unexpectedly with the error below. Before the JVM terminated, the node
was segmented.

1) Does Ignite always treat node segmentation as a "Critical system error"
and use "StopNodeOrHaltFailureHandler" to take the required action, which
is "Terminate Node" in this case?

2) Are there any other reasons for the "Critical system error detected"
message?

I have not set the SegmentationPolicy explicitly. AFAIK Ignite does not
provide a SegmentationResolver and SegmentationPolicy out of the box.

3) Do I need to implement a SegmentationResolver and set the
SegmentationPolicy to "STOP" if I want to stop the JVM when the node is
segmented?

4) I am starting Ignite in embedded mode. When a node is segmented I want
to restart the JVM. Is there any way to do this? (I am not using
ignite.sh/ignite.bat to start Ignite.)

Please find attached logs.

Exception:

2019-11-27 08:30:46,992 9321188 [disco-event-worker-#61%springDataNode%] WARN  o.a.i.i.m.d.GridDiscoveryManager - Local node SEGMENTED: TcpDiscoveryNode [id=b4fce076-cc7a-47ee-98fd-31e1d610b5de, addrs=[10.45.65.97, 127.0.0.1], sockAddrs=[/10.45.65.97:47500, /127.0.0.1:47500], discPort=47500, order=1, intOrder=1, lastExchangeTime=1574843446983, loc=true, ver=2.6.0#20180710-sha1:669feacc, isClient=false]
2019-11-27 08:30:46,994 9321190 [tcp-disco-srvr-#3%springDataNode%] ERROR  - Critical system error detected. Will be handled accordingly to configured handler [hnd=class o.a.i.failure.StopNodeOrHaltFailureHandler, failureCtx=FailureContext [type=SYSTEM_WORKER_TERMINATION, err=java.lang.IllegalStateException: Thread tcp-disco-srvr-#3%springDataNode% is terminated unexpectedly.]]
java.lang.IllegalStateException: Thread tcp-disco-srvr-#3%springDataNode% is terminated unexpectedly.
        at org.apache.ignite.spi.discovery.tcp.ServerImpl$TcpServer.body(ServerImpl.java:5686)
        at org.apache.ignite.spi.IgniteSpiThread.run(IgniteSpiThread.java:62)
2019-11-27 08:30:46,995 9321191 [tcp-disco-srvr-#3%springDataNode%] ERROR  - JVM will be halted immediately due to the failure: [failureCtx=FailureContext [type=SYSTEM_WORKER_TERMINATION, err=java.lang.IllegalStateException: Thread tcp-disco-srvr-#3%springDataNode% is terminated unexpectedly.]]
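
On question 4, one option that does not depend on ignite.sh is an external supervisor process that relaunches the embedded application whenever its JVM exits abnormally, for example after a failure handler halts it. What follows is only a minimal sketch under stated assumptions: NodeSupervisor, the launch limit, and the rule that exit code 0 means a clean shutdown are all hypothetical choices made for illustration, not Ignite behaviour.

```java
import java.io.IOException;
import java.util.function.IntSupplier;

/** Hypothetical supervisor: relaunches a child JVM until it exits cleanly. */
final class NodeSupervisor {
    /**
     * Calls launchAndWait repeatedly while it returns a non-zero exit code,
     * at most maxLaunches times. Returns the number of launches performed.
     */
    static int supervise(IntSupplier launchAndWait, int maxLaunches) {
        int launches = 0;
        while (launches < maxLaunches) {
            launches++;
            int exitCode = launchAndWait.getAsInt();
            if (exitCode == 0) {
                break; // assumed convention: 0 means clean shutdown, no restart
            }
        }
        return launches;
    }

    /** Launcher that starts a real child process and waits for it to exit. */
    static IntSupplier processLauncher(String... command) {
        return () -> {
            try {
                Process p = new ProcessBuilder(command).inheritIO().start();
                return p.waitFor();
            } catch (IOException e) {
                return -1; // a failed launch counts as an abnormal exit
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                return 0; // stop supervising when the supervisor is interrupted
            }
        };
    }
}
```

A call such as NodeSupervisor.supervise(NodeSupervisor.processLauncher("java", "-jar", "app.jar"), 5) would start the application up to five times; app.jar is a placeholder for the embedded Ignite application.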

Re: Local node terminated after segmentation

Posted by VeenaMithare <v....@cmcmarkets.com>.

Thanks Evgenii 



--
Sent from: http://apache-ignite-users.70518.x6.nabble.com/

Re: Local node terminated after segmentation

Posted by Evgenii Zhuravlev <e....@gmail.com>.

Hi Veena,

There is a message in the logs:
 [WARNING][jvm-pause-detector-worker][IgniteKernal] Possible too long JVM
pause: 8023 milliseconds.

In most cases, this is a sign of a long GC pause. Such a JVM pause can
also be caused by problems with a virtual environment or something else,
but usually it's GC. You can collect GC logs to confirm that GC is the
cause.
If the JVM pause is longer than failureDetectionTimeout, the node can be
kicked out of the cluster.

Evgenii
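
The idea behind a pause warning like the one above can be sketched as follows: a thread sleeps for a short fixed interval and treats any large overshoot as time during which the whole JVM was stalled. This is a minimal illustration, not Ignite's own detector; PauseDetector and its parameters are hypothetical names, and the threshold is something you would pick relative to your failureDetectionTimeout.

```java
import java.util.concurrent.atomic.AtomicLong;

/** Hypothetical pause detector: measures how far a short sleep overshoots. */
final class PauseDetector implements Runnable {
    private final long intervalMs;
    private final long thresholdMs;
    private final AtomicLong longestPauseMs = new AtomicLong();
    private volatile boolean running = true;

    PauseDetector(long intervalMs, long thresholdMs) {
        this.intervalMs = intervalMs;
        this.thresholdMs = thresholdMs;
    }

    /** Pause implied by a sleep of intervalMs that actually took elapsedMs. */
    static long pauseOf(long elapsedMs, long intervalMs) {
        return Math.max(0, elapsedMs - intervalMs);
    }

    @Override
    public void run() {
        long last = System.nanoTime();
        while (running) {
            try {
                Thread.sleep(intervalMs);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                return;
            }
            long now = System.nanoTime();
            long pause = pauseOf((now - last) / 1_000_000, intervalMs);
            last = now;
            // Remember the worst pause seen so far.
            longestPauseMs.accumulateAndGet(pause, Math::max);
            if (pause >= thresholdMs) {
                System.err.println("Possible long JVM pause: " + pause + " ms");
            }
        }
    }

    void stop() {
        running = false;
    }

    long longestPauseMs() {
        return longestPauseMs.get();
    }
}
```

Running it on a daemon thread, for example with new PauseDetector(50, 500), and comparing longestPauseMs() with the configured failureDetectionTimeout gives a rough indication of whether pauses alone could explain a node being dropped. GC logs remain the way to confirm that GC is the cause.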

On Mon, Apr 13, 2020 at 15:32, VeenaMithare <v....@cmcmarkets.com> wrote:

> Hi Ilya,
>
> How can a node reachability resolver or TCP segmentation resolver help in
> discovering segmentation due to GC pauses? What is the best way to
> discover segmentation on a node due to GC pauses?
>
>
>
> regards,
> Veena
>
>
>
> --
> Sent from: http://apache-ignite-users.70518.x6.nabble.com/
>

Re: Local node terminated after segmentation

Posted by VeenaMithare <v....@cmcmarkets.com>.

Hi Ilya, 

How can a node reachability resolver or TCP segmentation resolver help in
discovering segmentation due to GC pauses? What is the best way to discover
segmentation on a node due to GC pauses?



regards,
Veena
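
For context, a reachability-style check of the kind discussed in this thread answers one question: can this host still open a connection to a known address? The sketch below assumes a hypothetical TcpReachabilityCheck class. It borrows the isValidSegment() method name mentioned elsewhere in the thread but does not implement any Ignite interface, and the host, port and timeout are placeholders. A check like this probes the network path only, so on its own it would not reveal a pause inside the local JVM.

```java
import java.io.IOException;
import java.net.InetSocketAddress;
import java.net.Socket;

/** Hypothetical reachability check: can we open a TCP connection in time? */
final class TcpReachabilityCheck {
    private final String host;
    private final int port;
    private final int timeoutMs;

    TcpReachabilityCheck(String host, int port, int timeoutMs) {
        this.host = host;
        this.port = port;
        this.timeoutMs = timeoutMs;
    }

    /** Returns true if the target accepted a connection within the timeout. */
    boolean isValidSegment() {
        try (Socket socket = new Socket()) {
            socket.connect(new InetSocketAddress(host, port), timeoutMs);
            return true;
        } catch (IOException e) {
            // Refused, unreachable or timed out: treat the segment as invalid.
            return false;
        }
    }
}
```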




Re: Local node terminated after segmentation

Posted by Ilya Kasnacheev <il...@gmail.com>.

Hello!

Personally, I've never seen a split brain. We recommend having collocated
clusters, in which case nodes will only fail one by one as opposed to
forming a segmented cluster.
But if you are really concerned about split brain, you can use
ZooKeeper-based discovery, since ZooKeeper has built-in split brain
protection that you can rely on.

Regards,
-- 
Ilya Kasnacheev
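
A configuration sketch for ZooKeeper-based discovery, assuming the optional ignite-zookeeper module is on the classpath. The connection string, root path and timeout values are placeholders chosen for illustration, not recommendations from this thread.

```java
import org.apache.ignite.Ignite;
import org.apache.ignite.Ignition;
import org.apache.ignite.configuration.IgniteConfiguration;
import org.apache.ignite.spi.discovery.zk.ZookeeperDiscoverySpi;

final class ZkDiscoveryExample {
    public static void main(String[] args) {
        ZookeeperDiscoverySpi zkSpi = new ZookeeperDiscoverySpi();

        // Placeholder addresses of the ZooKeeper ensemble.
        zkSpi.setZkConnectionString("zk1:2181,zk2:2181,zk3:2181");
        // ZooKeeper session timeout in milliseconds (illustrative value).
        zkSpi.setSessionTimeout(30_000);
        // Root znode under which the cluster keeps its discovery data.
        zkSpi.setZkRootPath("/ignite");
        // How long a node waits to join the cluster, in milliseconds.
        zkSpi.setJoinTimeout(10_000);

        IgniteConfiguration cfg = new IgniteConfiguration();
        cfg.setDiscoverySpi(zkSpi);

        Ignite ignite = Ignition.start(cfg);
        System.out.println("Started node: " + ignite.cluster().localNode().id());
    }
}
```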


On Tue, Dec 24, 2019 at 14:37, Akash Shinde <ak...@gmail.com> wrote:

> Can someone please help me on this?
>
> On Thu, Dec 12, 2019 at 1:11 PM Akash Shinde <ak...@gmail.com>
> wrote:
>
>> Hi,
>>
>> Can you please explain at a high level how the GridGain implementation
>> protects against having two segments alive at the same time, which
>> could lead to data inconsistency over time? What exactly does it do to
>> achieve this?
>>
>> Regards,
>> A.
>>
>> On Wed, Dec 11, 2019 at 5:48 PM Stanislav Lukyanov <
>> stanlukyanov@gmail.com> wrote:
>>
>>> In Ignite a node can go into the "segmented" state in two cases:
>>> 1. The node was unavailable (sleeping, hanging in a full GC, etc.) for a
>>> long time.
>>> 2. The cluster detected a possible split-brain situation and marked the
>>> node as "segmented".
>>>
>>> Yes, split-brain protection (in GridGain implementation and in theory
>>> too) doesn't protect your node from stopping. It protects you from having
>>> two segments that are alive at the same time which could lead to data
>>> inconsistency over time.
>>>
>>> Regarding discovery and large clusters: if your cluster is too big for
>>> the ring-based TcpDiscoverySpi to work well, you should use ZooKeeper
>>> Discovery, which was created specifically to support large clusters.
>>>
>>> Stan
>>>
>>> On Mon, Dec 9, 2019 at 4:02 PM Prasad Bhalerao <
>>> prasadbhalerao1983@gmail.com> wrote:
>>>
>>>>
>>>> Can someone please advise on this?
>>>>>
>>>>> ---------- Forwarded message ---------
>>>>> From: Prasad Bhalerao <pr...@gmail.com>
>>>>> Date: Fri, Nov 29, 2019 at 7:53 AM
>>>>> Subject: Re: Local node terminated after segmentation
>>>>> To: <us...@ignite.apache.org>
>>>>>
>>>>>
>>>>> I had checked the resource you mentioned, but I was confused by the
>>>>> GridGain doc describing it as protection against split-brain, because if
>>>>> the node is segmented the only thing one can do is stop/restart/noop.
>>>>> I was just wondering how it provides protection against split-brain.
>>>>> Now I think that by protection it means killing the segmented node/nodes
>>>>> or restarting them and bringing them back into the cluster.
>>>>>
>>>>> Ignite uses TcpDiscoverySpi to send a heartbeat to the next node in the
>>>>> ring to check whether that node is reachable.
>>>>> So the question is: in what situation does one need additional ways to
>>>>> check whether the node is reachable, using different resolvers?
>>>>>
>>>>> Please let me know if my understanding is correct.
>>>>>
>>>>> I had checked the code in the article you mentioned. It requires a node
>>>>> to be configured in advance so that the resolver can check whether that
>>>>> node is reachable from the local host. It does not check whether all the
>>>>> nodes are reachable from the local host.
>>>>>
>>>>> E.g. node1 will check node2, node2 will check node3, and node3 will
>>>>> check node1 to complete the ring.
>>>>> I am just wondering how to configure this plugin in a prod env with a
>>>>> large cluster.
>>>>> I tried to check the GridGain doc to see if they have provided any sample
>>>>> code to configure their plugins, just to get an idea, but did not find any.
>>>>>
>>>>> Can you please advise?
>>>>>
>>>>>
>>>>> Thanks,
>>>>> Prasad
>>>>>
>>>>> On Thu 28 Nov, 2019, 11:41 PM akurbanov <antkr.dev@gmail.com> wrote:
>>>>>
>>>>>> Hello,
>>>>>>
>>>>>> Basically this is a mechanism to implement custom logical/network
>>>>>> split-brain protection. Segmentation resolvers allow you to implement
>>>>>> a way to determine whether a node has to be segmented/stopped/etc. in
>>>>>> the isValidSegment() method, and possibly to use different combinations
>>>>>> of resolvers within the processor.
>>>>>>
>>>>>> If you want to check out how it could be done, some articles/source
>>>>>> samples that might give you good insight can easily be found on the
>>>>>> web, such as:
>>>>>>
>>>>>> https://medium.com/@aamargajbhiye/how-to-handle-network-segmentation-in-apache-ignite-35dc5fa6f239
>>>>>>
>>>>>> http://apache-ignite-users.70518.x6.nabble.com/Segmentation-Plugin-blog-or-article-td27955.html
>>>>>>
>>>>>> 2-3 are described in the documentation; copying the link just to
>>>>>> point out which page:
>>>>>> https://apacheignite.readme.io/docs/critical-failures-handling
>>>>>>
>>>>>> The answer to 2 is: by default, Ignite doesn't ignore FailureType
>>>>>> SEGMENTATION and calls the failure handler in this case. The actions
>>>>>> taken are defined in the failure handler.
>>>>>>
>>>>>> The AbstractFailureHandler class has only SYSTEM_WORKER_BLOCKED and
>>>>>> SYSTEM_CRITICAL_OPERATION_TIMEOUT ignored by default. However, you can
>>>>>> override the failure handler and call .setIgnoredFailureTypes().
>>>>>>
>>>>>> Links:
>>>>>> Extend this class:
>>>>>>
>>>>>> https://github.com/apache/ignite/blob/master/modules/core/src/main/java/org/apache/ignite/failure/AbstractFailureHandler.java
>>>>>> Check for custom implementations used in Ignite tests and how they
>>>>>> are used.
>>>>>>
>>>>>> Sample from tests:
>>>>>>
>>>>>> https://github.com/apache/ignite/blob/master/modules/core/src/test/java/org/apache/ignite/failure/SystemWorkersBlockingTest.java
>>>>>>
>>>>>> Failure processor:
>>>>>>
>>>>>> https://github.com/apache/ignite/blob/master/modules/core/src/main/java/org/apache/ignite/internal/processors/failure/FailureProcessor.java
>>>>>>
>>>>>> Best regards,
>>>>>> Anton
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>> --
>>>>>> Sent from: http://apache-ignite-users.70518.x6.nabble.com/
>>>>>>
>>>>>
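
The failure-handler and segmentation-policy settings mentioned in Anton's quoted reply can be wired up roughly as follows. This is a sketch only: the API names are taken from the Ignite sources linked above, and their availability in the 2.6.0 release used in the original report is not confirmed here. The timeout value is illustrative.

```java
import java.util.EnumSet;

import org.apache.ignite.Ignition;
import org.apache.ignite.configuration.IgniteConfiguration;
import org.apache.ignite.failure.FailureType;
import org.apache.ignite.failure.StopNodeOrHaltFailureHandler;
import org.apache.ignite.plugin.segmentation.SegmentationPolicy;

final class FailureHandlingExample {
    public static void main(String[] args) {
        IgniteConfiguration cfg = new IgniteConfiguration();

        // What the node does once it learns that it has been segmented.
        cfg.setSegmentationPolicy(SegmentationPolicy.STOP);

        // tryStop = true: attempt a graceful node stop first and halt the JVM
        // only if the stop has not completed within the timeout (milliseconds).
        StopNodeOrHaltFailureHandler handler =
            new StopNodeOrHaltFailureHandler(true, 10_000);

        // Repeats the two failure types the thread describes as ignored by
        // default; edit this set to change which failures the handler skips.
        // Assumption: this setter may be missing in older releases.
        handler.setIgnoredFailureTypes(EnumSet.of(
            FailureType.SYSTEM_WORKER_BLOCKED,
            FailureType.SYSTEM_CRITICAL_OPERATION_TIMEOUT));

        cfg.setFailureHandler(handler);

        Ignition.start(cfg);
    }
}
```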

Re: Local node terminated after segmentation

Posted by Akash Shinde <ak...@gmail.com>.

Can someone please help me on this?


Re: Local node terminated after segmentation

Posted by Akash Shinde <ak...@gmail.com>.

Hi,

Can you please explain at a high level how the GridGain implementation
protects against having two segments alive at the same time, which could
lead to data inconsistency over time? What exactly does it do to achieve this?

Regards,
A.
