Posted to user@ignite.apache.org by eugene miretsky <eu...@gmail.com> on 2018/08/30 22:19:01 UTC

Node keeps crashing under load

Hello,

I have a medium cluster set up for testing - 3 x r4.8xlarge EC2 nodes. It
has persistence enabled and zero backups.
- Full configs are attached.
- JVM settings are: JVM_OPTS="-Xms16g -Xmx64g -server -XX:+AggressiveOpts
-XX:MaxMetaspaceSize=256m  -XX:+AlwaysPreTouch -XX:+UseG1GC
-XX:+ScavengeBeforeFullGC -XX:+DisableExplicitGC"

The table has 145M rows and takes up about 180 GB of memory.
I am testing 2 things:
1) Writing SQL tables from Spark
2) Performing large SQL queries (from the web console): for example Select
COUNT (*) FROM (SELECT customer_id FROM MyTable where dt > '2018-05-12'
GROUP BY customer_id having SUM(column1) > 2 AND MAX(column2) < 1)

Most of the times I run the query, it fails after one of the nodes crashes
(it has finished a few times, and then crashed the next time). I have
similar stability issues when writing from Spark - at some point, one of
the nodes crashes. All I can see in the logs is:

[21:51:58,548][SEVERE][disco-event-worker-#101%Server%][] Critical system
error detected. Will be handled accordingly to configured handler
[hnd=class o.a.i.failure.StopNodeFailureHandler, failureCtx=FailureContext
[type=SEGMENTATION, err=null]]

[21:51:58,549][SEVERE][disco-event-worker-#101%Server%][FailureProcessor]
Ignite node is in invalid state due to a critical failure.

[21:51:58,549][SEVERE][node-stopper][] Stopping local node on Ignite
failure: [failureCtx=FailureContext [type=SEGMENTATION, err=null]]

[21:52:03] Ignite node stopped OK [name=Server, uptime=00:07:06.780]

My questions are:
1) What is causing the issue?
2) How can I debug it better?

The rate of crashes and our inability to debug them are becoming quite
a concern.

Cheers,
Eugene

Re: Node keeps crashing under load

Posted by Ilya Kasnacheev <il...@gmail.com>.
Hello!

I have filed this ticket: https://issues.apache.org/jira/browse/IGNITE-9586

Hope that it eventually gets looked at by somebody in context.

Regards,
-- 
Ilya Kasnacheev


ср, 12 сент. 2018 г. в 22:10, eugene miretsky <eu...@gmail.com>:

> Good question :)
> yardstick does this, but not sure if it is a valid prod solution.
>
> https://github.com/apache/ignite/blob/3307a8b26ccb5f0bb7e9c387c73fd221b98ab668/modules/yardstick/src/main/java/org/apache/ignite/yardstick/jdbc/AbstractJdbcBenchmark.java
>
> We have set preferIPv4Stack=true and provided localAddress in the config -
> it seems to have solved the problem. (Didn't run it enough to be 100% sure)
>
> On Wed, Sep 12, 2018 at 10:59 AM Ilya Kasnacheev <
> ilya.kasnacheev@gmail.com> wrote:
>
>> Hello!
>>
>> How would you distinguish the wrong interface (172.17.0.1) from the
>> right one if you were Ignite?
>>
>> I think it's not the first time I have seen this problem but I have
>> positively no idea how to tackle it.
>> Maybe Docker experts could chime in?
>>
>> Regards,
>> --
>> Ilya Kasnacheev
>>
>>
>> ср, 12 сент. 2018 г. в 3:29, eugene miretsky <eu...@gmail.com>:
>>
>>> Thanks Ilya,
>>>
>>> We are writing to Ignite from Spark running in EMR. We don't know the
>>> address of the node in advance, we have tried
>>> 1) Set localHost in Ignite configuration to 127.0.0.1, as per the
>>> example online
>>> 2) Leave localHost unset, and let ignite figure out the host
>>>
>>> I have attached more logs at the end.
>>>
>>> My understanding is that Ignite should pick the first non-local address
>>> to publish, however, it seems like it picks randomly one of (a) proper
>>> address, (b) ipv6 address, (c) 127.0.0.1, (d)  172.17.0.1.
>>>
>>> A few questions:
>>> 1) How do we force Spark client to use the proper address
>>> 2) Where is 172.17.0.1 coming from? It is usually the default docker
>>> network host address, and it seems like Ignite creates a network interface
>>> for it on the instance. (otherwise I have no idea where the interface is
>>> coming from)
>>> 3) If there are communication errors, shouldn't the Zookeeper split
>>> brain resolver kick in and shut down the dead node. Or shouldn't at least
>>> the initiating node mark the remote node as dead?
>>>
>>> [19:36:26,189][INFO][grid-nio-worker-tcp-comm-15-#88%Server%][TcpCommunicationSpi]
>>> Accepted incoming communication connection [locAddr=/172.17.0.1:47100,
>>> rmtAddr=/172.21.86.7:41648]
>>>
>>> [19:36:26,190][INFO][grid-nio-worker-tcp-comm-3-#76%Server%][TcpCommunicationSpi]
>>> Accepted incoming communication connection [locAddr=/0:0:0:0:0:0:0:1:47100,
>>> rmtAddr=/0:0:0:0:0:0:0:1:52484]
>>>
>>> [19:36:26,191][INFO][grid-nio-worker-tcp-comm-5-#78%Server%][TcpCommunicationSpi]
>>> Accepted incoming communication connection [locAddr=/127.0.0.1:47100,
>>> rmtAddr=/127.0.0.1:37656]
>>>
>>> [19:36:26,191][INFO][grid-nio-worker-tcp-comm-1-#74%Server%][TcpCommunicationSpi]
>>> Established outgoing communication connection [locAddr=/
>>> 172.21.86.7:53272, rmtAddr=ip-172-21-86-175.ap-south-1.compute.internal/
>>> 172.21.86.175:47100]
>>>
>>> [19:36:26,191][INFO][grid-nio-worker-tcp-comm-0-#73%Server%][TcpCommunicationSpi]
>>> Established outgoing communication connection [locAddr=/172.17.0.1:41648,
>>> rmtAddr=ip-172-17-0-1.ap-south-1.compute.internal/172.17.0.1:47100]
>>>
>>> [19:36:26,193][INFO][grid-nio-worker-tcp-comm-4-#77%Server%][TcpCommunicationSpi]
>>> Established outgoing communication connection [locAddr=/127.0.0.1:37656,
>>> rmtAddr=/127.0.0.1:47100]
>>>
>>> [19:36:26,193][INFO][grid-nio-worker-tcp-comm-2-#75%Server%][TcpCommunicationSpi]
>>> Established outgoing communication connection
>>> [locAddr=/0:0:0:0:0:0:0:1:52484, rmtAddr=/0:0:0:0:0:0:0:1%lo:47100]
>>>
>>> [19:36:26,195][INFO][grid-nio-worker-tcp-comm-8-#81%Server%][TcpCommunicationSpi]
>>> Accepted incoming communication connection [locAddr=/172.17.0.1:47100,
>>> rmtAddr=/172.21.86.7:41656]
>>>
>>> [19:36:26,195][INFO][grid-nio-worker-tcp-comm-10-#83%Server%][TcpCommunicationSpi]
>>> Accepted incoming communication connection [locAddr=/0:0:0:0:0:0:0:1:47100,
>>> rmtAddr=/0:0:0:0:0:0:0:1:52492]
>>>
>>> [19:36:26,195][INFO][grid-nio-worker-tcp-comm-12-#85%Server%][TcpCommunicationSpi]
>>> Accepted incoming communication connection [locAddr=/127.0.0.1:47100,
>>> rmtAddr=/127.0.0.1:37664]
>>>
>>> [19:36:26,196][INFO][grid-nio-worker-tcp-comm-7-#80%Server%][TcpCommunicationSpi]
>>> Established outgoing communication connection [locAddr=/
>>> 172.21.86.7:41076, rmtAddr=ip-172-21-86-229.ap-south-1.compute.internal/
>>> 172.21.86.229:47100]
>>>
>>>
>>>
>>>
>>> On Mon, Sep 10, 2018 at 12:04 PM Ilya Kasnacheev <
>>> ilya.kasnacheev@gmail.com> wrote:
>>>
>>>> Hello!
>>>>
>>>> I can see a lot of errors like this one:
>>>>
>>>> [04:05:29,268][INFO][tcp-comm-worker-#1%Server%][ZookeeperDiscoveryImpl]
>>>> Created new communication error process future
>>>> [errNode=598e3ead-99b8-4c49-b7df-04d578dcbf5f, err=class
>>>> org.apache.ignite.IgniteCheckedException: Failed to connect to node (is
>>>> node still alive?). Make sure that each ComputeTask and cache Transaction
>>>> has a timeout set in order to prevent parties from waiting forever in case
>>>> of network issues [nodeId=598e3ead-99b8-4c49-b7df-04d578dcbf5f,
>>>> addrs=[ip-172-17-0-1.ap-south-1.compute.internal/172.17.0.1:47100,
>>>> ip-172-21-85-213.ap-south-1.compute.internal/172.21.85.213:47100,
>>>> /0:0:0:0:0:0:0:1%lo:47100, /127.0.0.1:47100]]]
>>>>
>>>> I think the problem is, you have two nodes, they both have 172.17.0.1
>>>> address but it's the different address (totally unrelated private nets).
>>>>
>>>> Try to specify your external address (such as 172.21.85.213) with
>>>> TcpCommunicationSpi.setLocalAddress() on each node.
>>>>
>>>> Regards,
>>>> --
>>>> Ilya Kasnacheev
>>>>
>>>>
>>>> пт, 7 сент. 2018 г. в 20:01, eugene miretsky <eugene.miretsky@gmail.com
>>>> >:
>>>>
>>>>> Hi all,
>>>>>
>>>>> Can somebody please provide some pointers on what could be the issue
>>>>> or how to debug it? We have a fairly large Ignite use case, but cannot go
>>>>> ahead with a POC because of these crashes.
>>>>>
>>>>> Cheers,
>>>>> Eugene
>>>>>
>>>>>
>>>>>
>>>>> On Fri, Aug 31, 2018 at 11:52 AM eugene miretsky <
>>>>> eugene.miretsky@gmail.com> wrote:
>>>>>
>>>>>> Also, don't want to spam the mailing list with more threads, but I
>>>>>> get the same stability issue when writing to Ignite from Spark. Logfile
>>>>>> from the crashed node (not same node as before, probably random) is
>>>>>> attached.
>>>>>>
>>>>>>  I have also attached a gc log from another node (I have gc logging
>>>>>> enabled only on one node)
>>>>>>
>>>>>>
>>>>>> On Fri, Aug 31, 2018 at 11:23 AM eugene miretsky <
>>>>>> eugene.miretsky@gmail.com> wrote:
>>>>>>
>>>>>>> Thanks Denis,
>>>>>>>
>>>>>>> Execution plan + all logs right after the carsh are attached.
>>>>>>>
>>>>>>> Cheers,
>>>>>>> Eugene
>>>>>>>  nohup.out
>>>>>>> <https://drive.google.com/file/d/10TvQOYgOAJpedrTw3IQ5ABwlW5QpK1Bc/view?usp=drive_web>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> On Fri, Aug 31, 2018 at 1:53 AM Denis Magda <dm...@apache.org>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> Eugene,
>>>>>>>>
>>>>>>>> Please share full logs from all the nodes and execution plan for
>>>>>>>> the query. That's what the community usually needs to help with
>>>>>>>> troubleshooting. Also, attach GC logs. Use these settings to gather them:
>>>>>>>> https://apacheignite.readme.io/docs/jvm-and-system-tuning#section-detailed-garbage-collection-stats
>>>>>>>>
>>>>>>>> --
>>>>>>>> Denis
>>>>>>>>
>>>>>>>> On Thu, Aug 30, 2018 at 3:19 PM eugene miretsky <
>>>>>>>> eugene.miretsky@gmail.com> wrote:
>>>>>>>>
>>>>>>>>> Hello,
>>>>>>>>>
>>>>>>>>> I have a medium cluster set up for testings - 3 x r4.8xlarge EC2
>>>>>>>>> nodes. It has persistence enabled, and zero backup.
>>>>>>>>> - Full configs are attached.
>>>>>>>>> - JVM settings are: JVM_OPTS="-Xms16g -Xmx64g -server
>>>>>>>>> -XX:+AggressiveOpts -XX:MaxMetaspaceSize=256m  -XX:+AlwaysPreTouch
>>>>>>>>> -XX:+UseG1GC -XX:+ScavengeBeforeFullGC -XX:+DisableExplicitGC"
>>>>>>>>>
>>>>>>>>> The table has 145M rows, and takes up about 180G of memory
>>>>>>>>> I testing 2 things
>>>>>>>>> 1) Writing SQL tables from Spark
>>>>>>>>> 2) Performing large SQL queries (from the web console): for
>>>>>>>>> example Select COUNT (*) FROM (SELECT customer_id FROM MyTable
>>>>>>>>> where dt > '2018-05-12' GROUP BY customer_id having SUM(column1) > 2 AND
>>>>>>>>> MAX(column2) < 1)
>>>>>>>>>
>>>>>>>>> Most of the times I run the query it fails after one of the nodes
>>>>>>>>> crashes (it has finished a few times, and then crashed the next time). I
>>>>>>>>> have also similar stability issues when writing from Spark - at some point,
>>>>>>>>> one of the nodes crashes. All I can see in the logs is
>>>>>>>>>
>>>>>>>>> [21:51:58,548][SEVERE][disco-event-worker-#101%Server%][] Critical
>>>>>>>>> system error detected. Will be handled accordingly to configured handler
>>>>>>>>> [hnd=class o.a.i.failure.StopNodeFailureHandler, failureCtx=FailureContext
>>>>>>>>> [type=SEGMENTATION, err=null]]
>>>>>>>>>
>>>>>>>>> [21:51:58,549][SEVERE][disco-event-worker-#101%Server%][FailureProcessor]
>>>>>>>>> Ignite node is in invalid state due to a critical failure.
>>>>>>>>>
>>>>>>>>> [21:51:58,549][SEVERE][node-stopper][] Stopping local node on
>>>>>>>>> Ignite failure: [failureCtx=FailureContext [type=SEGMENTATION, err=null]]
>>>>>>>>>
>>>>>>>>> [21:52:03] Ignite node stopped OK [name=Server,
>>>>>>>>> uptime=00:07:06.780]
>>>>>>>>>
>>>>>>>>> My questions are:
>>>>>>>>> 1) What is causing the issue?
>>>>>>>>> 2) How can I debug it better?
>>>>>>>>>
>>>>>>>>> The rate of crashes and our lack of ability to debug them is
>>>>>>>>> becoming quite a concern.
>>>>>>>>>
>>>>>>>>> Cheers,
>>>>>>>>> Eugene
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>

Re: Node keeps crashing under load

Posted by eugene miretsky <eu...@gmail.com>.
Good question :)
Yardstick does this, but I'm not sure if it is a valid production solution:
https://github.com/apache/ignite/blob/3307a8b26ccb5f0bb7e9c387c73fd221b98ab668/modules/yardstick/src/main/java/org/apache/ignite/yardstick/jdbc/AbstractJdbcBenchmark.java

We have set preferIPv4Stack=true and provided localAddress in the config,
and it seems to have solved the problem. (We haven't run it enough to be
100% sure.)
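
For reference, java.net.preferIPv4Stack is a standard JVM system property, so
it would typically be appended to the same JVM_OPTS shown earlier in the thread
(a sketch; the other options stay as they are):

JVM_OPTS="${JVM_OPTS} -Djava.net.preferIPv4Stack=true"

The localAddress itself goes into the node's Ignite configuration; a
TcpCommunicationSpi sketch appears further down the page.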

On Wed, Sep 12, 2018 at 10:59 AM Ilya Kasnacheev <il...@gmail.com>
wrote:

> Hello!
>
> How would you distinguish the wrong interface (172.17.0.1) from the right
> one if you were Ignite?
>
> I think it's not the first time I have seen this problem but I have
> positively no idea how to tackle it.
> Maybe Docker experts could chime in?
>
> Regards,
> --
> Ilya Kasnacheev
>
>
> ср, 12 сент. 2018 г. в 3:29, eugene miretsky <eu...@gmail.com>:
>
>> Thanks Ilya,
>>
>> We are writing to Ignite from Spark running in EMR. We don't know the
>> address of the node in advance, we have tried
>> 1) Set localHost in Ignite configuration to 127.0.0.1, as per the example
>> online
>> 2) Leave localHost unset, and let ignite figure out the host
>>
>> I have attached more logs at the end.
>>
>> My understanding is that Ignite should pick the first non-local address
>> to publish, however, it seems like it picks randomly one of (a) proper
>> address, (b) ipv6 address, (c) 127.0.0.1, (d)  172.17.0.1.
>>
>> A few questions:
>> 1) How do we force Spark client to use the proper address
>> 2) Where is 172.17.0.1 coming from? It is usually the default docker
>> network host address, and it seems like Ignite creates a network interface
>> for it on the instance. (otherwise I have no idea where the interface is
>> coming from)
>> 3) If there are communication errors, shouldn't the Zookeeper split brain
>> resolver kick in and shut down the dead node. Or shouldn't at least the
>> initiating node mark the remote node as dead?
>>
>> [19:36:26,189][INFO][grid-nio-worker-tcp-comm-15-#88%Server%][TcpCommunicationSpi]
>> Accepted incoming communication connection [locAddr=/172.17.0.1:47100,
>> rmtAddr=/172.21.86.7:41648]
>>
>> [19:36:26,190][INFO][grid-nio-worker-tcp-comm-3-#76%Server%][TcpCommunicationSpi]
>> Accepted incoming communication connection [locAddr=/0:0:0:0:0:0:0:1:47100,
>> rmtAddr=/0:0:0:0:0:0:0:1:52484]
>>
>> [19:36:26,191][INFO][grid-nio-worker-tcp-comm-5-#78%Server%][TcpCommunicationSpi]
>> Accepted incoming communication connection [locAddr=/127.0.0.1:47100,
>> rmtAddr=/127.0.0.1:37656]
>>
>> [19:36:26,191][INFO][grid-nio-worker-tcp-comm-1-#74%Server%][TcpCommunicationSpi]
>> Established outgoing communication connection [locAddr=/172.21.86.7:53272,
>> rmtAddr=ip-172-21-86-175.ap-south-1.compute.internal/172.21.86.175:47100]
>>
>> [19:36:26,191][INFO][grid-nio-worker-tcp-comm-0-#73%Server%][TcpCommunicationSpi]
>> Established outgoing communication connection [locAddr=/172.17.0.1:41648,
>> rmtAddr=ip-172-17-0-1.ap-south-1.compute.internal/172.17.0.1:47100]
>>
>> [19:36:26,193][INFO][grid-nio-worker-tcp-comm-4-#77%Server%][TcpCommunicationSpi]
>> Established outgoing communication connection [locAddr=/127.0.0.1:37656,
>> rmtAddr=/127.0.0.1:47100]
>>
>> [19:36:26,193][INFO][grid-nio-worker-tcp-comm-2-#75%Server%][TcpCommunicationSpi]
>> Established outgoing communication connection
>> [locAddr=/0:0:0:0:0:0:0:1:52484, rmtAddr=/0:0:0:0:0:0:0:1%lo:47100]
>>
>> [19:36:26,195][INFO][grid-nio-worker-tcp-comm-8-#81%Server%][TcpCommunicationSpi]
>> Accepted incoming communication connection [locAddr=/172.17.0.1:47100,
>> rmtAddr=/172.21.86.7:41656]
>>
>> [19:36:26,195][INFO][grid-nio-worker-tcp-comm-10-#83%Server%][TcpCommunicationSpi]
>> Accepted incoming communication connection [locAddr=/0:0:0:0:0:0:0:1:47100,
>> rmtAddr=/0:0:0:0:0:0:0:1:52492]
>>
>> [19:36:26,195][INFO][grid-nio-worker-tcp-comm-12-#85%Server%][TcpCommunicationSpi]
>> Accepted incoming communication connection [locAddr=/127.0.0.1:47100,
>> rmtAddr=/127.0.0.1:37664]
>>
>> [19:36:26,196][INFO][grid-nio-worker-tcp-comm-7-#80%Server%][TcpCommunicationSpi]
>> Established outgoing communication connection [locAddr=/172.21.86.7:41076,
>> rmtAddr=ip-172-21-86-229.ap-south-1.compute.internal/172.21.86.229:47100]
>>
>>
>>
>>
>> On Mon, Sep 10, 2018 at 12:04 PM Ilya Kasnacheev <
>> ilya.kasnacheev@gmail.com> wrote:
>>
>>> Hello!
>>>
>>> I can see a lot of errors like this one:
>>>
>>> [04:05:29,268][INFO][tcp-comm-worker-#1%Server%][ZookeeperDiscoveryImpl]
>>> Created new communication error process future
>>> [errNode=598e3ead-99b8-4c49-b7df-04d578dcbf5f, err=class
>>> org.apache.ignite.IgniteCheckedException: Failed to connect to node (is
>>> node still alive?). Make sure that each ComputeTask and cache Transaction
>>> has a timeout set in order to prevent parties from waiting forever in case
>>> of network issues [nodeId=598e3ead-99b8-4c49-b7df-04d578dcbf5f,
>>> addrs=[ip-172-17-0-1.ap-south-1.compute.internal/172.17.0.1:47100,
>>> ip-172-21-85-213.ap-south-1.compute.internal/172.21.85.213:47100,
>>> /0:0:0:0:0:0:0:1%lo:47100, /127.0.0.1:47100]]]
>>>
>>> I think the problem is, you have two nodes, they both have 172.17.0.1
>>> address but it's the different address (totally unrelated private nets).
>>>
>>> Try to specify your external address (such as 172.21.85.213) with
>>> TcpCommunicationSpi.setLocalAddress() on each node.
>>>
>>> Regards,
>>> --
>>> Ilya Kasnacheev
>>>
>>>
>>> пт, 7 сент. 2018 г. в 20:01, eugene miretsky <eugene.miretsky@gmail.com
>>> >:
>>>
>>>> Hi all,
>>>>
>>>> Can somebody please provide some pointers on what could be the issue or
>>>> how to debug it? We have a fairly large Ignite use case, but cannot go
>>>> ahead with a POC because of these crashes.
>>>>
>>>> Cheers,
>>>> Eugene
>>>>
>>>>
>>>>
>>>> On Fri, Aug 31, 2018 at 11:52 AM eugene miretsky <
>>>> eugene.miretsky@gmail.com> wrote:
>>>>
>>>>> Also, don't want to spam the mailing list with more threads, but I get
>>>>> the same stability issue when writing to Ignite from Spark. Logfile from
>>>>> the crashed node (not same node as before, probably random) is attached.
>>>>>
>>>>>  I have also attached a gc log from another node (I have gc logging
>>>>> enabled only on one node)
>>>>>
>>>>>
>>>>> On Fri, Aug 31, 2018 at 11:23 AM eugene miretsky <
>>>>> eugene.miretsky@gmail.com> wrote:
>>>>>
>>>>>> Thanks Denis,
>>>>>>
>>>>>> Execution plan + all logs right after the carsh are attached.
>>>>>>
>>>>>> Cheers,
>>>>>> Eugene
>>>>>>  nohup.out
>>>>>> <https://drive.google.com/file/d/10TvQOYgOAJpedrTw3IQ5ABwlW5QpK1Bc/view?usp=drive_web>
>>>>>>
>>>>>>
>>>>>>
>>>>>> On Fri, Aug 31, 2018 at 1:53 AM Denis Magda <dm...@apache.org>
>>>>>> wrote:
>>>>>>
>>>>>>> Eugene,
>>>>>>>
>>>>>>> Please share full logs from all the nodes and execution plan for the
>>>>>>> query. That's what the community usually needs to help with
>>>>>>> troubleshooting. Also, attach GC logs. Use these settings to gather them:
>>>>>>> https://apacheignite.readme.io/docs/jvm-and-system-tuning#section-detailed-garbage-collection-stats
>>>>>>>
>>>>>>> --
>>>>>>> Denis
>>>>>>>
>>>>>>> On Thu, Aug 30, 2018 at 3:19 PM eugene miretsky <
>>>>>>> eugene.miretsky@gmail.com> wrote:
>>>>>>>
>>>>>>>> Hello,
>>>>>>>>
>>>>>>>> I have a medium cluster set up for testings - 3 x r4.8xlarge EC2
>>>>>>>> nodes. It has persistence enabled, and zero backup.
>>>>>>>> - Full configs are attached.
>>>>>>>> - JVM settings are: JVM_OPTS="-Xms16g -Xmx64g -server
>>>>>>>> -XX:+AggressiveOpts -XX:MaxMetaspaceSize=256m  -XX:+AlwaysPreTouch
>>>>>>>> -XX:+UseG1GC -XX:+ScavengeBeforeFullGC -XX:+DisableExplicitGC"
>>>>>>>>
>>>>>>>> The table has 145M rows, and takes up about 180G of memory
>>>>>>>> I testing 2 things
>>>>>>>> 1) Writing SQL tables from Spark
>>>>>>>> 2) Performing large SQL queries (from the web console): for example Select
>>>>>>>> COUNT (*) FROM (SELECT customer_id FROM MyTable where dt > '2018-05-12'
>>>>>>>> GROUP BY customer_id having SUM(column1) > 2 AND MAX(column2) < 1)
>>>>>>>>
>>>>>>>> Most of the times I run the query it fails after one of the nodes
>>>>>>>> crashes (it has finished a few times, and then crashed the next time). I
>>>>>>>> have also similar stability issues when writing from Spark - at some point,
>>>>>>>> one of the nodes crashes. All I can see in the logs is
>>>>>>>>
>>>>>>>> [21:51:58,548][SEVERE][disco-event-worker-#101%Server%][] Critical
>>>>>>>> system error detected. Will be handled accordingly to configured handler
>>>>>>>> [hnd=class o.a.i.failure.StopNodeFailureHandler, failureCtx=FailureContext
>>>>>>>> [type=SEGMENTATION, err=null]]
>>>>>>>>
>>>>>>>> [21:51:58,549][SEVERE][disco-event-worker-#101%Server%][FailureProcessor]
>>>>>>>> Ignite node is in invalid state due to a critical failure.
>>>>>>>>
>>>>>>>> [21:51:58,549][SEVERE][node-stopper][] Stopping local node on
>>>>>>>> Ignite failure: [failureCtx=FailureContext [type=SEGMENTATION, err=null]]
>>>>>>>>
>>>>>>>> [21:52:03] Ignite node stopped OK [name=Server, uptime=00:07:06.780]
>>>>>>>>
>>>>>>>> My questions are:
>>>>>>>> 1) What is causing the issue?
>>>>>>>> 2) How can I debug it better?
>>>>>>>>
>>>>>>>> The rate of crashes and our lack of ability to debug them is
>>>>>>>> becoming quite a concern.
>>>>>>>>
>>>>>>>> Cheers,
>>>>>>>> Eugene
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>

Re: Node keeps crashing under load

Posted by Ilya Kasnacheev <il...@gmail.com>.
Hello!

How would you distinguish the wrong interface (172.17.0.1) from the right
one if you were Ignite?

I think it's not the first time I have seen this problem but I have
positively no idea how to tackle it.
Maybe Docker experts could chime in?

Regards,
-- 
Ilya Kasnacheev


ср, 12 сент. 2018 г. в 3:29, eugene miretsky <eu...@gmail.com>:

> Thanks Ilya,
>
> We are writing to Ignite from Spark running in EMR. We don't know the
> address of the node in advance, we have tried
> 1) Set localHost in Ignite configuration to 127.0.0.1, as per the example
> online
> 2) Leave localHost unset, and let ignite figure out the host
>
> I have attached more logs at the end.
>
> My understanding is that Ignite should pick the first non-local address to
> publish, however, it seems like it picks randomly one of (a) proper
> address, (b) ipv6 address, (c) 127.0.0.1, (d)  172.17.0.1.
>
> A few questions:
> 1) How do we force Spark client to use the proper address
> 2) Where is 172.17.0.1 coming from? It is usually the default docker
> network host address, and it seems like Ignite creates a network interface
> for it on the instance. (otherwise I have no idea where the interface is
> coming from)
> 3) If there are communication errors, shouldn't the Zookeeper split brain
> resolver kick in and shut down the dead node. Or shouldn't at least the
> initiating node mark the remote node as dead?
>
> [19:36:26,189][INFO][grid-nio-worker-tcp-comm-15-#88%Server%][TcpCommunicationSpi]
> Accepted incoming communication connection [locAddr=/172.17.0.1:47100,
> rmtAddr=/172.21.86.7:41648]
>
> [19:36:26,190][INFO][grid-nio-worker-tcp-comm-3-#76%Server%][TcpCommunicationSpi]
> Accepted incoming communication connection [locAddr=/0:0:0:0:0:0:0:1:47100,
> rmtAddr=/0:0:0:0:0:0:0:1:52484]
>
> [19:36:26,191][INFO][grid-nio-worker-tcp-comm-5-#78%Server%][TcpCommunicationSpi]
> Accepted incoming communication connection [locAddr=/127.0.0.1:47100,
> rmtAddr=/127.0.0.1:37656]
>
> [19:36:26,191][INFO][grid-nio-worker-tcp-comm-1-#74%Server%][TcpCommunicationSpi]
> Established outgoing communication connection [locAddr=/172.21.86.7:53272,
> rmtAddr=ip-172-21-86-175.ap-south-1.compute.internal/172.21.86.175:47100]
>
> [19:36:26,191][INFO][grid-nio-worker-tcp-comm-0-#73%Server%][TcpCommunicationSpi]
> Established outgoing communication connection [locAddr=/172.17.0.1:41648,
> rmtAddr=ip-172-17-0-1.ap-south-1.compute.internal/172.17.0.1:47100]
>
> [19:36:26,193][INFO][grid-nio-worker-tcp-comm-4-#77%Server%][TcpCommunicationSpi]
> Established outgoing communication connection [locAddr=/127.0.0.1:37656,
> rmtAddr=/127.0.0.1:47100]
>
> [19:36:26,193][INFO][grid-nio-worker-tcp-comm-2-#75%Server%][TcpCommunicationSpi]
> Established outgoing communication connection
> [locAddr=/0:0:0:0:0:0:0:1:52484, rmtAddr=/0:0:0:0:0:0:0:1%lo:47100]
>
> [19:36:26,195][INFO][grid-nio-worker-tcp-comm-8-#81%Server%][TcpCommunicationSpi]
> Accepted incoming communication connection [locAddr=/172.17.0.1:47100,
> rmtAddr=/172.21.86.7:41656]
>
> [19:36:26,195][INFO][grid-nio-worker-tcp-comm-10-#83%Server%][TcpCommunicationSpi]
> Accepted incoming communication connection [locAddr=/0:0:0:0:0:0:0:1:47100,
> rmtAddr=/0:0:0:0:0:0:0:1:52492]
>
> [19:36:26,195][INFO][grid-nio-worker-tcp-comm-12-#85%Server%][TcpCommunicationSpi]
> Accepted incoming communication connection [locAddr=/127.0.0.1:47100,
> rmtAddr=/127.0.0.1:37664]
>
> [19:36:26,196][INFO][grid-nio-worker-tcp-comm-7-#80%Server%][TcpCommunicationSpi]
> Established outgoing communication connection [locAddr=/172.21.86.7:41076,
> rmtAddr=ip-172-21-86-229.ap-south-1.compute.internal/172.21.86.229:47100]
>
>
>
>
> On Mon, Sep 10, 2018 at 12:04 PM Ilya Kasnacheev <
> ilya.kasnacheev@gmail.com> wrote:
>
>> Hello!
>>
>> I can see a lot of errors like this one:
>>
>> [04:05:29,268][INFO][tcp-comm-worker-#1%Server%][ZookeeperDiscoveryImpl]
>> Created new communication error process future
>> [errNode=598e3ead-99b8-4c49-b7df-04d578dcbf5f, err=class
>> org.apache.ignite.IgniteCheckedException: Failed to connect to node (is
>> node still alive?). Make sure that each ComputeTask and cache Transaction
>> has a timeout set in order to prevent parties from waiting forever in case
>> of network issues [nodeId=598e3ead-99b8-4c49-b7df-04d578dcbf5f,
>> addrs=[ip-172-17-0-1.ap-south-1.compute.internal/172.17.0.1:47100,
>> ip-172-21-85-213.ap-south-1.compute.internal/172.21.85.213:47100,
>> /0:0:0:0:0:0:0:1%lo:47100, /127.0.0.1:47100]]]
>>
>> I think the problem is, you have two nodes, they both have 172.17.0.1
>> address but it's the different address (totally unrelated private nets).
>>
>> Try to specify your external address (such as 172.21.85.213) with
>> TcpCommunicationSpi.setLocalAddress() on each node.
>>
>> Regards,
>> --
>> Ilya Kasnacheev
>>
>>
>> пт, 7 сент. 2018 г. в 20:01, eugene miretsky <eu...@gmail.com>:
>>
>>> Hi all,
>>>
>>> Can somebody please provide some pointers on what could be the issue or
>>> how to debug it? We have a fairly large Ignite use case, but cannot go
>>> ahead with a POC because of these crashes.
>>>
>>> Cheers,
>>> Eugene
>>>
>>>
>>>
>>> On Fri, Aug 31, 2018 at 11:52 AM eugene miretsky <
>>> eugene.miretsky@gmail.com> wrote:
>>>
>>>> Also, don't want to spam the mailing list with more threads, but I get
>>>> the same stability issue when writing to Ignite from Spark. Logfile from
>>>> the crashed node (not same node as before, probably random) is attached.
>>>>
>>>>  I have also attached a gc log from another node (I have gc logging
>>>> enabled only on one node)
>>>>
>>>>
>>>> On Fri, Aug 31, 2018 at 11:23 AM eugene miretsky <
>>>> eugene.miretsky@gmail.com> wrote:
>>>>
>>>>> Thanks Denis,
>>>>>
>>>>> Execution plan + all logs right after the carsh are attached.
>>>>>
>>>>> Cheers,
>>>>> Eugene
>>>>>  nohup.out
>>>>> <https://drive.google.com/file/d/10TvQOYgOAJpedrTw3IQ5ABwlW5QpK1Bc/view?usp=drive_web>
>>>>>
>>>>>
>>>>>
>>>>> On Fri, Aug 31, 2018 at 1:53 AM Denis Magda <dm...@apache.org> wrote:
>>>>>
>>>>>> Eugene,
>>>>>>
>>>>>> Please share full logs from all the nodes and execution plan for the
>>>>>> query. That's what the community usually needs to help with
>>>>>> troubleshooting. Also, attach GC logs. Use these settings to gather them:
>>>>>> https://apacheignite.readme.io/docs/jvm-and-system-tuning#section-detailed-garbage-collection-stats
>>>>>>
>>>>>> --
>>>>>> Denis
>>>>>>
>>>>>> On Thu, Aug 30, 2018 at 3:19 PM eugene miretsky <
>>>>>> eugene.miretsky@gmail.com> wrote:
>>>>>>
>>>>>>> Hello,
>>>>>>>
>>>>>>> I have a medium cluster set up for testings - 3 x r4.8xlarge EC2
>>>>>>> nodes. It has persistence enabled, and zero backup.
>>>>>>> - Full configs are attached.
>>>>>>> - JVM settings are: JVM_OPTS="-Xms16g -Xmx64g -server
>>>>>>> -XX:+AggressiveOpts -XX:MaxMetaspaceSize=256m  -XX:+AlwaysPreTouch
>>>>>>> -XX:+UseG1GC -XX:+ScavengeBeforeFullGC -XX:+DisableExplicitGC"
>>>>>>>
>>>>>>> The table has 145M rows, and takes up about 180G of memory
>>>>>>> I testing 2 things
>>>>>>> 1) Writing SQL tables from Spark
>>>>>>> 2) Performing large SQL queries (from the web console): for example Select
>>>>>>> COUNT (*) FROM (SELECT customer_id FROM MyTable where dt > '2018-05-12'
>>>>>>> GROUP BY customer_id having SUM(column1) > 2 AND MAX(column2) < 1)
>>>>>>>
>>>>>>> Most of the times I run the query it fails after one of the nodes
>>>>>>> crashes (it has finished a few times, and then crashed the next time). I
>>>>>>> have also similar stability issues when writing from Spark - at some point,
>>>>>>> one of the nodes crashes. All I can see in the logs is
>>>>>>>
>>>>>>> [21:51:58,548][SEVERE][disco-event-worker-#101%Server%][] Critical
>>>>>>> system error detected. Will be handled accordingly to configured handler
>>>>>>> [hnd=class o.a.i.failure.StopNodeFailureHandler, failureCtx=FailureContext
>>>>>>> [type=SEGMENTATION, err=null]]
>>>>>>>
>>>>>>> [21:51:58,549][SEVERE][disco-event-worker-#101%Server%][FailureProcessor]
>>>>>>> Ignite node is in invalid state due to a critical failure.
>>>>>>>
>>>>>>> [21:51:58,549][SEVERE][node-stopper][] Stopping local node on Ignite
>>>>>>> failure: [failureCtx=FailureContext [type=SEGMENTATION, err=null]]
>>>>>>>
>>>>>>> [21:52:03] Ignite node stopped OK [name=Server, uptime=00:07:06.780]
>>>>>>>
>>>>>>> My questions are:
>>>>>>> 1) What is causing the issue?
>>>>>>> 2) How can I debug it better?
>>>>>>>
>>>>>>> The rate of crashes and our lack of ability to debug them is
>>>>>>> becoming quite a concern.
>>>>>>>
>>>>>>> Cheers,
>>>>>>> Eugene
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>

Re: Node keeps crashing under load

Posted by eugene miretsky <eu...@gmail.com>.
Thanks Ilya,

We are writing to Ignite from Spark running in EMR. We don't know the
address of the node in advance, so we have tried:
1) Setting localHost in the Ignite configuration to 127.0.0.1, as per the
example online
2) Leaving localHost unset, and letting Ignite figure out the host

I have attached more logs at the end.

My understanding is that Ignite should pick the first non-local address to
publish; however, it seems to pick one of the following at random: (a) the
proper address, (b) the IPv6 address, (c) 127.0.0.1, (d) 172.17.0.1.

A few questions:
1) How do we force the Spark client to use the proper address?
2) Where is 172.17.0.1 coming from? It is usually the default Docker
network host address, and it seems like Ignite creates a network interface
for it on the instance. (Otherwise I have no idea where the interface is
coming from.)
3) If there are communication errors, shouldn't the ZooKeeper split-brain
resolver kick in and shut down the dead node? Or shouldn't at least the
initiating node mark the remote node as dead?

[19:36:26,189][INFO][grid-nio-worker-tcp-comm-15-#88%Server%][TcpCommunicationSpi]
Accepted incoming communication connection [locAddr=/172.17.0.1:47100,
rmtAddr=/172.21.86.7:41648]

[19:36:26,190][INFO][grid-nio-worker-tcp-comm-3-#76%Server%][TcpCommunicationSpi]
Accepted incoming communication connection [locAddr=/0:0:0:0:0:0:0:1:47100,
rmtAddr=/0:0:0:0:0:0:0:1:52484]

[19:36:26,191][INFO][grid-nio-worker-tcp-comm-5-#78%Server%][TcpCommunicationSpi]
Accepted incoming communication connection [locAddr=/127.0.0.1:47100,
rmtAddr=/127.0.0.1:37656]

[19:36:26,191][INFO][grid-nio-worker-tcp-comm-1-#74%Server%][TcpCommunicationSpi]
Established outgoing communication connection [locAddr=/172.21.86.7:53272,
rmtAddr=ip-172-21-86-175.ap-south-1.compute.internal/172.21.86.175:47100]

[19:36:26,191][INFO][grid-nio-worker-tcp-comm-0-#73%Server%][TcpCommunicationSpi]
Established outgoing communication connection [locAddr=/172.17.0.1:41648,
rmtAddr=ip-172-17-0-1.ap-south-1.compute.internal/172.17.0.1:47100]

[19:36:26,193][INFO][grid-nio-worker-tcp-comm-4-#77%Server%][TcpCommunicationSpi]
Established outgoing communication connection [locAddr=/127.0.0.1:37656,
rmtAddr=/127.0.0.1:47100]

[19:36:26,193][INFO][grid-nio-worker-tcp-comm-2-#75%Server%][TcpCommunicationSpi]
Established outgoing communication connection
[locAddr=/0:0:0:0:0:0:0:1:52484, rmtAddr=/0:0:0:0:0:0:0:1%lo:47100]

[19:36:26,195][INFO][grid-nio-worker-tcp-comm-8-#81%Server%][TcpCommunicationSpi]
Accepted incoming communication connection [locAddr=/172.17.0.1:47100,
rmtAddr=/172.21.86.7:41656]

[19:36:26,195][INFO][grid-nio-worker-tcp-comm-10-#83%Server%][TcpCommunicationSpi]
Accepted incoming communication connection [locAddr=/0:0:0:0:0:0:0:1:47100,
rmtAddr=/0:0:0:0:0:0:0:1:52492]

[19:36:26,195][INFO][grid-nio-worker-tcp-comm-12-#85%Server%][TcpCommunicationSpi]
Accepted incoming communication connection [locAddr=/127.0.0.1:47100,
rmtAddr=/127.0.0.1:37664]

[19:36:26,196][INFO][grid-nio-worker-tcp-comm-7-#80%Server%][TcpCommunicationSpi]
Established outgoing communication connection [locAddr=/172.21.86.7:41076,
rmtAddr=ip-172-21-86-229.ap-south-1.compute.internal/172.21.86.229:47100]




On Mon, Sep 10, 2018 at 12:04 PM Ilya Kasnacheev <il...@gmail.com>
wrote:

> Hello!
>
> I can see a lot of errors like this one:
>
> [04:05:29,268][INFO][tcp-comm-worker-#1%Server%][ZookeeperDiscoveryImpl]
> Created new communication error process future
> [errNode=598e3ead-99b8-4c49-b7df-04d578dcbf5f, err=class
> org.apache.ignite.IgniteCheckedException: Failed to connect to node (is
> node still alive?). Make sure that each ComputeTask and cache Transaction
> has a timeout set in order to prevent parties from waiting forever in case
> of network issues [nodeId=598e3ead-99b8-4c49-b7df-04d578dcbf5f,
> addrs=[ip-172-17-0-1.ap-south-1.compute.internal/172.17.0.1:47100,
> ip-172-21-85-213.ap-south-1.compute.internal/172.21.85.213:47100,
> /0:0:0:0:0:0:0:1%lo:47100, /127.0.0.1:47100]]]
>
> I think the problem is, you have two nodes, they both have 172.17.0.1
> address but it's the different address (totally unrelated private nets).
>
> Try to specify your external address (such as 172.21.85.213) with
> TcpCommunicationSpi.setLocalAddress() on each node.
>
> Regards,
> --
> Ilya Kasnacheev
>
>
> пт, 7 сент. 2018 г. в 20:01, eugene miretsky <eu...@gmail.com>:
>
>> Hi all,
>>
>> Can somebody please provide some pointers on what could be the issue or
>> how to debug it? We have a fairly large Ignite use case, but cannot go
>> ahead with a POC because of these crashes.
>>
>> Cheers,
>> Eugene
>>
>>
>>
>> On Fri, Aug 31, 2018 at 11:52 AM eugene miretsky <
>> eugene.miretsky@gmail.com> wrote:
>>
>>> Also, don't want to spam the mailing list with more threads, but I get
>>> the same stability issue when writing to Ignite from Spark. Logfile from
>>> the crashed node (not same node as before, probably random) is attached.
>>>
>>>  I have also attached a gc log from another node (I have gc logging
>>> enabled only on one node)
>>>
>>>
>>> On Fri, Aug 31, 2018 at 11:23 AM eugene miretsky <
>>> eugene.miretsky@gmail.com> wrote:
>>>
>>>> Thanks Denis,
>>>>
>>>> Execution plan + all logs right after the carsh are attached.
>>>>
>>>> Cheers,
>>>> Eugene
>>>>  nohup.out
>>>> <https://drive.google.com/file/d/10TvQOYgOAJpedrTw3IQ5ABwlW5QpK1Bc/view?usp=drive_web>
>>>>
>>>>
>>>>
>>>> On Fri, Aug 31, 2018 at 1:53 AM Denis Magda <dm...@apache.org> wrote:
>>>>
>>>>> Eugene,
>>>>>
>>>>> Please share full logs from all the nodes and execution plan for the
>>>>> query. That's what the community usually needs to help with
>>>>> troubleshooting. Also, attach GC logs. Use these settings to gather them:
>>>>> https://apacheignite.readme.io/docs/jvm-and-system-tuning#section-detailed-garbage-collection-stats
>>>>>
>>>>> --
>>>>> Denis
>>>>>
>>>>> On Thu, Aug 30, 2018 at 3:19 PM eugene miretsky <
>>>>> eugene.miretsky@gmail.com> wrote:
>>>>>
>>>>>> Hello,
>>>>>>
>>>>>> I have a medium cluster set up for testings - 3 x r4.8xlarge EC2
>>>>>> nodes. It has persistence enabled, and zero backup.
>>>>>> - Full configs are attached.
>>>>>> - JVM settings are: JVM_OPTS="-Xms16g -Xmx64g -server
>>>>>> -XX:+AggressiveOpts -XX:MaxMetaspaceSize=256m  -XX:+AlwaysPreTouch
>>>>>> -XX:+UseG1GC -XX:+ScavengeBeforeFullGC -XX:+DisableExplicitGC"
>>>>>>
>>>>>> The table has 145M rows, and takes up about 180G of memory
>>>>>> I testing 2 things
>>>>>> 1) Writing SQL tables from Spark
>>>>>> 2) Performing large SQL queries (from the web console): for example Select
>>>>>> COUNT (*) FROM (SELECT customer_id FROM MyTable where dt > '2018-05-12'
>>>>>> GROUP BY customer_id having SUM(column1) > 2 AND MAX(column2) < 1)
>>>>>>
>>>>>> Most of the times I run the query it fails after one of the nodes
>>>>>> crashes (it has finished a few times, and then crashed the next time). I
>>>>>> have also similar stability issues when writing from Spark - at some point,
>>>>>> one of the nodes crashes. All I can see in the logs is
>>>>>>
>>>>>> [21:51:58,548][SEVERE][disco-event-worker-#101%Server%][] Critical
>>>>>> system error detected. Will be handled accordingly to configured handler
>>>>>> [hnd=class o.a.i.failure.StopNodeFailureHandler, failureCtx=FailureContext
>>>>>> [type=SEGMENTATION, err=null]]
>>>>>>
>>>>>> [21:51:58,549][SEVERE][disco-event-worker-#101%Server%][FailureProcessor]
>>>>>> Ignite node is in invalid state due to a critical failure.
>>>>>>
>>>>>> [21:51:58,549][SEVERE][node-stopper][] Stopping local node on Ignite
>>>>>> failure: [failureCtx=FailureContext [type=SEGMENTATION, err=null]]
>>>>>>
>>>>>> [21:52:03] Ignite node stopped OK [name=Server, uptime=00:07:06.780]
>>>>>>
>>>>>> My questions are:
>>>>>> 1) What is causing the issue?
>>>>>> 2) How can I debug it better?
>>>>>>
>>>>>> The rate of crashes and our lack of ability to debug them is becoming
>>>>>> quite a concern.
>>>>>>
>>>>>> Cheers,
>>>>>> Eugene
>>>>>>
>>>>>>
>>>>>>
>>>>>>

Re: Node keeps crashing under load

Posted by Ilya Kasnacheev <il...@gmail.com>.
Hello!

I can see a lot of errors like this one:

[04:05:29,268][INFO][tcp-comm-worker-#1%Server%][ZookeeperDiscoveryImpl]
Created new communication error process future
[errNode=598e3ead-99b8-4c49-b7df-04d578dcbf5f, err=class
org.apache.ignite.IgniteCheckedException: Failed to connect to node (is
node still alive?). Make sure that each ComputeTask and cache Transaction
has a timeout set in order to prevent parties from waiting forever in case
of network issues [nodeId=598e3ead-99b8-4c49-b7df-04d578dcbf5f,
addrs=[ip-172-17-0-1.ap-south-1.compute.internal/172.17.0.1:47100,
ip-172-21-85-213.ap-south-1.compute.internal/172.21.85.213:47100,
/0:0:0:0:0:0:0:1%lo:47100, /127.0.0.1:47100]]]

I think the problem is that you have two nodes which both have a 172.17.0.1
address, but it is a different address on each node (totally unrelated
private networks).

Try to specify your external address (such as 172.21.85.213) with
TcpCommunicationSpi.setLocalAddress() on each node.
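
For illustration, a minimal sketch of that suggestion (the address handling is
an assumption - each node would supply its own external interface address, for
example via a command-line argument or instance metadata):

import org.apache.ignite.Ignition;
import org.apache.ignite.configuration.IgniteConfiguration;
import org.apache.ignite.spi.communication.tcp.TcpCommunicationSpi;

public class ExternalAddressNode {
    public static void main(String[] args) {
        // Hypothetical: the node's external address is passed in rather than
        // hard-coded, so each node announces its own 172.21.x.x interface.
        String externalAddr = args.length > 0 ? args[0] : "172.21.85.213";

        // Restrict communication to the external interface so the Docker
        // bridge address (172.17.0.1) is never published to other nodes.
        TcpCommunicationSpi commSpi = new TcpCommunicationSpi();
        commSpi.setLocalAddress(externalAddr);

        IgniteConfiguration cfg = new IgniteConfiguration();
        cfg.setCommunicationSpi(commSpi);

        Ignition.start(cfg);
    }
}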

Regards,
-- 
Ilya Kasnacheev


пт, 7 сент. 2018 г. в 20:01, eugene miretsky <eu...@gmail.com>:

> Hi all,
>
> Can somebody please provide some pointers on what could be the issue or
> how to debug it? We have a fairly large Ignite use case, but cannot go
> ahead with a POC because of these crashes.
>
> Cheers,
> Eugene
>
>
>
> On Fri, Aug 31, 2018 at 11:52 AM eugene miretsky <
> eugene.miretsky@gmail.com> wrote:
>
>> Also, don't want to spam the mailing list with more threads, but I get
>> the same stability issue when writing to Ignite from Spark. Logfile from
>> the crashed node (not same node as before, probably random) is attached.
>>
>>  I have also attached a gc log from another node (I have gc logging
>> enabled only on one node)
>>
>>
>> On Fri, Aug 31, 2018 at 11:23 AM eugene miretsky <
>> eugene.miretsky@gmail.com> wrote:
>>
>>> Thanks Denis,
>>>
>>> Execution plan + all logs right after the carsh are attached.
>>>
>>> Cheers,
>>> Eugene
>>>  nohup.out
>>> <https://drive.google.com/file/d/10TvQOYgOAJpedrTw3IQ5ABwlW5QpK1Bc/view?usp=drive_web>
>>>
>>>
>>>
>>> On Fri, Aug 31, 2018 at 1:53 AM Denis Magda <dm...@apache.org> wrote:
>>>
>>>> Eugene,
>>>>
>>>> Please share full logs from all the nodes and execution plan for the
>>>> query. That's what the community usually needs to help with
>>>> troubleshooting. Also, attach GC logs. Use these settings to gather them:
>>>> https://apacheignite.readme.io/docs/jvm-and-system-tuning#section-detailed-garbage-collection-stats
>>>>
>>>> --
>>>> Denis
>>>>
>>>> On Thu, Aug 30, 2018 at 3:19 PM eugene miretsky <
>>>> eugene.miretsky@gmail.com> wrote:
>>>>
>>>>> Hello,
>>>>>
>>>>> I have a medium cluster set up for testings - 3 x r4.8xlarge EC2
>>>>> nodes. It has persistence enabled, and zero backup.
>>>>> - Full configs are attached.
>>>>> - JVM settings are: JVM_OPTS="-Xms16g -Xmx64g -server
>>>>> -XX:+AggressiveOpts -XX:MaxMetaspaceSize=256m  -XX:+AlwaysPreTouch
>>>>> -XX:+UseG1GC -XX:+ScavengeBeforeFullGC -XX:+DisableExplicitGC"
>>>>>
>>>>> The table has 145M rows, and takes up about 180G of memory
>>>>> I testing 2 things
>>>>> 1) Writing SQL tables from Spark
>>>>> 2) Performing large SQL queries (from the web console): for example Select
>>>>> COUNT (*) FROM (SELECT customer_id FROM MyTable where dt > '2018-05-12'
>>>>> GROUP BY customer_id having SUM(column1) > 2 AND MAX(column2) < 1)
>>>>>
>>>>> Most of the times I run the query it fails after one of the nodes
>>>>> crashes (it has finished a few times, and then crashed the next time). I
>>>>> have also similar stability issues when writing from Spark - at some point,
>>>>> one of the nodes crashes. All I can see in the logs is
>>>>>
>>>>> [21:51:58,548][SEVERE][disco-event-worker-#101%Server%][] Critical
>>>>> system error detected. Will be handled accordingly to configured handler
>>>>> [hnd=class o.a.i.failure.StopNodeFailureHandler, failureCtx=FailureContext
>>>>> [type=SEGMENTATION, err=null]]
>>>>>
>>>>> [21:51:58,549][SEVERE][disco-event-worker-#101%Server%][FailureProcessor]
>>>>> Ignite node is in invalid state due to a critical failure.
>>>>>
>>>>> [21:51:58,549][SEVERE][node-stopper][] Stopping local node on Ignite
>>>>> failure: [failureCtx=FailureContext [type=SEGMENTATION, err=null]]
>>>>>
>>>>> [21:52:03] Ignite node stopped OK [name=Server, uptime=00:07:06.780]
>>>>>
>>>>> My questions are:
>>>>> 1) What is causing the issue?
>>>>> 2) How can I debug it better?
>>>>>
>>>>> The rate of crashes and our lack of ability to debug them is becoming
>>>>> quite a concern.
>>>>>
>>>>> Cheers,
>>>>> Eugene
>>>>>
>>>>>
>>>>>
>>>>>

Re: Node keeps crashing under load

Posted by eugene miretsky <eu...@gmail.com>.
Hi all,

Can somebody please provide some pointers on what could be the issue or how
to debug it? We have a fairly large Ignite use case, but cannot go ahead
with a POC because of these crashes.

Cheers,
Eugene



On Fri, Aug 31, 2018 at 11:52 AM eugene miretsky <eu...@gmail.com>
wrote:

> Also, don't want to spam the mailing list with more threads, but I get the
> same stability issue when writing to Ignite from Spark. Logfile from the
> crashed node (not same node as before, probably random) is attached.
>
>  I have also attached a gc log from another node (I have gc logging
> enabled only on one node)
>
>
> On Fri, Aug 31, 2018 at 11:23 AM eugene miretsky <
> eugene.miretsky@gmail.com> wrote:
>
>> Thanks Denis,
>>
>> Execution plan + all logs right after the carsh are attached.
>>
>> Cheers,
>> Eugene
>>  nohup.out
>> <https://drive.google.com/file/d/10TvQOYgOAJpedrTw3IQ5ABwlW5QpK1Bc/view?usp=drive_web>
>>
>>
>>
>> On Fri, Aug 31, 2018 at 1:53 AM Denis Magda <dm...@apache.org> wrote:
>>
>>> Eugene,
>>>
>>> Please share full logs from all the nodes and execution plan for the
>>> query. That's what the community usually needs to help with
>>> troubleshooting. Also, attach GC logs. Use these settings to gather them:
>>> https://apacheignite.readme.io/docs/jvm-and-system-tuning#section-detailed-garbage-collection-stats
>>>
>>> --
>>> Denis
>>>
>>> On Thu, Aug 30, 2018 at 3:19 PM eugene miretsky <
>>> eugene.miretsky@gmail.com> wrote:
>>>
>>>> Hello,
>>>>
>>>> I have a medium cluster set up for testings - 3 x r4.8xlarge EC2 nodes.
>>>> It has persistence enabled, and zero backup.
>>>> - Full configs are attached.
>>>> - JVM settings are: JVM_OPTS="-Xms16g -Xmx64g -server
>>>> -XX:+AggressiveOpts -XX:MaxMetaspaceSize=256m  -XX:+AlwaysPreTouch
>>>> -XX:+UseG1GC -XX:+ScavengeBeforeFullGC -XX:+DisableExplicitGC"
>>>>
>>>> The table has 145M rows, and takes up about 180G of memory
>>>> I testing 2 things
>>>> 1) Writing SQL tables from Spark
>>>> 2) Performing large SQL queries (from the web console): for example Select
>>>> COUNT (*) FROM (SELECT customer_id FROM MyTable where dt > '2018-05-12'
>>>> GROUP BY customer_id having SUM(column1) > 2 AND MAX(column2) < 1)
>>>>
>>>> Most of the times I run the query it fails after one of the nodes
>>>> crashes (it has finished a few times, and then crashed the next time). I
>>>> have also similar stability issues when writing from Spark - at some point,
>>>> one of the nodes crashes. All I can see in the logs is
>>>>
>>>> [21:51:58,548][SEVERE][disco-event-worker-#101%Server%][] Critical
>>>> system error detected. Will be handled accordingly to configured handler
>>>> [hnd=class o.a.i.failure.StopNodeFailureHandler, failureCtx=FailureContext
>>>> [type=SEGMENTATION, err=null]]
>>>>
>>>> [21:51:58,549][SEVERE][disco-event-worker-#101%Server%][FailureProcessor]
>>>> Ignite node is in invalid state due to a critical failure.
>>>>
>>>> [21:51:58,549][SEVERE][node-stopper][] Stopping local node on Ignite
>>>> failure: [failureCtx=FailureContext [type=SEGMENTATION, err=null]]
>>>>
>>>> [21:52:03] Ignite node stopped OK [name=Server, uptime=00:07:06.780]
>>>>
>>>> My questions are:
>>>> 1) What is causing the issue?
>>>> 2) How can I debug it better?
>>>>
>>>> The rate of crashes and our lack of ability to debug them is becoming
>>>> quite a concern.
>>>>
>>>> Cheers,
>>>> Eugene
>>>>
>>>>
>>>>
>>>>

Re: Node keeps crashing under load

Posted by eugene miretsky <eu...@gmail.com>.
Also, I don't want to spam the mailing list with more threads, but I get the
same stability issue when writing to Ignite from Spark. The logfile from the
crashed node (not the same node as before, probably random) is attached.

I have also attached a GC log from another node (I have GC logging enabled
on only one node).


On Fri, Aug 31, 2018 at 11:23 AM eugene miretsky <eu...@gmail.com>
wrote:

> Thanks Denis,
>
> Execution plan + all logs right after the carsh are attached.
>
> Cheers,
> Eugene
>  nohup.out
> <https://drive.google.com/file/d/10TvQOYgOAJpedrTw3IQ5ABwlW5QpK1Bc/view?usp=drive_web>
>
>
>
> On Fri, Aug 31, 2018 at 1:53 AM Denis Magda <dm...@apache.org> wrote:
>
>> Eugene,
>>
>> Please share full logs from all the nodes and execution plan for the
>> query. That's what the community usually needs to help with
>> troubleshooting. Also, attach GC logs. Use these settings to gather them:
>> https://apacheignite.readme.io/docs/jvm-and-system-tuning#section-detailed-garbage-collection-stats
>>
>> --
>> Denis
>>
>> On Thu, Aug 30, 2018 at 3:19 PM eugene miretsky <
>> eugene.miretsky@gmail.com> wrote:
>>
>>> Hello,
>>>
>>> I have a medium cluster set up for testings - 3 x r4.8xlarge EC2 nodes.
>>> It has persistence enabled, and zero backup.
>>> - Full configs are attached.
>>> - JVM settings are: JVM_OPTS="-Xms16g -Xmx64g -server
>>> -XX:+AggressiveOpts -XX:MaxMetaspaceSize=256m  -XX:+AlwaysPreTouch
>>> -XX:+UseG1GC -XX:+ScavengeBeforeFullGC -XX:+DisableExplicitGC"
>>>
>>> The table has 145M rows, and takes up about 180G of memory
>>> I testing 2 things
>>> 1) Writing SQL tables from Spark
>>> 2) Performing large SQL queries (from the web console): for example Select
>>> COUNT (*) FROM (SELECT customer_id FROM MyTable where dt > '2018-05-12'
>>> GROUP BY customer_id having SUM(column1) > 2 AND MAX(column2) < 1)
>>>
>>> Most of the times I run the query it fails after one of the nodes
>>> crashes (it has finished a few times, and then crashed the next time). I
>>> have also similar stability issues when writing from Spark - at some point,
>>> one of the nodes crashes. All I can see in the logs is
>>>
>>> [21:51:58,548][SEVERE][disco-event-worker-#101%Server%][] Critical
>>> system error detected. Will be handled accordingly to configured handler
>>> [hnd=class o.a.i.failure.StopNodeFailureHandler, failureCtx=FailureContext
>>> [type=SEGMENTATION, err=null]]
>>>
>>> [21:51:58,549][SEVERE][disco-event-worker-#101%Server%][FailureProcessor]
>>> Ignite node is in invalid state due to a critical failure.
>>>
>>> [21:51:58,549][SEVERE][node-stopper][] Stopping local node on Ignite
>>> failure: [failureCtx=FailureContext [type=SEGMENTATION, err=null]]
>>>
>>> [21:52:03] Ignite node stopped OK [name=Server, uptime=00:07:06.780]
>>>
>>> My questions are:
>>> 1) What is causing the issue?
>>> 2) How can I debug it better?
>>>
>>> The rate of crashes and our lack of ability to debug them is becoming
>>> quite a concern.
>>>
>>> Cheers,
>>> Eugene
>>>
>>>
>>>
>>>

Re: Node keeps crashing under load

Posted by eugene miretsky <eu...@gmail.com>.
Thanks Denis,

The execution plan + all logs right after the crash are attached.

Cheers,
Eugene
 nohup.out
<https://drive.google.com/file/d/10TvQOYgOAJpedrTw3IQ5ABwlW5QpK1Bc/view?usp=drive_web>



On Fri, Aug 31, 2018 at 1:53 AM Denis Magda <dm...@apache.org> wrote:

> Eugene,
>
> Please share full logs from all the nodes and execution plan for the
> query. That's what the community usually needs to help with
> troubleshooting. Also, attach GC logs. Use these settings to gather them:
> https://apacheignite.readme.io/docs/jvm-and-system-tuning#section-detailed-garbage-collection-stats
>
> --
> Denis
>
> On Thu, Aug 30, 2018 at 3:19 PM eugene miretsky <eu...@gmail.com>
> wrote:
>
>> Hello,
>>
>> I have a medium cluster set up for testings - 3 x r4.8xlarge EC2 nodes.
>> It has persistence enabled, and zero backup.
>> - Full configs are attached.
>> - JVM settings are: JVM_OPTS="-Xms16g -Xmx64g -server
>> -XX:+AggressiveOpts -XX:MaxMetaspaceSize=256m  -XX:+AlwaysPreTouch
>> -XX:+UseG1GC -XX:+ScavengeBeforeFullGC -XX:+DisableExplicitGC"
>>
>> The table has 145M rows, and takes up about 180G of memory
>> I testing 2 things
>> 1) Writing SQL tables from Spark
>> 2) Performing large SQL queries (from the web console): for example Select
>> COUNT (*) FROM (SELECT customer_id FROM MyTable where dt > '2018-05-12'
>> GROUP BY customer_id having SUM(column1) > 2 AND MAX(column2) < 1)
>>
>> Most of the times I run the query it fails after one of the nodes crashes
>> (it has finished a few times, and then crashed the next time). I have also
>> similar stability issues when writing from Spark - at some point, one of
>> the nodes crashes. All I can see in the logs is
>>
>> [21:51:58,548][SEVERE][disco-event-worker-#101%Server%][] Critical system
>> error detected. Will be handled accordingly to configured handler
>> [hnd=class o.a.i.failure.StopNodeFailureHandler, failureCtx=FailureContext
>> [type=SEGMENTATION, err=null]]
>>
>> [21:51:58,549][SEVERE][disco-event-worker-#101%Server%][FailureProcessor]
>> Ignite node is in invalid state due to a critical failure.
>>
>> [21:51:58,549][SEVERE][node-stopper][] Stopping local node on Ignite
>> failure: [failureCtx=FailureContext [type=SEGMENTATION, err=null]]
>>
>> [21:52:03] Ignite node stopped OK [name=Server, uptime=00:07:06.780]
>>
>> My questions are:
>> 1) What is causing the issue?
>> 2) How can I debug it better?
>>
>> The rate of crashes and our lack of ability to debug them is becoming
>> quite a concern.
>>
>> Cheers,
>> Eugene
>>
>>
>>
>>

Re: Node keeps crashing under load

Posted by Denis Magda <dm...@apache.org>.
Eugene,

Please share full logs from all the nodes and the execution plan for the
query. That's what the community usually needs to help with troubleshooting.
Also, attach GC logs. Use these settings to gather them:
https://apacheignite.readme.io/docs/jvm-and-system-tuning#section-detailed-garbage-collection-stats
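
For example, assuming JDK 8 (where these flags apply), that usually means
appending something like the following to JVM_OPTS - the log path is a
placeholder, and the linked page has the full recommended set:

JVM_OPTS="${JVM_OPTS} -Xloggc:/var/log/ignite/gc.log -XX:+PrintGCDetails
-XX:+PrintGCDateStamps -XX:+PrintGCTimeStamps -XX:+UseGCLogFileRotation
-XX:NumberOfGCLogFiles=10 -XX:GCLogFileSize=100M"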

--
Denis

On Thu, Aug 30, 2018 at 3:19 PM eugene miretsky <eu...@gmail.com>
wrote:

> Hello,
>
> I have a medium cluster set up for testings - 3 x r4.8xlarge EC2 nodes. It
> has persistence enabled, and zero backup.
> - Full configs are attached.
> - JVM settings are: JVM_OPTS="-Xms16g -Xmx64g -server -XX:+AggressiveOpts
> -XX:MaxMetaspaceSize=256m  -XX:+AlwaysPreTouch -XX:+UseG1GC
> -XX:+ScavengeBeforeFullGC -XX:+DisableExplicitGC"
>
> The table has 145M rows, and takes up about 180G of memory
> I testing 2 things
> 1) Writing SQL tables from Spark
> 2) Performing large SQL queries (from the web console): for example Select
> COUNT (*) FROM (SELECT customer_id FROM MyTable where dt > '2018-05-12'
> GROUP BY customer_id having SUM(column1) > 2 AND MAX(column2) < 1)
>
> Most of the times I run the query it fails after one of the nodes crashes
> (it has finished a few times, and then crashed the next time). I have also
> similar stability issues when writing from Spark - at some point, one of
> the nodes crashes. All I can see in the logs is
>
> [21:51:58,548][SEVERE][disco-event-worker-#101%Server%][] Critical system
> error detected. Will be handled accordingly to configured handler
> [hnd=class o.a.i.failure.StopNodeFailureHandler, failureCtx=FailureContext
> [type=SEGMENTATION, err=null]]
>
> [21:51:58,549][SEVERE][disco-event-worker-#101%Server%][FailureProcessor]
> Ignite node is in invalid state due to a critical failure.
>
> [21:51:58,549][SEVERE][node-stopper][] Stopping local node on Ignite
> failure: [failureCtx=FailureContext [type=SEGMENTATION, err=null]]
>
> [21:52:03] Ignite node stopped OK [name=Server, uptime=00:07:06.780]
>
> My questions are:
> 1) What is causing the issue?
> 2) How can I debug it better?
>
> The rate of crashes and our lack of ability to debug them is becoming
> quite a concern.
>
> Cheers,
> Eugene
>
>
>
>