Posted to user@ignite.apache.org by Chris Berry <ch...@gmail.com> on 2017/06/03 17:47:38 UTC

Ignite failing catastrophically

Hi,

I have a big problem. Ignite is failing catastrophically for me.

This is the scenario:

We start a Cluster of 15 Ignite Server Nodes.
These are initially empty.
Then some Kafka feeds are enabled that stream data into 4 independent
caches -- simultaneously (using DataStreamers).
Each cache is configured with 1 primary and 2 backups – and as a PARTITIONED
cache.
These attempt to load ~0.5M entries into each cache.
These Kafka feeds are streamed from a Client Node on 4 Threads into the
caches.
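
For a rough sense of scale, the figures above can be turned into a per-node entry count. This is only a back-of-the-envelope sketch: it assumes partitions are spread evenly over the server nodes, and the function name is made up for illustration.

```python
def entry_copies_per_node(caches, entries_per_cache, backups, server_nodes):
    """Estimate how many entry copies (primary + backup) each server node holds.

    Assumes an even partition distribution, which real clusters only approximate.
    """
    copies_per_entry = 1 + backups  # one primary plus its backups
    total_copies = caches * entries_per_cache * copies_per_entry
    return total_copies / server_nodes


# Figures from the scenario: 4 caches, ~0.5M entries each, 2 backups, 15 nodes.
per_node = entry_copies_per_node(4, 500_000, 2, 15)
print(f"~{per_node:,.0f} entry copies per node")
```

The count says nothing about memory by itself; the size of each entry, which is not given here, would be needed for that.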

Almost always a Node will fail during this operation.
And this will lead to a catastrophic, cascading failure of the entire
Cluster.

But on the failing Nodes, there is no information whatsoever as to what
caused the failure.
Nothing. No OOM. No Exceptions. Nothing.
The logs simply stop.
I have GC logging enabled, and there are no long pauses.
Thus, I am baffled.

I have tried increasing memory.
I have tried increasing timeouts to ridiculous numbers:

```
COMPUTE_TASK_TIMEOUT=5000
DISCOVERY_ACK_TIMEOUT=30000
DISCOVERY_JOIN_TIMEOUT=120000
DISCOVERY_MAX_ACK_TIMEOUT=37000
DISCOVERY_NETWORK_TIMEOUT=120000
FAILURE_DETECTION_TIMEOUT=120000
IGNITE_LOG_LEVEL=INFO
IGNITE_LONG_OPERATIONS_DUMP_TIMEOUT=200000
IGNITE_QUIET=false
```

But nothing helps.

What can I do to get better information out of Ignite??
It is basically failing silently.

Are there some tuning parameters that I am missing?
I would be happy to supply further config information.

This is with Ignite 2.0.0.

We have invested quite a bit of effort to get Ignite running for our
application.
And this is a show-stopper for us.
NOTE: this does not happen with the smaller feeds that we have in our dev
environment.

Thanks,
-- Chris

--
View this message in context: http://apache-ignite-users.70518.x6.nabble.com/Ignite-failing-catastrophically-tp13357.html
Sent from the Apache Ignite Users mailing list archive at Nabble.com.

Re: Ignite failing catastrophically

Posted by Chris Berry <ch...@gmail.com>.

Following up.

This turned out to be an "external issue".
We are using Mesos/Marathon to run our Cluster on AWS (in Docker),
and I had not given the Container enough memory overhead.
Ignite 2.0.0 uses more off-heap memory, and I needed to give the Container
considerably more memory to work in outside the Heap.
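
The sizing problem can be sketched as simple arithmetic: the Container limit has to cover the whole JVM process, not just the Heap. The function below is a hypothetical helper, and every number in it (the 512 MB overhead default, the 10% headroom, the 4 GB example sizes) is an assumption for illustration, not a value taken from my setup.

```python
import math


def container_limit_mb(heap_mb, off_heap_mb, jvm_overhead_mb=512, headroom=0.10):
    """Return a container memory limit (MB) that covers the whole JVM process.

    heap_mb          -- value passed to -Xmx
    off_heap_mb      -- off-heap memory Ignite is allowed to use (assumed known)
    jvm_overhead_mb  -- metaspace, thread stacks, GC structures (a guess; measure it)
    headroom         -- extra fraction kept free as a safety margin
    """
    if min(heap_mb, off_heap_mb, jvm_overhead_mb) < 0 or headroom < 0:
        raise ValueError("sizes and headroom must be non-negative")
    process_mb = heap_mb + off_heap_mb + jvm_overhead_mb
    return math.ceil(process_mb * (1 + headroom))  # round up to a whole MB


# Hypothetical node: 4 GB heap plus 4 GB off-heap.
print(container_limit_mb(heap_mb=4096, off_heap_mb=4096))
```

A limit sized from the Heap alone would be far below this figure, which is the kind of gap that gets a Container killed.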

Thus, Mesos/Marathon was assassinating my Nodes (for exceeding their memory
allocation).
That is why they suddenly stopped logging -- and were restarted.
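
For anyone chasing the same symptom, the exit status of the dead process is worth checking. A minimal sketch, assuming the limit is enforced with SIGKILL (which a JVM cannot catch, so nothing gets logged): by shell and Docker convention such a process reports 128 plus the signal number, i.e. 137. The helper name is made up for illustration.

```python
import signal


def describe_exit(code):
    """Explain a process exit status.

    Accepts either the shell/Docker convention (128 + signal number)
    or Python's subprocess convention (negative signal number).
    """
    if code == 0:
        return "exited normally"
    if code < 0:
        signum = -code
    elif code > 128:
        signum = code - 128
    else:
        return f"exited with error status {code}"
    try:
        name = signal.Signals(signum).name
    except ValueError:
        name = f"signal {signum}"  # number not known on this platform
    return f"killed by {name}"


print(describe_exit(137))
```

A status that decodes to SIGKILL points at something outside the JVM, such as the orchestrator or the kernel, rather than at Ignite itself.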

So we can mark this one Solved.

Thanks, 
-- Chris 



--
View this message in context: http://apache-ignite-users.70518.x6.nabble.com/Ignite-failing-catastrophically-tp13357p13360.html