Posted to user@ignite.apache.org by Chris Berry <ch...@gmail.com> on 2017/01/21 15:27:27 UTC

OOMs with Compute Grid

Hello,
We are doing a simple POC to see if Ignite will work for our needs.
We built a small application that embeds Ignite and simulates what the real application does.

The real application maintains a large in-memory HashMap (25GB) of RatePlans and then computes Rates with them.
The RatePlans are large. The Rates themselves are small. The computation is significant.
The volume is high: rate requests come in batches of 200 at a time, with each host seeing ~1500 requests/second.
The latency is low: requests take around 30ms at the 95th percentile.
This HashMap is built to be sharded and distributed, but today we put it all on a single Node (25GB).

The code that does all of this is complex, and we are hoping to use Ignite as a Compute Grid.
So we built a simple system that simulates things.

And we have seen very disappointing/disturbing behavior.
As we execute our simulated compute closures, we very quickly see an OOM, which then destroys the entire cluster.
The OOM spreads throughout the cluster, and it quickly dies a horrible death.

It looks like a memory leak. Memory grows with compute executions.
This behavior happens with both large batches (200) and with small batches (5).

I’ve tried tuning the GC as described in the Ignite docs.
And it slowed the leak down a bit, but not much.

What are we doing wrong?
Clearly others are not seeing this behavior.
Is there a new known memory leak?
Is there something about closures that is causing the system to hold onto data??

Thanks,
-- Chris 


The test
=======
7 Nodes – running in Docker (Mesos/Marathon) – in AWS 
Each Node w/ 5GB Heap, 6GB for the Container as a whole.
The Nodes discover each other via Consul.
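
For context, the discovery wiring is roughly this (just a sketch; lookupIgniteAddressesFromConsul() is a hypothetical stand-in for the Consul lookup, and the real config carries more settings):

    // Sketch only: Consul is assumed to simply supply the address list
    // for a static IP finder.
    IgniteConfiguration cfg = new IgniteConfiguration();

    TcpDiscoveryVmIpFinder ipFinder = new TcpDiscoveryVmIpFinder();
    ipFinder.setAddresses(lookupIgniteAddressesFromConsul()); // hypothetical Consul helper

    TcpDiscoverySpi discoverySpi = new TcpDiscoverySpi();
    discoverySpi.setIpFinder(ipFinder);
    cfg.setDiscoverySpi(discoverySpi);

    Ignite ignite = Ignition.start(cfg);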

The experiment is in 2 parts.

A. First, load in 150000 FakeRatePlans (0.058MB/FakeRatePlan == 8.7GB), plus 1 backup == 17.4GB over 7 Nodes
This is ~2.486GB/Node. 
So we have ~2.5GB free to play with per Node.

Metrics for local node (to disable set 'metricsLogFrequency' to 0)
    ^-- Node [id=1bebdf2e, name=null, uptime=00:16:30:046]
    ^-- H/N/C [hosts=2, nodes=7, CPUs=8]
    ^-- CPU [cur=0.17%, avg=1.36%, GC=0%]
    ^-- Heap [used=2273MB, free=55.31%, comm=5086MB]
    ^-- Non heap [used=79MB, free=-1%, comm=81MB]
    ^-- Public thread pool [active=0, idle=0, qSize=0]
    ^-- System thread pool [active=0, idle=0, qSize=0]
    ^-- Outbound messages queue [size=0]
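
For reference, the part A load is roughly equivalent to the sketch below (the cache name "rateMap", the key format, and the use of IgniteDataStreamer are assumptions; the real loader differs in detail):

    CacheConfiguration<String, FakeRatePlan> cacheCfg = new CacheConfiguration<>("rateMap");
    cacheCfg.setCacheMode(CacheMode.PARTITIONED);
    cacheCfg.setBackups(1); // the 1 backup from the numbers above

    ignite.getOrCreateCache(cacheCfg);

    // The streamer batches the puts and spreads the 150000 plans (plus backups)
    // across the 7 Nodes according to key affinity.
    try (IgniteDataStreamer<String, FakeRatePlan> streamer = ignite.dataStreamer("rateMap")) {
        for (int i = 0; i < 150_000; i++) {
            FakeRatePlan plan = FakeRatePlan.random("plan-" + i); // hypothetical factory
            streamer.addData(plan.getRateKey(), plan);
        }
    }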

B. Second, start a “timed tester” on a single Node.
This tester does the following:
For N seconds, 
a) Creates a batch of 200 random keys, out of the 150000 available
b) Executes a distributed compute against these 200 keys, so that the compute most likely happens off-box
    The size of the data operated on is ~11.6MB
    The distributed compute is a closure. 
c) Passes back a stream of small FakeRates (0.1K/FakeRates == ~0.1MB) to the “tester Node”
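
Roughly, the tester loop looks like this (the key format and the multiplier are made-up placeholders; the distributed compute call is the distributeCalculation method shown below):

    long deadline = System.currentTimeMillis() + TimeUnit.SECONDS.toMillis(testSeconds);
    Random random = new Random();

    while (System.currentTimeMillis() < deadline) {
        // a) pick 200 random keys out of the 150000 loaded plans
        Set<String> batch = new HashSet<>();
        while (batch.size() < 200) {
            batch.add("plan-" + random.nextInt(150_000));
        }

        // b) + c) run the distributed compute and collect the small FakeRates
        // (the tester only times the call; the rates are then discarded)
        List<FakeRate> rates = distributeCalculation(batch, BigDecimal.valueOf(1.05));
    }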

The compute code looks like this:
Note, I have done my best to ensure no leaks here: 

    @Timed
    public List<FakeRate> distributeCalculation(Collection<String> cacheKeys, BigDecimal multiplier) {
        List<ComputeTaskFuture<FakeRate>> futures = null;
        Stream<ComputeTaskFuture<FakeRate>> stream = null;
        try {
            IgniteCompute compute = igniteClient.getCompute();
            String cacheName = igniteClient.getCacheName();
            IgniteCache<String, FakeRatePlan> cache = igniteClient.getRateMap();

            futures = new ArrayList<>(cacheKeys.size());

            for (String key : cacheKeys) {
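                // Affinity-routed: the closure runs on the node that owns the key,
                // so the large FakeRatePlan is read locally and never shipped.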
                compute.affinityCall(cacheName, key, () -> {
                    FakeRatePlan ratePlan = cache.localPeek(key, CachePeekMode.ALL);

                    FakeRate rate = new FakeRate();
                    rate.setRateKey(ratePlan.getRateKey());
                    rate.setRateAlgorithm(ratePlan.getRateAlgorithm());
                    rate.setRate(ratePlan.getRate().multiply(multiplier));
                    rate.setRent(ratePlan.getRent().multiply(multiplier));
                    rate.setTax(ratePlan.getTax().multiply(multiplier));

                    return rate;
                });
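                // getCompute() is assumed to hand back an async-mode compute facade,
                // so future() is the future of the affinityCall issued just above.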
                ComputeTaskFuture<FakeRate> future = compute.future();
                futures.add(future);
            }

            stream = futures.stream();
            List<FakeRate> rates = stream.map(IgniteFuture::get).collect(Collectors.toList());
            return rates;
        } finally {
            if (stream != null) {
                stream.close();
            }
            if (futures != null) {
                // cancel is a NoOp, if nothing needs cancelling
                for (ComputeTaskFuture<FakeRate> future : futures) {
                    future.cancel();
                }
            }
        }
    }
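
For completeness, the per-call future() pattern above relies on the compute facade being in async mode; igniteClient.getCompute() is presumably doing something like this (a simplified sketch, not the actual wrapper code):

    public IgniteCompute getCompute() {
        // async mode: makes compute.future() return the future of the last call
        return ignite.compute().withAsync();
    }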




Re: OOMs with Compute Grid

Posted by vkulichenko <va...@gmail.com>.
Hi Chris,

Did you check the heap dump? What is consuming memory?

-Val



--
View this message in context: http://apache-ignite-users.70518.x6.nabble.com/OOMs-with-Compute-Grid-tp10169p10173.html
Sent from the Apache Ignite Users mailing list archive at Nabble.com.