Posted to user@ignite.apache.org by Eduard Llull <ed...@llull.net> on 2020/02/05 08:43:55 UTC

Possible memory leak when using a near cache in Ignite.NET?

Hi everyone,

We have been using Ignite and Ignite.NET for the past few months on a
project. We currently have six Ignite servers (started with ignite.sh) and a
bunch of thick clients split across two .NET Core applications deployed on 30
servers.

We store de-normalized data in the Ignite data grid: one of the .NET Core
applications puts data into the cache and the other is a gRPC service that
just reads that data to compute a response. The data is split across a dozen
caches, which are created programmatically by the application that writes
into them.

The caches are PARTITIONED and TRANSACTIONAL and the partitions have two
backups.

It's been working fine so far, but we identified one particular cache as the
most heavily read, and to reduce network usage and improve the response time
of the gRPC service we decided to use a near cache. That cache has ~2300
entries occupying ~110MB, and the near cache is configured with maxSize=5000
and maxMemorySize=500000000 (~500MB).

[image: image.png]

The embedded JVM in the gRPC .NET Core application is started with the
following parameters:
-Xmx=1024
-Xms=1024
-Djava.net.preferIPv4Stack=true
-Xrs
-XX:+AlwaysPreTouch
-XX:+UseG1GC
-XX:+ScavengeBeforeFullGC
-XX:+DisableExplicitGC
-DIGNITE_NO_SHUTDOWN_HOOK=true
-Dcom.sun.management.jmxremote
-Dcom.sun.management.jmxremote.ssl=false
-Dcom.sun.management.jmxremote.authenticate=false
-Dcom.sun.management.jmxremote.local.only=false
-Dcom.sun.management.jmxremote.port=12345
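
For reference, these options would typically be passed to the embedded JVM
through IgniteConfiguration in Ignite.NET. Below is a minimal sketch (an
assumption about the startup code, not taken from the project; note that the
heap flags would normally be spelled -Xms1024m/-Xmx1024m, or set via
JvmInitialMemoryMb/JvmMaxMemoryMb as shown):

using Apache.Ignite.Core;

// Sketch (assumed startup code): starting the embedded JVM from .NET
// with the options listed above.
internal static class Startup
{
    private static void Main()
    {
        var cfg = new IgniteConfiguration
        {
            ClientMode = true,          // thick client node
            JvmInitialMemoryMb = 1024,  // emitted as -Xms1024m
            JvmMaxMemoryMb = 1024,      // emitted as -Xmx1024m
            JvmOptions = new[]
            {
                "-Djava.net.preferIPv4Stack=true",
                "-Xrs",
                "-XX:+AlwaysPreTouch",
                "-XX:+UseG1GC",
                "-XX:+ScavengeBeforeFullGC",
                "-XX:+DisableExplicitGC",
                "-DIGNITE_NO_SHUTDOWN_HOOK=true"
            }
        };

        var ignite = Ignition.Start(cfg);
    }
}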

If we don't use the near cache, on every gRPC call the server receives it
executes the following code to get the cache (this works fine):
return _ignite.GetCache<TKey, TValue>(cacheName);

And if we want to use the near cache, that line is changed to:
var nearCacheCfg = new NearCacheConfiguration
{
    // Use LRU eviction to automatically evict entries whenever the near
    // cache exceeds 5000 entries or ~500MB of raw data.
    EvictionPolicy = new LruEvictionPolicy
    {
        MaxSize = 5000,            // maximum number of entries
        MaxMemorySize = 500000000  // maximum memory, in bytes
    }
};
return _ignite.GetOrCreateNearCache<TKey, TValue>(cacheName, nearCacheCfg);
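
As an aside, this rebuilds the NearCacheConfiguration on every request. A
hypothetical refactoring (not something this thread identifies as the cause
of the leak) is to resolve each near cache once and reuse the handle:

using System.Collections.Concurrent;
using Apache.Ignite.Core;
using Apache.Ignite.Core.Cache;
using Apache.Ignite.Core.Cache.Configuration;
using Apache.Ignite.Core.Cache.Eviction;

// Hypothetical helper: create each near cache once and reuse the handle
// across gRPC calls instead of rebuilding the configuration per request.
internal sealed class CacheProvider
{
    private readonly IIgnite _ignite;
    private readonly ConcurrentDictionary<string, object> _handles =
        new ConcurrentDictionary<string, object>();

    public CacheProvider(IIgnite ignite)
    {
        _ignite = ignite;
    }

    public ICache<TKey, TValue> GetNearCache<TKey, TValue>(string cacheName)
    {
        return (ICache<TKey, TValue>) _handles.GetOrAdd(cacheName, name =>
            _ignite.GetOrCreateNearCache<TKey, TValue>(name,
                new NearCacheConfiguration
                {
                    EvictionPolicy = new LruEvictionPolicy
                    {
                        MaxSize = 5000,
                        MaxMemorySize = 500000000
                    }
                }));
    }
}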

But since we added the near cache the application memory usage never
stabilizes: without the near cache the application uses ~2.5GB of RAM on
every server, but when we use the near cache the application memory usage
never stops growing.

This is the memory usage of one of the servers with the gRPC application.
[image: image.png]

In the graph above, the version with the near cache was deployed on
February 3rd at 17:00. At 01:30 on February 4th the server started swapping,
and at around 7:45 the application crashed. This is a detail:
[image: image.png]

I would very much like to create a reproducer, but it looks like it would
take a very long time to reproduce the issue: the gRPC application needs
several hours to use all the memory, and given that every server running it
receives around 90 requests per second, if the memory leak exists it is very
slow.
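
For what it's worth, a reproducer would presumably look something like the
sketch below (the cache name, payload size, and request rate are assumptions
based on the numbers above; it assumes a running Ignite server and would
still need to run for hours under an external memory monitor):

using System;
using System.Threading;
using Apache.Ignite.Core;
using Apache.Ignite.Core.Cache.Configuration;
using Apache.Ignite.Core.Cache.Eviction;

// Hypothetical reproducer sketch: a thick client reading through a near
// cache at roughly the production rate. Assumes at least one Ignite server
// node is already running and discoverable.
internal static class NearCacheLeakRepro
{
    private static void Main()
    {
        var ignite = Ignition.Start(new IgniteConfiguration { ClientMode = true });

        // Make sure the underlying cache exists, then attach a near cache.
        ignite.GetOrCreateCache<int, byte[]>("repro-cache");
        var cache = ignite.GetOrCreateNearCache<int, byte[]>("repro-cache",
            new NearCacheConfiguration
            {
                EvictionPolicy = new LruEvictionPolicy
                {
                    MaxSize = 5000,
                    MaxMemorySize = 500000000
                }
            });

        var payload = new byte[48 * 1024]; // ~110MB / ~2300 entries
        for (var i = 0; i < 2300; i++)
            cache.Put(i, payload);

        var rnd = new Random();
        while (true)
        {
            cache.Get(rnd.Next(2300)); // read-mostly workload
            Thread.Sleep(11);          // ~90 requests per second
        }
    }
}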

Does anybody have any idea where the problem might be, or how to find it?


Thank you very much

Re: Possible memory leak when using a near cache in Ignite.NET?

Posted by Eduard Llull <ed...@llull.net>.
Hi Pavel,

Using the 2.8.0-alpha NuGet package would mean upgrading the servers as
well, and there is no official Apache Ignite release for version 2.8.0 yet.

What I did instead is apply your commit to the 2.7.0 git tag (it applied
almost cleanly; I just needed to adjust the patch in two csproj files). In my
development environment I can see that the Thread-NNNN threads get removed,
so I will deploy our application compiled against the patched 2.7.0
Ignite.NET to just one server and see if it fixes the memory consumption
issue.


On Fri, Feb 7, 2020 at 3:54 PM Pavel Tupitsyn <pt...@apache.org> wrote:

> Looks like you hit this bug:
> https://issues.apache.org/jira/browse/IGNITE-9638
>
> Near Cache is not related; it just increases memory usage, so the leak
> becomes a problem sooner.
>
> The bug is fixed in the upcoming Ignite 2.8.
> Pre-release NuGet packages are available on nuget.org:
> https://www.nuget.org/packages/Apache.Ignite/2.8.0-alpha20200122
> Can you please try the latest pre-release package and see if it fixes the
> issue?
>
>
> On Fri, Feb 7, 2020 at 2:56 PM Eduard Llull <ed...@llull.net> wrote:
>
>> Sorry guys I forgot to attach the thread dump.
>>
>>
>>
>> On Fri, Feb 7, 2020 at 12:53 PM Eduard Llull <ed...@llull.net> wrote:
>>
>>> Hi,
>>>
>>> I've seen something I can't explain, but it might account for the
>>> increase in memory consumption.
>>>
>>> Last night I left a jconsole connected to one of the servers, and this
>>> morning I found this on the Threads graph:
>>> [image: image.png]
>>>
>>> Which correlates with the increase of memory:
>>> [image: image.png]
>>>
>>> I've done a thread dump to see what those threads are: 629 of them have a
>>> name in the format "Thread-nn", and the stack trace of those threads is
>>> empty. The only information in the thread dump is:
>>>
>>>> "Thread-46" #193 prio=5 os_prio=0 tid=0x0000000003222000 nid=0x8e4
>>>> runnable [0x0000000000000000]
>>>>   java.lang.Thread.State: RUNNABLE
>>>>
>>>> "Thread-45" #192 prio=5 os_prio=0 tid=0x00007f78a801c000 nid=0x8ba
>>>> runnable [0x0000000000000000]
>>>>   java.lang.Thread.State: RUNNABLE
>>>>
>>>> "Thread-44" #184 prio=5 os_prio=0 tid=0x00007f73d442a800 nid=0x14f
>>>> runnable [0x0000000000000000]
>>>>   java.lang.Thread.State: RUNNABLE
>>>>
>>>> "Thread-43" #183 prio=5 os_prio=0 tid=0x00007f73c80f3000 nid=0xffcb
>>>> runnable [0x0000000000000000]
>>>>   java.lang.Thread.State: RUNNABLE
>>>>
>>>
>>> I connected VisualVM to the same server and I can see that more and
>>> more of these threads are created as time passes:
>>> [image: image.png]
>>> [image: image.png]
>>> Although the thread state of these threads is Running, only a few seem
>>> to be executing anything. I say so because, sampling the CPU for 2
>>> minutes, only a few of these threads did any work:
>>> [image: image.png]
>>> A similar thing can be seen with the Memory sampler: just a few of
>>> these "Thread-nn" threads are currently allocating memory:
>>> [image: image.png]
>>>
>>> But the weird thing is that the operating system (Ubuntu Linux, BTW)
>>> reports only 153 threads:
>>> $ ls /proc/60528/task | wc -l
>>> 153
>>>
>>>
>>>
>>> On Thu, Feb 6, 2020 at 8:33 AM Eduard Llull <ed...@llull.net> wrote:
>>>
>>>> Limiting the MaxSize to 10 elements made a difference: the application
>>>> stabilized at 2600MB.
>>>>
>>>> But there is something weird with the CurrentMemorySize reported by the
>>>> near cache through JMX. Currently it is showing a negative number:
>>>> [image: image.png]
>>>>
>>>> Today I will add more memory to one of the servers, and on another I'll
>>>> raise the MaxSize and MaxMemorySize gradually and track the change in
>>>> memory consumption after every change.
>>>>
>>>>
>>>> On Wed, Feb 5, 2020 at 5:58 PM Pavel Tupitsyn <pt...@apache.org>
>>>> wrote:
>>>>
>>>>> > do you mean a JVM heap dump
>>>>> yes
>>>>>
>>>>> > could the memory usage come from some .NET Core unmanaged code
>>>>> Very unlikely. Near Cache is a Java-only feature (a .NET-native
>>>>> version is in the works); it does not cause anything extra to happen in
>>>>> .NET beyond passing the config initially.
>>>>>
>>>>> On Wed, Feb 5, 2020 at 6:44 PM Eduard Llull <ed...@llull.net> wrote:
>>>>>
>>>>>> I just deployed a modified version of the application with the near
>>>>>> cache with MaxSize = 10 and MaxMemorySize = 10000000, as you suggested
>>>>>> previously. Now I'll have to wait to see if there is any change in the
>>>>>> memory evolution.
>>>>>>
>>>>>> Regarding the heap dump you mention in the "further steps", do you
>>>>>> mean a JVM heap dump?
>>>>>>
>>>>>> An idea that is floating in my mind, could the memory usage come from
>>>>>> some .NET Core unmanaged code?
>>>>>>
>>>>>>
>>>>>> On Wed, Feb 5, 2020 at 4:10 PM Pavel Tupitsyn <pt...@apache.org>
>>>>>> wrote:
>>>>>>
>>>>>>> Near Cache stores data on the JVM heap.
>>>>>>> Unmanaged ("offheap") memory and .NET Core heap should not be
>>>>>>> affected, and that is what we see on the graphs.
>>>>>>>
>>>>>>> Now we need to understand whether there is some memory leak on JVM
>>>>>>> heap, or we are simply running out of space there.
>>>>>>> You have 500MB limit for near cache, but this is counted using only
>>>>>>> raw data size, and does not account for per-entry overhead.
>>>>>>>
>>>>>>> So further steps are:
>>>>>>> - Either make limit smaller, or increase JVM heap - see if memory
>>>>>>> usage stabilizes at some point
>>>>>>> - If the above does not work, analyze heap dump to understand what
>>>>>>> causes memory consumption
>>>>>>>
>>>>>>> Keep us posted, and thanks for the detailed reply.
>>>>>>>
>>>>>>> On Wed, Feb 5, 2020 at 5:48 PM Eduard Llull <ed...@llull.net>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> I will try changing the MaxSize to 10 on just one of the servers,
>>>>>>>> because it will have an impact on its response times. I'll send
>>>>>>>> another email when I have some data after the change, but it will take
>>>>>>>> a few hours to see the memory evolution.
>>>>>>>>
>>>>>>>> I have enabled remote JMX on the server I'm using to debug the
>>>>>>>> issue and I have graphs since the last time the application was started.
>>>>>>>> These are graphs from the old jconsole but I think they will be good enough
>>>>>>>> for you.
>>>>>>>>
>>>>>>>> Any difference you spot between 12:45 and 13:00 is because I removed
>>>>>>>> this particular server from the load balancer and took a memory dump
>>>>>>>> using dotnet-dump, but I'm not able to find anything in the dump. In
>>>>>>>> fact, with `dotnet-counters monitor` the .NET Core heap dances around
>>>>>>>> 300MB.
>>>>>>>>
>>>>>>>> This is the heap:
>>>>>>>> [image: image.png]
>>>>>>>> And this is the non-heap:
>>>>>>>> [image: image.png]
>>>>>>>>
>>>>>>>> I reckon that the application might need a bigger heap, as the
>>>>>>>> garbage collector runs quite often. But that's not the problem I'm
>>>>>>>> trying to fix right now.
>>>>>>>>
>>>>>>>> Just for reference, this is the memory usage of the server where
>>>>>>>> that application runs.
>>>>>>>> [image: image.png]
>>>>>>>> And this is the working set in bytes reported by the .NET Core
>>>>>>>> application, where the client node runs (the time in this graph is in
>>>>>>>> UTC while the previous ones are in the +1 time zone); if we compare
>>>>>>>> this graph with the previous one, most of the memory usage on the
>>>>>>>> server comes from this application.
>>>>>>>> [image: image.png]
>>>>>>>>
>>>>>>>> For completeness, this is the evolution of the .NET Core heap size
>>>>>>>> (the time in this graph is also in UTC while the previous ones are in the
>>>>>>>> +1 time zone):
>>>>>>>> [image: image.png]
>>>>>>>>
>>>>>>>> So, at the moment of writing, the client node application has a JVM
>>>>>>>> heap limited to 1024MB, the JVM non-heap memory is currently 66MB, and
>>>>>>>> the .NET Core heap dances around 300MB. I have no clue what is causing
>>>>>>>> the steady increase in memory usage.
>>>>>>>>
>>>>>>>>
>>>>>>>> Thanks for your support.
>>>>>>>>
>>>>>>>>
>>>>>>>> On Wed, Feb 5, 2020 at 2:46 PM Pavel Tupitsyn <pt...@apache.org>
>>>>>>>> wrote:
>>>>>>>>
>>>>>>>>> What if you reduce MaxSize to some small number, like 10, does it
>>>>>>>>> solve the problem?
>>>>>>>>> Can you please run jvisualvm and see what happens with the JVM
>>>>>>>>> heap?
>>>>>>>>>
>>>>>>>>> On Wed, Feb 5, 2020 at 12:28 PM Eduard Llull <ed...@llull.net>
>>>>>>>>> wrote:
>>>>>>>>>
>>>>>>>>>> Hi Pavel,
>>>>>>>>>>
>>>>>>>>>> We have six servers, but these don't have any issue, and 40
>>>>>>>>>> client nodes (the Ignite node is started with
>>>>>>>>>> IgniteConfiguration.ClientMode = true).
>>>>>>>>>>
>>>>>>>>>> The 40 client nodes are the ones where we are having the memory
>>>>>>>>>> issue.
>>>>>>>>>>
>>>>>>>>>> The _ignite.GetOrCreateNearCache is executed on the client nodes.
>>>>>>>>>> We also tried using the following code, but the memory issue was the
>>>>>>>>>> same:
>>>>>>>>>> var nearCacheCfg = new NearCacheConfiguration
>>>>>>>>>> {
>>>>>>>>>>     // Use LRU eviction to automatically evict entries whenever the
>>>>>>>>>>     // near cache exceeds 5000 entries or ~500MB of raw data.
>>>>>>>>>>     EvictionPolicy = new LruEvictionPolicy
>>>>>>>>>>     {
>>>>>>>>>>         MaxSize = 5000,            // maximum number of entries
>>>>>>>>>>         MaxMemorySize = 500000000  // maximum memory, in bytes
>>>>>>>>>>     }
>>>>>>>>>> };
>>>>>>>>>> return _ignite.GetOrCreateCache<TKey, TValue>(
>>>>>>>>>>     new CacheConfiguration(cacheName), nearCacheCfg);
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> On Wed, Feb 5, 2020 at 10:02 AM Pavel Tupitsyn <
>>>>>>>>>> ptupitsyn@apache.org> wrote:
>>>>>>>>>>
>>>>>>>>>>> Hi Eduard,
>>>>>>>>>>>
>>>>>>>>>>> Do you have any client nodes
>>>>>>>>>>> (IgniteConfiguration.ClientMode=true), or just servers?
>>>>>>>>>>>
>>>>>>>>>>> Is the following line executed on an Ignite server node?
>>>>>>>>>>> _ignite.GetOrCreateNearCache

Re: Possible memory leak when using a near cache in Ignite.NET?

Posted by Pavel Tupitsyn <pt...@apache.org>.
Looks like you hit this bug:
https://issues.apache.org/jira/browse/IGNITE-9638

Near Cache is not related; it just increases memory usage, so the leak
becomes a problem sooner.

The bug is fixed in the upcoming Ignite 2.8.
Pre-release NuGet packages are available on nuget.org:
https://www.nuget.org/packages/Apache.Ignite/2.8.0-alpha20200122
Can you please try the latest pre-release package and see if it fixes the
issue?
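
(A pre-release package like that can typically be pulled in with a command
along the lines of `dotnet add package Apache.Ignite --version
2.8.0-alpha20200122`, assuming the project accepts pre-release NuGet
versions.)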



Re: Possible memory leak when using a near cache in Ignite.NET?

Posted by Eduard Llull <ed...@llull.net>.
Sorry guys I forgot to attach the thread dump.




Re: Possible memory leak when using a near cache in Ignite.NET?

Posted by Eduard Llull <ed...@llull.net>.
Hi,

I've seen something I can't explain, but it might account for the
increase in memory consumption.

Last night I left a jconsole connected to one of the servers, and this
morning I found this on the Threads graph:
[image: image.png]

Which correlates with the increase of memory:
[image: image.png]

I've done a thread dump to see what those threads are: 629 of them have a
name in the format "Thread-nn", and the stack trace of those threads is
empty. The only information in the thread dump is:

> "Thread-46" #193 prio=5 os_prio=0 tid=0x0000000003222000 nid=0x8e4
> runnable [0x0000000000000000]
>   java.lang.Thread.State: RUNNABLE
>
> "Thread-45" #192 prio=5 os_prio=0 tid=0x00007f78a801c000 nid=0x8ba
> runnable [0x0000000000000000]
>   java.lang.Thread.State: RUNNABLE
>
> "Thread-44" #184 prio=5 os_prio=0 tid=0x00007f73d442a800 nid=0x14f
> runnable [0x0000000000000000]
>   java.lang.Thread.State: RUNNABLE
>
> "Thread-43" #183 prio=5 os_prio=0 tid=0x00007f73c80f3000 nid=0xffcb
> runnable [0x0000000000000000]
>   java.lang.Thread.State: RUNNABLE
>

I connected VisualVM to the same server and I can see that more and more
of these threads are created as time passes:
[image: image.png]
[image: image.png]
Although the thread state of these threads is Running, only a few seem to
be executing anything. I say so because, sampling the CPU for 2 minutes,
only a few of these threads did any work:
[image: image.png]
A similar thing can be seen with the Memory sampler: just a few of these
"Thread-nn" threads are currently allocating memory:
[image: image.png]

But the weird thing is that the operating system (Ubuntu Linux, BTW)
reports only 153 threads:
$ ls /proc/60528/task | wc -l
153
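
As a cross-check, the OS-level thread count can also be read from inside the
.NET process; a trivial sketch (this counts native threads, like
/proc/<pid>/task does, not the JVM's java.lang.Thread objects):

using System;
using System.Diagnostics;

// Sketch: print the OS-level thread count of the current process. This
// should match /proc/<pid>/task, not the count seen in the JVM thread dump.
internal static class OsThreadCount
{
    private static void Main()
    {
        Console.WriteLine(Process.GetCurrentProcess().Threads.Count);
    }
}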




Re: Possible memory leak when using a near cache in Ignite.NET?

Posted by Eduard Llull <ed...@llull.net>.
Limiting the MaxSize to 10 elements made a difference: the application
stabilized at 2600MB.

But there is something weird with the CurrentMemorySize reported by the
near cache through JMX. Currently it is showing a negative number:
[image: image.png]

Today I will add more memory to one of the servers, and on another I'll
raise the MaxSize and MaxMemorySize gradually and track the change in
memory consumption after every change.



Re: Possible memory leak when using a near cache in Ignite.NET?

Posted by Pavel Tupitsyn <pt...@apache.org>.
> do you mean a JVM heap dump
yes

> could the memory usage come from some .NET Core unmanaged code
Very unlikely. Near Cache is a Java-only feature (a .NET-native version is
in the works); it does not cause anything extra to happen in .NET beyond
passing the configuration initially.


Re: Possible memory leak when using a near cache in Ignite.NET?

Posted by Eduard Llull <ed...@llull.net>.
I just deployed a modified version of the application with the near cache
with MaxSize = 10 and MaxMemorySize = 10000000, as you suggested previously.
Now I'll have to wait to see if there is any change in the memory evolution.

Regarding the heap dump you mention in the "further steps", do you mean a
JVM heap dump?

An idea that is floating in my mind: could the memory usage come from some
.NET Core unmanaged code?
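
For reference, a minimal sketch of the reduced configuration just described
(same shape as the original snippet, only the limits changed):

var nearCacheCfg = new NearCacheConfiguration
{
    // Temporarily cap the near cache at 10 entries / ~10MB for the test.
    EvictionPolicy = new LruEvictionPolicy
    {
        MaxSize = 10,
        MaxMemorySize = 10000000
    }
};
return _ignite.GetOrCreateNearCache<TKey, TValue>(cacheName, nearCacheCfg);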



Re: Possible memory leak when using a near cache in Ignite.NET?

Posted by Pavel Tupitsyn <pt...@apache.org>.
Near Cache stores data on the JVM heap.
Unmanaged ("offheap") memory and .NET Core heap should not be affected, and
that is what we see on the graphs.

Now we need to understand whether there is a memory leak on the JVM heap,
or we are simply running out of space there.
You have a 500MB limit for the near cache, but this limit counts only raw
data size and does not account for per-entry overhead.
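
(For scale, using the numbers from earlier in the thread: ~2300 entries
occupying ~110MB works out to roughly 48KB per entry on average, so a near
cache filled to MaxSize=5000 would hold around 240MB of raw data before any
per-entry overhead, against a 1GB JVM heap.)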

So further steps are:
- Either make the limit smaller or increase the JVM heap, and see if memory
usage stabilizes at some point
- If the above does not work, analyze a heap dump to understand what causes
the memory consumption

Keep us posted, and thanks for the detailed reply.


Re: Possible memory leak when using a near cache in Ignite.NET?

Posted by Eduard Llull <ed...@llull.net>.
I will try to change the MaxSize to 10 on just one of the servers, because
it will have an impact on its response times. I'll send another email when
I have some data after the change, but it will take a few hours to see the
memory evolution.

I have enabled remote JMX on the server I'm using to debug the issue and I
have graphs since the last time the application was started. These are
graphs from the old jconsole but I think they will be good enough for you.

Any difference you spot between 12:45 and 13:00 is because I removed this
particular server from the load balancer and took a memory dump using
dotnet-dump, but I'm not able to find anything in the dump. In fact, with
`dotnet-counters monitor` the .NET Core heap dances around 300MB.

This is the heap:
[image: image.png]
And this is the non-heap:
[image: image.png]

I reckon that the application might need a bigger heap, as the garbage
collector is executing quite often. But that's not the problem I'm trying to
fix right now.

Just for reference, this is the memory usage of the server where that
application runs.
[image: image.png]
And this is the working set in bytes reported by the .NET Core application
where the client node runs (the time in this graph is in UTC while the
previous ones are in the +1 time zone). If we compare this graph with the
previous one, most of the memory usage on the server comes from this
application.
[image: image.png]

For completeness, this is the evolution of the .NET Core heap size (the
time in this graph is also in UTC while the previous ones are in the +1
time zone):
[image: image.png]

So, at the moment of writing, the client node application has a JVM heap
limited to 1024MB, the JVM non-heap memory is currently at 66MB, and the
.NET Core heap dances around 300MB. I have no clue what is causing the
steady increase in memory usage.
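
(Adding those figures up: 1024MB of JVM heap plus ~66MB of JVM non-heap plus
~300MB of .NET Core heap is roughly 1.4GB, noticeably less than the working
set the graphs show, which is what makes the growth hard to attribute.)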


Thanks for your support.



Re: Possible memory leak when using a near cache in Ignite.NET?

Posted by Pavel Tupitsyn <pt...@apache.org>.
What if you reduce MaxSize to some small number, like 10? Does it solve the
problem?
Can you please run jvisualvm and see what happens with the JVM heap?
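
If attaching jvisualvm to the embedded JVM is inconvenient, the JVM heap
figures can also be read through Ignite's own node metrics. A minimal
sketch, assuming the standard Ignite.NET cluster-metrics API:

using System;
using Apache.Ignite.Core;

IIgnite ignite = Ignition.GetIgnite();
// Local node metrics expose the embedded JVM's heap usage.
var metrics = ignite.GetCluster().GetLocalNode().GetMetrics();
Console.WriteLine($"JVM heap: {metrics.HeapMemoryUsed / (1024 * 1024)}MB used" +
                  $" of {metrics.HeapMemoryMaximum / (1024 * 1024)}MB max");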


Re: Possible memory leak when using a near cache in Ignite.NET?

Posted by Eduard Llull <ed...@llull.net>.
Hi Pavel,

We have six servers, but these don't have any issue, and 40 client nodes
(the Ignite node is started with IgniteConfiguration.ClientMode = true).

The 40 client nodes are the ones where we are having the memory issue.

The _ignite.GetOrCreateNearCache is executed on the client nodes. We also
tried using the following code, but the memory issue was the same:
var nearCacheCfg = new NearCacheConfiguration
{
    // Use LRU eviction policy to automatically evict entries whenever
    // the near cache reaches 5000 entries in size.
    EvictionPolicy = new LruEvictionPolicy
    {
        MaxSize = 5000, // 5000 elements
        MaxMemorySize = 500000000
    }
};
return _ignite.GetOrCreateCache<TKey, TValue>(
    new CacheConfiguration(cacheName), nearCacheCfg);
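
For completeness, a minimal sketch (assuming the standard Ignition and
IgniteConfiguration API; the JVM options shown earlier in the thread are
deployment-specific and omitted here) of how such a client-mode node is
started:

using Apache.Ignite.Core;

var cfg = new IgniteConfiguration
{
    // Thick client: joins the cluster but holds no primary/backup partitions.
    ClientMode = true
    // JVM tuning would go in cfg.JvmOptions; values are deployment-specific.
};
IIgnite ignite = Ignition.Start(cfg);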



Re: Possible memory leak when using a near cache in Ignite.NET?

Posted by Pavel Tupitsyn <pt...@apache.org>.
Hi Eduard,

Do you have any client nodes (IgniteConfiguration.ClientMode=true), or just
servers?

Is the following line executed on an Ignite server node?
_ignite.GetOrCreateNearCache
