Posted to user@ignite.apache.org by Mahesh Renduchintala <ma...@aline-consulting.com> on 2019/07/29 13:35:47 UTC

Ignite 2.7.0: Ignite client:: memory leak

Hi,


We suspected a memory leak in our cluster, originating from the Ignite thick clients.

We performed the following test:


A simple program (referred to as program A below, sketched just after the list) does the following:


a) connects an Ignite thick client (Xms/Xmx about 1 GB) to the cluster

b) wakes up every 30 seconds and goes back to sleep, in an infinite loop
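For reference, program A is roughly the following (a minimal sketch only; the class name and config path are placeholders, the actual configuration is in the attached files):

import org.apache.ignite.Ignite;
import org.apache.ignite.Ignition;

public class ProgramA {
    public static void main(String[] args) throws Exception {
        // Thick (client-mode) node; the JVM is started with roughly -Xms1g -Xmx1g.
        Ignition.setClientMode(true);
        try (Ignite ignite = Ignition.start("client-config.xml")) {  // placeholder path
            // No work at all: wake up every 30 seconds, then go back to sleep.
            while (true)
                Thread.sleep(30_000);
        }
    }
}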


The cluster also has plenty of other thick clients connecting/disconnecting all the time.


Program A shows an out-of-memory error after 2 hours.

Attached are the logs.

Attached are config files.


regards

Mahesh



Re: Ignite 2.7.0: Ignite client:: memory leak

Posted by Mahesh Renduchintala <ma...@aline-consulting.com>.
We observed the same memory leak issue on 2.7.6 as well. Continuously deleting tables from one thick client causes out-of-memory exceptions in the other thick clients. Can you please recheck? We badly need a workaround for this issue.

This is probably related to another issue discussed in another thread:
http://apache-ignite-users.70518.x6.nabble.com/Nodes-are-restarting-when-i-try-to-drop-a-table-created-with-persistence-enabled-td27897.html


Re: Ignite 2.7.0: Ignite client:: memory leak

Posted by Mahesh Renduchintala <ma...@aline-consulting.com>.
Denis, thanks for the update. We will wait for the release.

Re: Ignite 2.7.0: Ignite client:: memory leak

Posted by Denis Magda <dm...@apache.org>.
As an update, the fix for this issue has been included in the Ignite 2.7.6 release
scheduled for this month.

-
Denis


On Tue, Aug 6, 2019 at 12:37 PM Ilya Kasnacheev <il...@gmail.com>
wrote:

> Hello!
>
> We did not change it, since it should no longer be a problem. The
> rationale is that I can't prove that any particular history size value
> would work without causing further issues.
>
> Regards,
> --
> Ilya Kasnacheev
>
>
> On Tue, Aug 6, 2019 at 22:25, Denis Magda <dm...@apache.org> wrote:
>
>> Andrey, Ilya, as part of IGNITE-11767, have we set the history size to 0
>> for the clients? If we haven't, what is the rationale?
>>
>> -
>> Denis
>>
>>
>> On Tue, Aug 6, 2019 at 5:56 AM Andrei Aleksandrov <
>> aealexsandrov@gmail.com> wrote:
>>
>>> Hi Mahesh,
>>>
>>> Yes, this is a problem related to IGNITE_EXCHANGE_HISTORY_SIZE. Ignite
>>> stores the data for the last 1000 exchanges.
>>>
>>> This history is generally required for the case when the coordinator
>>> changes and the new coordinator needs to load the recent exchange history.
>>>
>>> There are two problems here:
>>>
>>> 1) Client nodes can never become the coordinator, so there is no reason to
>>> store 1000 entries on them. It would be better to set this option to some
>>> small value, or zero, for client nodes.
>>> 2) Server nodes also don't require 1000 entries. The required exchange
>>> history size depends on the number of server nodes, so I suggest changing
>>> the default to a smaller value.
>>>
>>> Here is the ticket related to this problem:
>>>
>>> https://issues.apache.org/jira/browse/IGNITE-11767
>>>
>>> It is fixed and should be available in Ignite 2.8, where these exchanges
>>> will take less memory.
>>>
>>> BR,
>>> Andrei
>>>
>>> On 2019/08/03 01:09:19, Mahesh Renduchintala <m....@aline-consulting.com>
>>>
>>> wrote:
>>>  > The clients we use have memory ranging from 4GB to 8GB. OOM was
>>>  > produced on all these clients... some sooner, some a little later, but
>>>  > it was always seen.
>>>  >
>>>  > The workaround is still stable for more than 48 hours now.
>>>  >
>>>  >
>>>
>>

Re: Ignite 2.7.0: Ignite client:: memory leak

Posted by Ilya Kasnacheev <il...@gmail.com>.
Hello!

We did not change it, since it should no longer be a problem. The
rationale is that I can't prove that any particular history size value
would work without causing further issues.

Regards,
-- 
Ilya Kasnacheev


On Tue, Aug 6, 2019 at 22:25, Denis Magda <dm...@apache.org> wrote:

> Andrey, Ilya, as part of IGNITE-11767, have we set the history size to 0
> for the clients? If we haven't, what is the rationale?
>
> -
> Denis
>
>
> On Tue, Aug 6, 2019 at 5:56 AM Andrei Aleksandrov <ae...@gmail.com>
> wrote:
>
>> Hi Mahesh,
>>
>> Yes, this is a problem related to IGNITE_EXCHANGE_HISTORY_SIZE. Ignite
>> stores the data for the last 1000 exchanges.
>>
>> This history is generally required for the case when the coordinator
>> changes and the new coordinator needs to load the recent exchange history.
>>
>> There are two problems here:
>>
>> 1) Client nodes can never become the coordinator, so there is no reason to
>> store 1000 entries on them. It would be better to set this option to some
>> small value, or zero, for client nodes.
>> 2) Server nodes also don't require 1000 entries. The required exchange
>> history size depends on the number of server nodes, so I suggest changing
>> the default to a smaller value.
>>
>> Here is the ticket related to this problem:
>>
>> https://issues.apache.org/jira/browse/IGNITE-11767
>>
>> It is fixed and should be available in Ignite 2.8, where these exchanges
>> will take less memory.
>>
>> BR,
>> Andrei
>>
>> On 2019/08/03 01:09:19, Mahesh Renduchintala <m....@aline-consulting.com>
>> wrote:
>>  > The clients we use have memory ranging from 4GB to 8GB. OOM was
>>  > produced on all these clients... some sooner, some a little later, but
>>  > it was always seen.
>>  >
>>  > The workaround is still stable for more than 48 hours now.
>>  >
>>  >
>>
>

Re: Ignite 2.7.0: Ignite client:: memory leak

Posted by Denis Magda <dm...@apache.org>.
Andrey, Ilya, as part of IGNITE-11767, have we set the history size to 0
for the clients? If we haven't, what is the rationale?

-
Denis


On Tue, Aug 6, 2019 at 5:56 AM Andrei Aleksandrov <ae...@gmail.com>
wrote:

> Hi Mahesh,
>
> Yes, this is a problem related to IGNITE_EXCHANGE_HISTORY_SIZE. Ignite
> stores the data for the last 1000 exchanges.
>
> This history is generally required for the case when the coordinator
> changes and the new coordinator needs to load the recent exchange history.
>
> There are two problems here:
>
> 1) Client nodes can never become the coordinator, so there is no reason to
> store 1000 entries on them. It would be better to set this option to some
> small value, or zero, for client nodes.
> 2) Server nodes also don't require 1000 entries. The required exchange
> history size depends on the number of server nodes, so I suggest changing
> the default to a smaller value.
>
> Here is the ticket related to this problem:
>
> https://issues.apache.org/jira/browse/IGNITE-11767
>
> It is fixed and should be available in Ignite 2.8, where these exchanges
> will take less memory.
>
> BR,
> Andrei
>
> On 2019/08/03 01:09:19, Mahesh Renduchintala <m....@aline-consulting.com>
> wrote:
>  > The clients we use have memory ranging from 4GB to 8GB. OOM was
>  > produced on all these clients... some sooner, some a little later, but
>  > it was always seen.
>  >
>  > The workaround is still stable for more than 48 hours now.
>  >
>  >
>

Re: Ignite 2.7.0: Ignite client:: memory leak

Posted by Andrei Aleksandrov <ae...@gmail.com>.
Hi Mahesh,

Yes, this is a problem related to IGNITE_EXCHANGE_HISTORY_SIZE. Ignite
stores the data for the last 1000 exchanges.

This history is generally required for the case when the coordinator
changes and the new coordinator needs to load the recent exchange history.

There are two problems here:

1) Client nodes can never become the coordinator, so there is no reason to
store 1000 entries on them. It would be better to set this option to some
small value, or zero, for client nodes (a sketch of this follows below).
2) Server nodes also don't require 1000 entries. The required exchange
history size depends on the number of server nodes, so I suggest changing
the default to a smaller value.
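For example, a client node could be started with a much smaller history. A minimal sketch, assuming the property is set before the node starts and using an illustrative config path (whether zero is entirely safe in practice is a separate question):

// Shrink the exchange history on a client node.
// Equivalent to passing -DIGNITE_EXCHANGE_HISTORY_SIZE=0 on the JVM command line.
System.setProperty("IGNITE_EXCHANGE_HISTORY_SIZE", "0");

Ignition.setClientMode(true);
Ignite client = Ignition.start("client-config.xml");   // illustrative path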

Here is the ticket related to this problem:

https://issues.apache.org/jira/browse/IGNITE-11767

It is fixed and should be available in Ignite 2.8, where these exchanges
will take less memory.

BR,
Andrei

On 2019/08/03 01:09:19, Mahesh Renduchintala <m....@aline-consulting.com> 
wrote:
 > The clients we use have memory ranging from 4GB to 8GB. OOM was
 > produced on all these clients... some sooner, some a little later, but
 > it was always seen.
 >
 > The workaround is still stable for more than 48 hours now.
 >
 >

Re: Ignite 2.7.0: Ignite client:: memory leak

Posted by Mahesh Renduchintala <ma...@aline-consulting.com>.
The clients we use have memory ranging from 4GB to 8GB. OOM was produced on all these clients... some sooner, some a little later, but it was always seen.

The workaround is still stable for more than 48 hours now.


Re: Ignite 2.7.0: Ignite client:: memory leak

Posted by Denis Magda <dm...@apache.org>.
Mahesh,

Hmm, yes, I remember one use case where an OOM was addressed by tweaking
IGNITE_EXCHANGE_HISTORY_SIZE. In that scenario, the environment did not
have much RAM and the team couldn't allocate enough of it to the Java heap.

What about your scenario? Have you tried giving 2GB or more to the app?

-
Denis


On Thu, Aug 1, 2019 at 8:16 PM Mahesh Renduchintala <
mahesh.renduchintala@aline-consulting.com> wrote:

> Denis,
> Thanks. Meanwhile, we made some progress... the workaround seems to be to
> use these two flags in the thick client config:
> -DIGNITE_SKIP_PARTITION_SIZE_VALIDATION -DIGNITE_EXCHANGE_HISTORY_SIZE=8
> So far, we haven't seen clients going OOM for about 24 hours (still
> watching).
> Based on the logs, you can see that a partition map is exchanged among
> the clients, and it feels like these messages continuously get stored on
> the clients, causing this problem.
>
>

Re: Ignite 2.7.0: Ignite client:: memory leak

Posted by Mahesh Renduchintala <ma...@aline-consulting.com>.
Denis,
Thanks. Meanwhile, we made some progress... the workaround seems to be to use these two flags in the thick client config (see the sketch below):
-DIGNITE_SKIP_PARTITION_SIZE_VALIDATION -DIGNITE_EXCHANGE_HISTORY_SIZE=8
So far, we haven't seen clients going OOM for about 24 hours (still watching).
Based on the logs, you can see that a partition map is exchanged among the clients, and it feels like these messages continuously get stored on the clients, causing this problem.
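For completeness, the same two flags can also be set from code before the client node starts (a sketch only; we actually pass them as the -D JVM options above, and the explicit "true" value and the config path are assumptions):

// Programmatic equivalent of the two JVM flags above, run before Ignition.start().
System.setProperty("IGNITE_EXCHANGE_HISTORY_SIZE", "8");
System.setProperty("IGNITE_SKIP_PARTITION_SIZE_VALIDATION", "true");   // explicit boolean value assumed

Ignition.setClientMode(true);
Ignite client = Ignition.start("thick-client-config.xml");   // illustrative path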


Re: Ignite 2.7.0: Ignite client:: memory leak

Posted by Denis Magda <dm...@apache.org>.
Mahesh,

Start all the nodes with the following parameters:
https://apacheignite.readme.io/docs/jvm-and-system-tuning#section-getting-heap-dump-on-out-of-memory-errors

The JVM will create a heap dump on failure, and you'll be able to see the
root cause of the leak. If the root cause is in Ignite, then please file a
BLOCKER ticket.
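For reference, the standard HotSpot options for this are the ones below, added to each node's JVM arguments next to the existing -Xms/-Xmx settings (the dump path is only an example):

-XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=/path/to/dumps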


-
Denis


On Tue, Jul 30, 2019 at 7:45 AM Mahesh Renduchintala <
mahesh.renduchintala@aline-consulting.com> wrote:

> In fact, in the logs you can see that whenever the print below comes up,
> the memory jumps up by 100-200MB:
>
> >>Full map updating for 873 groups performed in 16 ms
>
>
> Metrics for local node (to disable set 'metricsLogFrequency' to 0)
>     ^-- Node [id=4c8b23b4, uptime=00:19:13.959]
>     ^-- H/N/C [hosts=8, nodes=10, CPUs=48]
>     ^-- CPU [cur=8%, avg=7.55%, GC=0%]
>     ^-- PageMemory [pages=0]
>     ^-- Heap [used=1307MB, free=35.9%, comm=2039MB]
>     ^-- Off-heap [used=0MB, free=-1%, comm=0MB]
>     ^-- Outbound messages queue [size=0]
>     ^-- Public thread pool [active=0, idle=4, qSize=0]
>     ^-- System thread pool [active=0, idle=2, qSize=0]
> 2019-07-30 14:11:48.485  INFO 26 --- [       sys-#167]
> .c.d.d.p.GridDhtPartitionsExchangeFuture : Received full message, will
> finish exchange [node=9e2951dc-e8ad-44e6-9495-83b0e5337511,
> resVer=AffinityTopologyVersion [topVer=1361, minorTopVer=0]]
> 2019-07-30 14:11:48.485  INFO 26 --- [       sys-#167]
> .c.d.d.p.GridDhtPartitionsExchangeFuture : Received full message, need
> merge [curFut=AffinityTopologyVersion [topVer=1357, minorTopVer=0],
> resVer=AffinityTopologyVersion [topVer=1361, minorTopVer=0]]
> 2019-07-30 14:11:48.485  INFO 26 --- [       sys-#167]
> .i.p.c.GridCachePartitionExchangeManager : Merge exchange future on finish
> [curFut=AffinityTopologyVersion [topVer=1357, minorTopVer=0],
> mergedFut=AffinityTopologyVersion [topVer=1358, minorTopVer=0],
> evt=NODE_JOINED, evtNode=864571bd-7235-4fe0-9e52-f3a78f35dbb2,
> evtNodeClient=false]
> 2019-07-30 14:11:48.485  INFO 26 --- [       sys-#167]
> .i.p.c.GridCachePartitionExchangeManager : Merge exchange future on finish
> [curFut=AffinityTopologyVersion [topVer=1357, minorTopVer=0],
> mergedFut=AffinityTopologyVersion [topVer=1359, minorTopVer=0],
> evt=NODE_FAILED, evtNode=20eef25d-b7ec-4340-9da8-1a5a35678ba5,
> evtNodeClient=false]
> 2019-07-30 14:11:48.485  INFO 26 --- [       sys-#167]
> .i.p.c.GridCachePartitionExchangeManager : Merge exchange future on finish
> [curFut=AffinityTopologyVersion [topVer=1357, minorTopVer=0],
> mergedFut=AffinityTopologyVersion [topVer=1360, minorTopVer=0],
> evt=NODE_JOINED, evtNode=9c318eb2-dd21-457c-8d1f-e6d4677e1a55,
> evtNodeClient=true]
> 2019-07-30 14:11:48.486  INFO 26 --- [       sys-#167]
> .i.p.c.GridCachePartitionExchangeManager : Merge exchange future on finish
> [curFut=AffinityTopologyVersion [topVer=1357, minorTopVer=0],
> mergedFut=AffinityTopologyVersion [topVer=1361, minorTopVer=0],
> evt=NODE_FAILED, evtNode=864571bd-7235-4fe0-9e52-f3a78f35dbb2,
> evtNodeClient=false]
> 2019-07-30 14:11:48.861  INFO 26 --- [       sys-#167]
> o.a.i.i.p.c.CacheAffinitySharedManager   : Affinity applying from full
> message performed in 375 ms.
> 2019-07-30 14:11:48.864  INFO 26 --- [       sys-#167]
> .c.d.d.p.GridDhtPartitionsExchangeFuture : Affinity changes applied in 379
> ms.
> 2019-07-30 14:11:48.880  INFO 26 --- [       sys-#167]
> .c.d.d.p.GridDhtPartitionsExchangeFuture : Full map updating for 873 groups
> performed in 16 ms.
> 2019-07-30 14:11:48.880  INFO 26 --- [       sys-#167]
> .c.d.d.p.GridDhtPartitionsExchangeFuture : Finish exchange future
> [startVer=AffinityTopologyVersion [topVer=1357, minorTopVer=0],
> resVer=AffinityTopologyVersion [topVer=1361, minorTopVer=0], err=null]
> 2019-07-30 14:11:48.927  INFO 26 --- [       sys-#167]
> .c.d.d.p.GridDhtPartitionsExchangeFuture : Detecting lost partitions
> performed in 47 ms.
> 2019-07-30 14:11:49.280  INFO 26 --- [       sys-#167]
> .c.d.d.p.GridDhtPartitionsExchangeFuture : Completed partition exchange
> [localNode=4c8b23b4-ce12-4dbb-a7ea-9279711f4008,
> exchange=GridDhtPartitionsExchangeFuture [topVer=AffinityTopologyVersion
> [topVer=1357, minorTopVer=0], evt=NODE_JOINED, evtNode=TcpDiscoveryNode
> [id=20eef25d-b7ec-4340-9da8-1a5a35678ba5, addrs=[0:0:0:0:0:0:0:1%lo,
> 127.0.0.1, 192.168.1.139, 192.168.1.181], sockAddrs=[/192.168.1.181:47500,
> /0:0:0:0:0:0:0:1%lo:47500, /127.0.0.1:47500, /192.168.1.139:47500],
> discPort=47500, order=1357, intOrder=696, lastExchangeTime=1564495322589,
> loc=false, ver=2.7.0#20181130-sha1:256ae401, isClient=false], done=true],
> topVer=AffinityTopologyVersion [topVer=1361, minorTopVer=0],
> durationFromInit=4411]
> 2019-07-30 14:11:49.289  INFO 26 --- [ange-worker-#43]
> .i.p.c.GridCachePartitionExchangeManager : Skipping rebalancing (no
> affinity changes) [top=AffinityTopologyVersion [topVer=1361,
> minorTopVer=0], rebTopVer=AffinityTopologyVersion [topVer=-1,
> minorTopVer=0], evt=NODE_JOINED,
> evtNode=20eef25d-b7ec-4340-9da8-1a5a35678ba5, client=true]
> 2019-07-30 14:11:50.127  INFO 26 --- [eout-worker-#23]
> org.apache.ignite.internal.IgniteKernal  :
> Metrics for local node (to disable set 'metricsLogFrequency' to 0)
>     ^-- Node [id=4c8b23b4, uptime=00:19:16.964]
>     ^-- H/N/C [hosts=8, nodes=10, CPUs=48]
>     ^-- CPU [cur=50.33%, avg=7.59%, GC=22.8%]
>     ^-- PageMemory [pages=0]
>     ^-- Heap [used=1537MB, free=24.64%, comm=2039MB]
>     ^-- Off-heap [used=0MB, free=-1%, comm=0MB]
>     ^-- Outbound messages queue [size=0]
>     ^-- Public thread pool [active=0, idle=4, qSize=0]
>
>

Re: Ignite 2.7.0: Ignite client:: memory leak

Posted by Mahesh Renduchintala <ma...@aline-consulting.com>.
In fact, in the logs you can see that whenever the print below comes up, the memory jumps up by 100-200MB:

>>Full map updating for 873 groups performed in 16 ms


Metrics for local node (to disable set 'metricsLogFrequency' to 0)
    ^-- Node [id=4c8b23b4, uptime=00:19:13.959]
    ^-- H/N/C [hosts=8, nodes=10, CPUs=48]
    ^-- CPU [cur=8%, avg=7.55%, GC=0%]
    ^-- PageMemory [pages=0]
    ^-- Heap [used=1307MB, free=35.9%, comm=2039MB]
    ^-- Off-heap [used=0MB, free=-1%, comm=0MB]
    ^-- Outbound messages queue [size=0]
    ^-- Public thread pool [active=0, idle=4, qSize=0]
    ^-- System thread pool [active=0, idle=2, qSize=0]
2019-07-30 14:11:48.485  INFO 26 --- [       sys-#167] .c.d.d.p.GridDhtPartitionsExchangeFuture : Received full message, will finish exchange [node=9e2951dc-e8ad-44e6-9495-83b0e5337511, resVer=AffinityTopologyVersion [topVer=1361, minorTopVer=0]]
2019-07-30 14:11:48.485  INFO 26 --- [       sys-#167] .c.d.d.p.GridDhtPartitionsExchangeFuture : Received full message, need merge [curFut=AffinityTopologyVersion [topVer=1357, minorTopVer=0], resVer=AffinityTopologyVersion [topVer=1361, minorTopVer=0]]
2019-07-30 14:11:48.485  INFO 26 --- [       sys-#167] .i.p.c.GridCachePartitionExchangeManager : Merge exchange future on finish [curFut=AffinityTopologyVersion [topVer=1357, minorTopVer=0], mergedFut=AffinityTopologyVersion [topVer=1358, minorTopVer=0], evt=NODE_JOINED, evtNode=864571bd-7235-4fe0-9e52-f3a78f35dbb2, evtNodeClient=false]
2019-07-30 14:11:48.485  INFO 26 --- [       sys-#167] .i.p.c.GridCachePartitionExchangeManager : Merge exchange future on finish [curFut=AffinityTopologyVersion [topVer=1357, minorTopVer=0], mergedFut=AffinityTopologyVersion [topVer=1359, minorTopVer=0], evt=NODE_FAILED, evtNode=20eef25d-b7ec-4340-9da8-1a5a35678ba5, evtNodeClient=false]
2019-07-30 14:11:48.485  INFO 26 --- [       sys-#167] .i.p.c.GridCachePartitionExchangeManager : Merge exchange future on finish [curFut=AffinityTopologyVersion [topVer=1357, minorTopVer=0], mergedFut=AffinityTopologyVersion [topVer=1360, minorTopVer=0], evt=NODE_JOINED, evtNode=9c318eb2-dd21-457c-8d1f-e6d4677e1a55, evtNodeClient=true]
2019-07-30 14:11:48.486  INFO 26 --- [       sys-#167] .i.p.c.GridCachePartitionExchangeManager : Merge exchange future on finish [curFut=AffinityTopologyVersion [topVer=1357, minorTopVer=0], mergedFut=AffinityTopologyVersion [topVer=1361, minorTopVer=0], evt=NODE_FAILED, evtNode=864571bd-7235-4fe0-9e52-f3a78f35dbb2, evtNodeClient=false]
2019-07-30 14:11:48.861  INFO 26 --- [       sys-#167] o.a.i.i.p.c.CacheAffinitySharedManager   : Affinity applying from full message performed in 375 ms.
2019-07-30 14:11:48.864  INFO 26 --- [       sys-#167] .c.d.d.p.GridDhtPartitionsExchangeFuture : Affinity changes applied in 379 ms.
2019-07-30 14:11:48.880  INFO 26 --- [       sys-#167] .c.d.d.p.GridDhtPartitionsExchangeFuture : Full map updating for 873 groups performed in 16 ms.
2019-07-30 14:11:48.880  INFO 26 --- [       sys-#167] .c.d.d.p.GridDhtPartitionsExchangeFuture : Finish exchange future [startVer=AffinityTopologyVersion [topVer=1357, minorTopVer=0], resVer=AffinityTopologyVersion [topVer=1361, minorTopVer=0], err=null]
2019-07-30 14:11:48.927  INFO 26 --- [       sys-#167] .c.d.d.p.GridDhtPartitionsExchangeFuture : Detecting lost partitions performed in 47 ms.
2019-07-30 14:11:49.280  INFO 26 --- [       sys-#167] .c.d.d.p.GridDhtPartitionsExchangeFuture : Completed partition exchange [localNode=4c8b23b4-ce12-4dbb-a7ea-9279711f4008, exchange=GridDhtPartitionsExchangeFuture [topVer=AffinityTopologyVersion [topVer=1357, minorTopVer=0], evt=NODE_JOINED, evtNode=TcpDiscoveryNode [id=20eef25d-b7ec-4340-9da8-1a5a35678ba5, addrs=[0:0:0:0:0:0:0:1%lo, 127.0.0.1, 192.168.1.139, 192.168.1.181], sockAddrs=[/192.168.1.181:47500, /0:0:0:0:0:0:0:1%lo:47500, /127.0.0.1:47500, /192.168.1.139:47500], discPort=47500, order=1357, intOrder=696, lastExchangeTime=1564495322589, loc=false, ver=2.7.0#20181130-sha1:256ae401, isClient=false], done=true], topVer=AffinityTopologyVersion [topVer=1361, minorTopVer=0], durationFromInit=4411]
2019-07-30 14:11:49.289  INFO 26 --- [ange-worker-#43] .i.p.c.GridCachePartitionExchangeManager : Skipping rebalancing (no affinity changes) [top=AffinityTopologyVersion [topVer=1361, minorTopVer=0], rebTopVer=AffinityTopologyVersion [topVer=-1, minorTopVer=0], evt=NODE_JOINED, evtNode=20eef25d-b7ec-4340-9da8-1a5a35678ba5, client=true]
2019-07-30 14:11:50.127  INFO 26 --- [eout-worker-#23] org.apache.ignite.internal.IgniteKernal  :
Metrics for local node (to disable set 'metricsLogFrequency' to 0)
    ^-- Node [id=4c8b23b4, uptime=00:19:16.964]
    ^-- H/N/C [hosts=8, nodes=10, CPUs=48]
    ^-- CPU [cur=50.33%, avg=7.59%, GC=22.8%]
    ^-- PageMemory [pages=0]
    ^-- Heap [used=1537MB, free=24.64%, comm=2039MB]
    ^-- Off-heap [used=0MB, free=-1%, comm=0MB]
    ^-- Outbound messages queue [size=0]
    ^-- Public thread pool [active=0, idle=4, qSize=0]