Posted to user@ignite.apache.org by ihalilaltun <ib...@segmentify.com> on 2019/11/21 14:08:01 UTC

excessive timeouts and load on new cache creations

Hi Igniters,

Every time a new cache is created dynamically we get an
excessive number of timeouts and a huge load on the grid nodes. Current
grid-node metrics are the following:

[2019-11-21T14:03:16,079][INFO ][grid-timeout-worker-#199][IgniteKernal] 
Metrics for local node (to disable set 'metricsLogFrequency' to 0)
    ^-- Node [id=5f730d51, uptime=54 days, 22:36:17.169]
    ^-- H/N/C [hosts=51, nodes=52, CPUs=290]
    ^-- CPU [cur=14.27%, avg=14.48%, GC=0%]
    ^-- PageMemory [pages=7115228]
    ^-- Heap [used=4278MB, free=47.77%, comm=8192MB]
    ^-- Off-heap [used=28119MB, free=2.94%, comm=28972MB]
    ^--   sysMemPlc region [used=0MB, free=99.99%, comm=100MB]
    ^--   default region [used=28118MB, free=1.93%, comm=28672MB]
    ^--   metastoreMemPlc region [used=1MB, free=98.89%, comm=100MB]
    ^--   TxLog region [used=0MB, free=100%, comm=100MB]
    ^-- Ignite persistence [used=36102MB]
    ^--   sysMemPlc region [used=0MB]
    ^--   default region [used=36102MB]
    ^--   metastoreMemPlc region [used=unknown]
    ^--   TxLog region [used=0MB]
    ^-- Outbound messages queue [size=0]
    ^-- Public thread pool [active=0, idle=0, qSize=0]
    ^-- System thread pool [active=0, idle=128, qSize=0]


Ignite logs:
ignite.zip
<http://apache-ignite-users.70518.x6.nabble.com/file/t2515/ignite.zip>
GC logs:
gc.zip <http://apache-ignite-users.70518.x6.nabble.com/file/t2515/gc.zip>
Ignite configuration XML:
ignite-config.xml
<http://apache-ignite-users.70518.x6.nabble.com/file/t2515/ignite-config.xml>


Any thoughts and suggestions for configuration optimization?

Regards



-----
İbrahim Halil Altun
Senior Software Engineer @ Segmentify
--
Sent from: http://apache-ignite-users.70518.x6.nabble.com/

Re: excessive timeouts and load on new cache creations

Posted by ihalilaltun <ib...@segmentify.com>.
Hi Anton,

We have faced the same bug on non-byte-array types also.
Here is the POJO we use:
UserMailInfo.java
<http://apache-ignite-users.70518.x6.nabble.com/file/t2515/UserMailInfo.java>  

I've already read the topic you shared, thanks :)




-----
İbrahim Halil Altun
Senior Software Engineer @ Segmentify
--
Sent from: http://apache-ignite-users.70518.x6.nabble.com/

Re: excessive timeouts and load on new cache creations

Posted by akurbanov <an...@gmail.com>.
Hello,

This bug is related to byte arrays in the first place; do you store them as
values in your caches?

And yes, PME waits for all pending cache operations to finish and makes
cache operations that were started after PME initialization wait until it
completes. Reducing the partition map sizes by using cache groups, as
suggested by Pavel, will definitely speed up this process, making the
application use less heap and send smaller messages over the network.
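
To illustrate the waiting behaviour, here is a toy, self-contained sketch
(not from this thread; the node setup and cache names are made up) showing
that puts issued while a dynamic cache creation is in progress may stall
until the exchange completes:

import org.apache.ignite.Ignite;
import org.apache.ignite.IgniteCache;
import org.apache.ignite.Ignition;
import org.apache.ignite.configuration.CacheConfiguration;

public class PmeStallDemo {
    public static void main(String[] args) throws Exception {
        // Single local node just for the demo; the real cluster has 50+ nodes.
        Ignite ignite = Ignition.start();

        IgniteCache<Integer, String> existing = ignite.getOrCreateCache("existingCache");

        // Background writer that measures per-put latency.
        Thread writer = new Thread(() -> {
            for (int i = 0; i < 100_000; i++) {
                long start = System.nanoTime();
                existing.put(i, "value-" + i);
                long ms = (System.nanoTime() - start) / 1_000_000;
                if (ms > 100)
                    System.out.println("put #" + i + " blocked for " + ms + " ms (likely waiting for an exchange)");
            }
        });
        writer.start();

        // Dynamic cache creation triggers a distributed PME; operations started
        // after the exchange begins wait for it, which shows up as latency spikes.
        ignite.getOrCreateCache(new CacheConfiguration<Integer, String>("dynamicallyCreatedCache"));

        writer.join();
        ignite.close();
    }
}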

You can find some more information on the PME topic in this thread:
http://apache-ignite-users.70518.x6.nabble.com/Partition-map-exchange-in-detail-td23970.html

Best regards,
Anton




--
Sent from: http://apache-ignite-users.70518.x6.nabble.com/

Re: excessive timeouts and load on new cache creations

Posted by ihalilaltun <ib...@segmentify.com>.
Hi Pavel,

Thanks for your reply and suggestions, but currently we cannot use cache
groups. As you know, there is a known bug for it ->
https://issues.apache.org/jira/browse/IGNITE-11953




-----
İbrahim Halil Altun
Senior Software Engineer @ Segmentify
--
Sent from: http://apache-ignite-users.70518.x6.nabble.com/

Re: excessive timeouts and load on new cache creations

Posted by Pavel Kovalenko <jo...@gmail.com>.
Hi Ibrahim,

I see you have 317 cache groups in your cluster (`Full map updating for 317
groups performed in 105 ms.`).
Each cache group has its own partition map and affinity map, which require
memory that resides in the old generation.
During cache creation, a distributed PME happens and all partition and
affinity maps are updated.
This results in huge memory consumption and leads to long GC pauses.

It's recommended to use as few cache groups as possible. If your caches
have the same affinity distribution, you can place them in one cache group.
It should help to reduce memory consumption.
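
For reference, a minimal sketch of what that could look like with the Java
API (the cache names are made up; the group name is borrowed from the log
below), assuming the caches really do share the same affinity settings:

import org.apache.ignite.Ignite;
import org.apache.ignite.Ignition;
import org.apache.ignite.configuration.CacheConfiguration;

public class CacheGroupExample {
    public static void main(String[] args) {
        Ignite ignite = Ignition.start();

        // Both caches share one cache group, so the cluster keeps a single
        // partition map and affinity map for the group instead of one per cache.
        CacheConfiguration<String, Object> keywordsCfg =
            new CacheConfiguration<String, Object>("acc_xxx_searchKeywords")
                .setGroupName("searchKeywordsCacheGroup")
                .setBackups(1);

        CacheConfiguration<String, Object> statsCfg =
            new CacheConfiguration<String, Object>("acc_xxx_searchStats")
                .setGroupName("searchKeywordsCacheGroup")
                .setBackups(1);

        ignite.getOrCreateCache(keywordsCfg);
        ignite.getOrCreateCache(statsCfg);
    }
}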



Thu, 21 Nov 2019 at 19:24, ihalilaltun <ib...@segmentify.com>:

> Hi Anton,
>
> Timeouts can be found in the logs that I shared:
>
> [query-#13207879][GridMapQueryExecutor] Failed to execute local query.
> org.apache.ignite.cache.query.QueryCancelledException: The query was
> cancelled while executing.
>
> Huge loads on the server nodes are monitored via the Zabbix agent:
> <http://apache-ignite-users.70518.x6.nabble.com/file/t2515/SS_2.png>
>
>
> Just after cache creation we cannot respond to requests; these metrics are
> monitored via Prometheus, here is the screenshot:
> <http://apache-ignite-users.70518.x6.nabble.com/file/t2515/SS.png>
>
> For some reason, timeouts occur after cache proxy initializations (cache
> creations).
>
>
>
> -----
> İbrahim Halil Altun
> Senior Software Engineer @ Segmentify
> --
> Sent from: http://apache-ignite-users.70518.x6.nabble.com/
>

Re: excessive timeouts and load on new cache creations

Posted by ihalilaltun <ib...@segmentify.com>.
Hi Anton,

Timeouts can be found in the logs that I shared:

[query-#13207879][GridMapQueryExecutor] Failed to execute local query.
org.apache.ignite.cache.query.QueryCancelledException: The query was
cancelled while executing.

Huge loads on the server nodes are monitored via the Zabbix agent:
<http://apache-ignite-users.70518.x6.nabble.com/file/t2515/SS_2.png> 


Just after cache creation we cannot respond to requests; these metrics are
monitored via Prometheus, here is the screenshot:
<http://apache-ignite-users.70518.x6.nabble.com/file/t2515/SS.png> 

For some reason, timeouts occur after cache proxy initializations (cache
creations).



-----
İbrahim Halil Altun
Senior Software Engineer @ Segmentify
--
Sent from: http://apache-ignite-users.70518.x6.nabble.com/

Re: excessive timeouts and load on new cache creations

Posted by akurbanov <an...@gmail.com>.
Hello, could you please provide samples of such timeouts and of the huge
load? How do you measure them?

A quick glance shows only one really large VM pause:

2019-11-21T10:28:30.062+0000: 4734099.138: [Full GC (Allocation Failure) 
8169M->3377M(8192M), 8.7213845 secs]
   [Eden: 0.0B(408.0M)->0.0B(408.0M) Survivors: 0.0B->0.0B Heap:
8169.4M(8192.0M)->3377.2M(8192.0M)], [Metaspace: 63255K->63244K(1112064K)]
 [Times: user=14.93 sys=0.00, real=8.72 secs] 

You can also take a look at a single cache start process that can be
extracted from the provided log file; the cache is
acc_a385248f4e6fa000_searchKeywords:

[2019-11-21T07:31:18,723][INFO ][exchange-worker-#219][time] Started
exchange init [topVer=AffinityTopologyVersion [topVer=2118, minorTopVer=13],
mvccCrd=MvccCoordinator [nodeId=5704fca2-4b03-4a5d-b6fb-ca1d8eda94d6,
crdVer=1569597849947, topVer=AffinityTopologyVersion [topVer=1237,
minorTopVer=0]], mvccCrdChange=false, crd=false, evt=DISCOVERY_CUSTOM_EVT,
evtNode=acabfdd3-24aa-4e0a-b6a6-1a9f592f6e39,
customEvt=DynamicCacheChangeBatch
[id=cdb120a8e61-2fa96531-1dbc-4218-bffd-ae79e85c4042,
reqs=[DynamicCacheChangeRequest
[cacheName=acc_a385248f4e6fa000_searchKeywords, hasCfg=true,
nodeId=acabfdd3-24aa-4e0a-b6a6-1a9f592f6e39, clientStartOnly=false,
stop=false, destroy=false, disabledAfterStartfalse]],
exchangeActions=ExchangeActions
[startCaches=[acc_a385248f4e6fa000_searchKeywords], stopCaches=null,
startGrps=[], stopGrps=[], resetParts=null, stateChangeRequest=null],
startCaches=false], allowMerge=false]
[2019-11-21T07:31:18,727][INFO
][exchange-worker-#219][CacheAffinitySharedManager] Updating caches registry
performed in 3 ms.
[2019-11-21T07:31:18,730][INFO ][exchange-worker-#219][GridCacheProcessor]
Started cache [name=acc_a385248f4e6fa000_searchKeywords, id=-333924348,
group=searchKeywordsCacheGroup, memoryPolicyName=default, mode=PARTITIONED,
atomicity=ATOMIC, backups=1, mvcc=false], encryptionEnabled=false]
[2019-11-21T07:31:18,730][INFO
][exchange-worker-#219][CacheAffinitySharedManager] Caches starting
performed in 3 ms.
[2019-11-21T07:31:18,730][INFO
][exchange-worker-#219][CacheAffinitySharedManager] Affinity initialization
for started caches performed in 0 ms.
[2019-11-21T07:31:19,235][INFO
][exchange-worker-#219][GridDhtPartitionsExchangeFuture] Finished waiting
for partition release future [topVer=AffinityTopologyVersion [topVer=2118,
minorTopVer=13], waitTime=490ms, futInfo=NA, mode=DISTRIBUTED]
[2019-11-21T07:31:19,526][INFO
][exchange-worker-#219][GridDhtPartitionsExchangeFuture] Finished waiting
for partitions release latch: ClientLatch [coordinator=TcpDiscoveryNode
[id=5704fca2-4b03-4a5d-b6fb-ca1d8eda94d6, addrs=[127.0.0.1, 192.168.199.55],
sockAddrs=[/127.0.0.1:47500, /192.168.199.55:47500], discPort=47500,
order=10, intOrder=10, lastExchangeTime=1569598018473, loc=false,
ver=2.7.6#20190911-sha1:21f7ca41, isClient=false], ackSent=true,
super=CompletableLatch [id=exchange, topVer=AffinityTopologyVersion
[topVer=2118, minorTopVer=13]]]
[2019-11-21T07:31:19,527][INFO
][exchange-worker-#219][GridDhtPartitionsExchangeFuture] Finished waiting
for partition release future [topVer=AffinityTopologyVersion [topVer=2118,
minorTopVer=13], waitTime=0ms, futInfo=NA, mode=LOCAL]
[2019-11-21T07:31:19,527][INFO
][exchange-worker-#219][GridCacheDatabaseSharedManager] Logical recovery
performed in 0 ms.
[2019-11-21T07:31:19,691][INFO
][exchange-worker-#219][GridDhtPartitionsExchangeFuture] Sending Single
Message performed in 87 ms.
[2019-11-21T07:31:19,691][INFO ][exchange-worker-#219][time] Finished
exchange init [topVer=AffinityTopologyVersion [topVer=2118, minorTopVer=13],
crd=false]
[2019-11-21T07:31:22,669][INFO
][sys-#15799825][GridDhtPartitionsExchangeFuture] Received full message,
will finish exchange [node=5704fca2-4b03-4a5d-b6fb-ca1d8eda94d6,
resVer=AffinityTopologyVersion [topVer=2118, minorTopVer=13]]
[2019-11-21T07:31:22,669][INFO
][sys-#15799825][GridDhtPartitionsExchangeFuture] Affinity changes applied
in 0 ms.
[2019-11-21T07:31:22,775][INFO
][sys-#15799825][GridDhtPartitionsExchangeFuture] Full map updating for 317
groups performed in 105 ms.
[2019-11-21T07:31:22,775][INFO
][sys-#15799825][GridDhtPartitionsExchangeFuture] Finish exchange future
[startVer=AffinityTopologyVersion [topVer=2118, minorTopVer=13],
resVer=AffinityTopologyVersion [topVer=2118, minorTopVer=13], err=null]
[2019-11-21T07:31:23,015][INFO ][sys-#15799825][GridCacheProcessor] Finish
proxy initialization, cacheName=acc_a385248f4e6fa000_searchKeywords,
localNodeId=5f730d51-40de-49e5-aa3e-fa030080d553
[2019-11-21T07:31:23,017][INFO
][sys-#15799825][GridDhtPartitionsExchangeFuture] Completed partition
exchange [localNode=5f730d51-40de-49e5-aa3e-fa030080d553,
exchange=GridDhtPartitionsExchangeFuture [topVer=AffinityTopologyVersion
[topVer=2118, minorTopVer=13], evt=DISCOVERY_CUSTOM_EVT,
evtNode=TcpDiscoveryNode [id=acabfdd3-24aa-4e0a-b6a6-1a9f592f6e39,
addrs=[127.0.0.1, 192.168.199.110], sockAddrs=[/127.0.0.1:0,
/192.168.199.110:0], discPort=0, order=2067, intOrder=1059,
lastExchangeTime=1574230853730, loc=false, ver=2.7.6#20190911-sha1:21f7ca41,
isClient=true], done=true], topVer=AffinityTopologyVersion [topVer=2118,
minorTopVer=13], durationFromInit=4290]
[2019-11-21T07:31:23,724][INFO
][exchange-worker-#219][GridCachePartitionExchangeManager] Skipping
rebalancing (nothing scheduled) [top=AffinityTopologyVersion [topVer=2118,
minorTopVer=13], force=false, evt=DISCOVERY_CUSTOM_EVT,
node=acabfdd3-24aa-4e0a-b6a6-1a9f592f6e39]

Each cache start/stop is a cluster-wide operation: every server node is
expected to start this cache. To do that, a message with
DynamicCacheChangeBatch must be sent to the coordinator, which propagates
it through the topology ring. Also, the partition map must be recalculated
and updated on all server nodes so that every node in the topology knows
the current distribution of cache partitions between data nodes. This can
get quite heavy given that it is performed on a large topology, and the
network must be considered as well.
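
As an illustration only (the cache names are made up and this is not taken
from the attached logs), this is the kind of client-side call that kicks
off such an exchange, plus the batch variant which, as far as I know,
submits several cache starts in one request so that fewer exchanges are
needed:

import java.util.Arrays;
import java.util.List;

import org.apache.ignite.Ignite;
import org.apache.ignite.Ignition;
import org.apache.ignite.configuration.CacheConfiguration;

public class DynamicCacheStart {
    public static void main(String[] args) {
        // In the thread's setup this would be a client node joining the 52-node cluster.
        Ignite ignite = Ignition.start();

        // A single dynamic cache start publishes a DynamicCacheChangeBatch and runs
        // a distributed partition map exchange across the whole topology.
        ignite.getOrCreateCache(new CacheConfiguration<>("acc_xxx_searchKeywords").setBackups(1));

        // If several caches have to be created at once, the batch API sends them
        // together, which should mean fewer exchanges than one-by-one calls.
        List<CacheConfiguration> cfgs = Arrays.asList(
            new CacheConfiguration("acc_xxx_cacheA"),
            new CacheConfiguration("acc_xxx_cacheB"));

        ignite.getOrCreateCaches(cfgs);
    }
}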

You can find this exchange logged on all nodes and check the timestamps to
see how the process took place and on which node.

However, I do not see any timeouts around this exchange process.

Best regards,
Anton



--
Sent from: http://apache-ignite-users.70518.x6.nabble.com/