Posted to users@solr.apache.org by "Gajjar, Jigar" <ga...@oclc.org> on 2023/02/10 14:35:48 UTC
Only 1 node report high memory usage

Hi All,

We have a cluster of 30 nodes, and each node holds 750 GB of data.
There are 420 shards. Shards and data are evenly distributed across all nodes.
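For reference, the layout figures above work out as follows. This is only a back-of-the-envelope sketch. It assumes the 420 shards are spread evenly over the 30 nodes and that 750 GB is the per-node total. The post does not say whether 420 counts replicas.

```python
# Back-of-the-envelope layout figures taken from the numbers in this post.
# Assumption: shards are spread evenly and 750 GB is the per-node total;
# whether the 420 shards include replicas is not stated.
NODES = 30
SHARDS = 420
DATA_PER_NODE_GB = 750

shards_per_node = SHARDS / NODES                        # 14.0
data_per_shard_gb = DATA_PER_NODE_GB / shards_per_node  # about 53.6
total_data_tb = NODES * DATA_PER_NODE_GB / 1000         # 22.5

print(f"shards per node: {shards_per_node:.0f}")
print(f"data per shard:  {data_per_shard_gb:.1f} GB")
print(f"total data:      {total_data_tb:.1f} TB")
```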
JVM settings:

JDK: Amazon.com Inc. OpenJDK 64-Bit Server VM 17.0.1 17.0.1+12-LTS
Processors: 48
JVM Args:
-DSTOP.KEY=solrrocks
-DSTOP.PORT=7983
-Dcom.sun.management.jmxremote.authenticate=false
-Dcom.sun.management.jmxremote.local.only=false
-Dcom.sun.management.jmxremote.port=8986
-Dcom.sun.management.jmxremote.ssl=false
-Denable.packages=true
-Denable.runtime.lib=true
-Djava.net.preferIPv4Stack=true
-Djetty.home=/prod/solrCI/8.11.1-191/solr-8.11.1/server
-Djetty.port=8983
-Djute.maxbuffer=10000000
-Dsolr.data.home=
-Dsolr.data.home=/prod/solr_data/inst1
-Dsolr.default.confdir=/prod/solrCI/8.11.1-191/solr-8.11.1/server/solr/configsets/_default/conf
-Dsolr.environment=prod,label=PROD2+PRODUCTION,color=#c9fdd6
-Dsolr.install.dir=/prod/solrCI/8.11.1-191/solr-8.11.1
-Dsolr.jetty.inetaccess.excludes=
-Dsolr.jetty.inetaccess.includes=
-Dsolr.log.dir=/prod/solrCI/8.11.1-191/solr-8.11.1/server/logs
-Dsolr.solr.home=/prod/solr_home/inst1
-Duser.timezone=UTC
-DzkClientTimeout=30000
-DzkHost=<zookeeper_string>
-XX:+UseNUMA
-XX:+UseZGC
-XX:-OmitStackTraceInFastThrow
-XX:CompileCommand=exclude,com.github.benmanes.caffeine.cache.BoundedLocalCache::put
-XX:OnOutOfMemoryError=/prod/solrCI/8.11.1-191/solr-8.11.1/bin/oom_solr.sh 8983 /prod/solrCI/8.11.1-191/solr-8.11.1/server/logs
-XX:SoftMaxHeapSize=64g
-Xlog:gc*:file=/prod/solrCI/8.11.1-191/solr-8.11.1/server/logs/solr_gc.log:time,uptime:filecount=9,filesize=20M
-Xms88g
-Xmx88g
-Xss256k

What we observe is that only one node shows high heap usage, while all the other nodes stay well below the threshold.
You can see this in the attached image.

[Attached image: heap usage per node (not included in this text version)]
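One way to compare heap usage across all 30 nodes outside a dashboard is to poll each node's Solr Metrics API and flag nodes far above the cluster median. The sketch below is hypothetical. The host names in NODES are placeholders, port 8983 is taken from -Djetty.port above, and it assumes the default compact JSON layout of /solr/admin/metrics in Solr 8.x, where the jvm group exposes memory.heap.used and memory.heap.max under "solr.jvm". The 1.5x-median threshold is an arbitrary choice.

```python
import json
import statistics
import urllib.request

# Hypothetical node list: replace with the real Solr hosts.
# Port 8983 matches -Djetty.port in the JVM args above.
NODES = ["solr-node01:8983", "solr-node02:8983"]

# Assumed Metrics API query for the JVM heap gauges.
METRICS_PATH = "/solr/admin/metrics?group=jvm&prefix=memory.heap&wt=json"


def heap_fraction(metrics: dict) -> float:
    """Return used/max heap from a (compact) Metrics API response body."""
    jvm = metrics["metrics"]["solr.jvm"]
    return jvm["memory.heap.used"] / jvm["memory.heap.max"]


def fetch_heap_fraction(node: str, timeout: float = 10.0) -> float:
    """Query one node's Metrics API over HTTP and return its heap usage fraction."""
    url = f"http://{node}{METRICS_PATH}"
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return heap_fraction(json.load(resp))


def outliers(usage: dict, factor: float = 1.5) -> list:
    """Return the nodes whose heap usage exceeds `factor` times the cluster median."""
    median = statistics.median(usage.values())
    return sorted(node for node, frac in usage.items() if frac > factor * median)
```

Usage would be along the lines of `usage = {n: fetch_heap_fraction(n) for n in NODES}` followed by `print(outliers(usage))`. Comparing against the median rather than a fixed limit means a single hot node stands out even when the whole cluster's baseline shifts.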


Even if we bounce the node or the entire cluster, the same issue comes back, and it is always the same node that reports high heap usage.
We also tried reloading the collection, but that did not help.
It is also strange that only one node takes all the hits, and sometimes it just dies.


We compared that machine with all the other machines and confirmed that nothing is different.

If anyone has any pointers, they would be greatly appreciated.

Please let me know if you need more information.



Thanks,
Jigar Gajjar