Posted to solr-user@lucene.apache.org by Anthony Groves <ag...@oreilly.com> on 2020/04/14 13:07:14 UTC

Giant ConcurrentMarkSweep GC activity causing issues in Solr 7

Hi everyone,

Hoping to get your thoughts on a nasty GC issue we’ve been having after
upgrading our cluster to Solr 7. Our cluster is NOT Solr Cloud, but rather
one master node (handles all indexing) and four slave nodes replicating
from the master (they handle all search queries, with a round-robin load
balancer in front).


Periodically, we will see one or two of our Solr slave nodes become
unresponsive due to high CPU and memory usage from garbage collection
activities. The “incident” almost always follows this same trend:



   - 50%-70% CPU usage from the GC ParNew activity, and during that time
     about 10% CPU usage from ConcurrentMarkSweep.
   - This goes on for about 8 minutes, and then ParNew activity stops but
     ConcurrentMarkSweep continues, now taking up almost 100% of the CPU.
   - During this entire time, the overall heap usage and the CMS Old Gen
     usage steadily increase (they never drop back down) until there is no
     available memory left (see the GC-logging sketch after this list).
   - A searcher autowarm is usually triggered sometime during the incident,
     and it lasts longer than autowarm usually does, but I assume that’s
     because so much CPU is being used by the GC.
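
To pin down what is actually filling the old gen during one of these
incidents, detailed GC logging seems like the next step. A minimal sketch
of the JDK 11 unified-logging option (the log path and rotation values are
placeholders; Solr's start script can also enable GC logging via
GC_LOG_OPTS in solr.in.sh):

-Xlog:gc*,gc+age=trace,safepoint:file=/var/solr/logs/solr_gc.log:time,uptime:filecount=9,filesize=20M

The interesting thing to check in that log is whether old gen occupancy
ever drops after a CMS cycle completes, or only climbs.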


Does anyone have advice on what we can do to avoid this event, or mitigate
its effects? I’m not married to any of our garbage collection settings, and
am open to switching to another collector such as G1GC if that is
recommended.
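
For reference, if we did move to G1, the kind of minimal starting point I
have seen suggested for heaps this size looks roughly like this (just a
sketch, not an official recommendation; the pause target in particular is
an assumption we would have to tune):

-XX:+UseG1GC
-XX:+ParallelRefProcEnabled
-XX:MaxGCPauseMillis=250
-XX:+AlwaysPreTouch

The CMS-specific flags listed below would be dropped rather than carried
over in that case.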

*Java version:* OpenJDK 64-Bit Server VM 11.0.6

*Memory:* 20GB base heap, 26GB max heap; 40GB RAM total on the server.

*GC configs:*
CMSInitiatingOccupancyFraction=15
CMSParallelRemarkEnabled
CMSScavengeBeforeRemark
ParallelRefProcEnabled
UseCMSInitiatingOccupancyOnly
UseConcMarkSweepGC
OmitStackTraceInFastThrow
CMSMaxAbortablePrecleanTime=6000
ConcGCThreads=4
MaxTenuringThreshold=8
NewRatio=3
ParallelGCThreads=4
PretenureSizeThreshold=64m
SurvivorRatio=4
TargetSurvivorRatio=90
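
For anyone who wants the exact JVM syntax, the flags above translate
roughly to this GC_TUNE line (assuming they are set in solr.in.sh, which is
where Solr's start script reads GC_TUNE from; OmitStackTraceInFastThrow is
an exception/JIT flag rather than a GC flag, so it is passed separately):

GC_TUNE="-XX:+UseConcMarkSweepGC -XX:+UseCMSInitiatingOccupancyOnly \
  -XX:CMSInitiatingOccupancyFraction=15 -XX:+CMSParallelRemarkEnabled \
  -XX:+CMSScavengeBeforeRemark -XX:+ParallelRefProcEnabled \
  -XX:CMSMaxAbortablePrecleanTime=6000 -XX:ConcGCThreads=4 \
  -XX:ParallelGCThreads=4 -XX:MaxTenuringThreshold=8 -XX:NewRatio=3 \
  -XX:PretenureSizeThreshold=64m -XX:SurvivorRatio=4 \
  -XX:TargetSurvivorRatio=90"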

Our actual memory need is closer to 14-16GB, based on cache usage. However,
I seemed to be able to keep the “incident” from happening nearly as often
by increasing the base heap to 20GB while decreasing
CMSInitiatingOccupancyFraction to 15%, so that GC is triggered more often
(but leaves a lot more spare heap that can be used while a collection is
running). Pretty hacky, totally open to suggestions :-)
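
A simple way to watch whether the old gen ever actually shrinks outside of
an incident is to sample it live with jstat (the PID is a placeholder, the
interval is in milliseconds); in the -gcutil output, O is old-gen occupancy
as a percentage and FGC/FGCT count full collections:

jstat -gcutil <solr-pid> 5000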

*Additional details on the cluster:*

   - Autocommit (hard, openSearcher) on the master node runs every 5
     minutes.
   - Slave nodes poll the master for replication every 20 seconds.
   - Searcher warming seems to take up a good amount of CPU (30%-80%) and
     typically runs every 5-10 minutes (it happens whenever replication
     pulls changes after a master autocommit).
   - The 4 slave nodes get 500-1000 requests per minute total, including
     search queries and fast suggester queries.
   - Slave nodes sometimes handle heavy facet queries (responses are 300%
     slower than usual).
   - Slave nodes sometimes handle deep paging with the “start” param
     (example request shape below).

Any advice at all is appreciated. Thanks all! Anthony