Posted to solr-user@lucene.apache.org by "rubi.hali" <ru...@gmail.com> on 2018/02/08 07:41:28 UTC

High CPU and Physical Memory Usage in solr with 4000 user load

Hi

We are using Solr 6.6.2 with one master and 4 slaves, each with 4 CPU cores
and 16GB RAM. We were doing load testing with 4000 users and 800-odd search
keywords, which resulted in 95% CPU usage in less than 3 minutes and
affected our query responses. There was also a spike in physical memory
which did not go down even after we stopped sending load.

Our JVM heap for Solr is 8GB, which still leaves 8GB for the OS.

We have 4 lakh (400,000) documents in Solr.

Our cache configurations in Solr are:

<documentCache class="solr.LRUCache"
                   size="50000"
                   initialSize="30000"
                   autowarmCount="0"/>

 <filterCache class="solr.FastLRUCache"
                 size="4096"
                 initialSize="512"
                 autowarmCount="20"/>

<queryResultCache class="solr.LRUCache"
                     size="256"
                     initialSize="256"
                     autowarmCount="0"/>


We have enabled autocommit 
     <autoCommit> 	   
       <maxDocs>${solr.autoCommit.maxDocs:25000}</maxDocs>
	   <maxTime>${solr.autoCommit.maxTime:60000}</maxTime> 
       <openSearcher>true</openSearcher> (This is true only in case of
slaves) 
     </autoCommit>

 We are also doing softcommit
 <autoSoftCommit> 
       <maxTime>${solr.autoSoftCommit.maxTime:300000}</maxTime> 
 </autoSoftCommit>

Our queries use grouping, so the query result cache doesn't get used.
But even so, under heavy load we are seeing this behaviour, which results
in high response times.

Please suggest if there is any configuration mismatch or OS issue we
should resolve to bring down our high response times.

--
Sent from: http://lucene.472066.n3.nabble.com/Solr-User-f472068.html

Re: High CPU and Physical Memory Usage in solr with 4000 user load

Posted by "rubi.hali" <ru...@gmail.com>.
Hi Shawn

As asked, I have attached the GC log and a snapshot of the top command:
TopCommandSlave1.jpg
<http://lucene.472066.n3.nabble.com/file/t493418/TopCommandSlave1.jpg>
Regarding the blocked threads: we are fetching facets and doing grouping
with the main query, and docValues were not enabled for those fields. We
have now enabled docValues for them and no longer see those blocked threads.
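
The schema change was along these lines (the field names below are only
illustrative, not our real ones), followed by a full reindex, since enabling
docValues on an existing field requires the documents to be reindexed:

   <field name="brand"    type="string" indexed="true" stored="true" docValues="true"/>
   <field name="category" type="string" indexed="true" stored="true" docValues="true"/>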

Regarding the custom handler, it is just a wrapper that changes the qf
parameter based on some conditions, so that should not be a problem.

CPU usage has also come down to 70%, but there is still a concern about the
continuous rise in physical memory, even though our heap is not being
utilized that much.

There are also blocked threads on QueuedThreadPool. Not sure if this is an
issue or expected, as they should get released immediately.
solr_gc.current
<http://lucene.472066.n3.nabble.com/file/t493418/solr_gc.current>

--
Sent from: http://lucene.472066.n3.nabble.com/Solr-User-f472068.html

Re: High CPU and Physical Memory Usage in solr with 4000 user load

Posted by Shawn Heisey <ap...@elyograg.org>.
On 2/9/2018 12:48 AM, rubi.hali wrote:
> As threads are blocked, our CPU is reaching 90% even when heap or OS
> memory is not getting used at all.
>
> One of the BLOCKED thread snippets is below.
> Most of the threads are blocked either on the Jetty connector or on
> FieldCacheImpl for docValues.  Slave1ThreadDump5.txt
> <http://lucene.472066.n3.nabble.com/file/t493418/Slave1ThreadDump5.txt>
> Slave1ThreadDump4.txt
> <http://lucene.472066.n3.nabble.com/file/t493418/Slave1ThreadDump4.txt>

Threads being blocked shouldn't cause high CPU.  It would generally be 
caused by whatever is causing the threads to be blocked.

I uploaded the Slave1ThreadDump5.txt file from the first link to the 
http://fastthread.io website.

The analysis noticed that 45 threads were blocked by another thread.  At 
the end of this message is the stacktrace for the thread blocking the 
others.

What jumps out at me is the mention of a custom plugin for Solr in the 
stacktrace.  Here's the method mentioned:

com.ril.solr.request.handler.RILMultipleQuerySearchHandler.handleRequestBody

Could there be a bug in that handler causing problems?  I can't say for 
sure, but this IS something out of the ordinary.

This thread is blocking other threads, but it is itself blocked.   Says 
it's waiting for 0x00007f24d881e000, but this entry does not appear 
anywhere else in the thread dump.  So I can't tell what's actually 
blocking it.

I also tried putting the thread dump into this page:

https://spotify.github.io/threaddump-analyzer/

The analysis there says that qtp1205044462-196 is "inconsistent" and the 
mouseover for that says that the Thread is blocked on "BLOCKED (on 
object monitor)" without waiting for anything.

I can't find anything definitive, but it MIGHT be blocked by Java's own 
internal operations: things like memory allocations, GC pauses, etc.

I noticed that the stacktrace includes mention of Solr's facet classes, 
suggesting that these queries include facets or grouping.  Facets and 
grouping can require very large amounts of heap, especially if the 
fields being used do not have docValues enabled.  You may need to make 
your heap larger so that there are fewer garbage collection events.

I have no idea what the custom search handler is going to do to Solr's 
memory requirements.

Solr should create GC logs if you have started it with the included 
script.  Can you share a GC log that includes the time the load testing 
is underway and the problem appears?

Can you log onto the Linux server, run the "top" program (not htop or 
any other variant), press shift-M to sort the list by memory, grab a 
screenshot, and share it?

Thanks,
Shawn


========================
"qtp1205044462-196" #196 prio=5 os_prio=0 tid=0x00007f2374002800 
nid=0x5171 waiting for monitor entry [0x00007f24d881e000]
    java.lang.Thread.State: BLOCKED (on object monitor)
        at org.apache.solr.uninverting.FieldCacheImpl$Cache.get(FieldCacheImpl.java:167)
        - locked <0x0000000642cd7568> (a java.util.WeakHashMap)
        at org.apache.solr.uninverting.FieldCacheImpl.getTermsIndex(FieldCacheImpl.java:767)
        at org.apache.solr.uninverting.FieldCacheImpl.getTermsIndex(FieldCacheImpl.java:747)
        at org.apache.solr.uninverting.UninvertingReader.getSortedDocValues(UninvertingReader.java:319)
        at org.apache.lucene.index.FilterLeafReader.getSortedDocValues(FilterLeafReader.java:448)
        at org.apache.lucene.index.DocValues.getSorted(DocValues.java:262)
        at org.apache.lucene.search.grouping.term.TermGroupFacetCollector$SV.doSetNextReader(TermGroupFacetCollector.java:128)
        at org.apache.lucene.search.SimpleCollector.getLeafCollector(SimpleCollector.java:33)
        at org.apache.lucene.search.IndexSearcher.search(IndexSearcher.java:659)
        at org.apache.lucene.search.IndexSearcher.search(IndexSearcher.java:472)
        at org.apache.solr.request.SimpleFacets.getGroupedCounts(SimpleFacets.java:692)
        at org.apache.solr.request.SimpleFacets.getTermCounts(SimpleFacets.java:476)
        at org.apache.solr.request.SimpleFacets.getTermCounts(SimpleFacets.java:405)
        at org.apache.solr.request.SimpleFacets.lambda$getFacetFieldCounts$0(SimpleFacets.java:803)
        at org.apache.solr.request.SimpleFacets$$Lambda$207/1284277177.call(Unknown Source)
        at java.util.concurrent.FutureTask.run(FutureTask.java:266)
        at org.apache.solr.request.SimpleFacets$3.execute(SimpleFacets.java:742)
        at org.apache.solr.request.SimpleFacets.getFacetFieldCounts(SimpleFacets.java:818)
        at org.apache.solr.handler.component.FacetComponent.getFacetCounts(FacetComponent.java:330)
        at org.apache.solr.handler.component.FacetComponent.process(FacetComponent.java:274)
        at org.apache.solr.handler.component.SearchHandler.handleRequestBody(SearchHandler.java:296)
        at com.ril.solr.request.handler.RILMultipleQuerySearchHandler.handleRequestBody(RILMultipleQuerySearchHandler.java:39)
        at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:173)
        at org.apache.solr.core.SolrCore.execute(SolrCore.java:2477)
        at org.apache.solr.servlet.HttpSolrCall.execute(HttpSolrCall.java:723)
        at org.apache.solr.servlet.HttpSolrCall.call(HttpSolrCall.java:529)
        at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:361)
        at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:305)
        at org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1691)
        at org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:582)
        at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:143)
        at org.eclipse.jetty.security.SecurityHandler.handle(SecurityHandler.java:548)
        at org.eclipse.jetty.server.session.SessionHandler.doHandle(SessionHandler.java:226)
        at org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1180)
        at org.eclipse.jetty.servlet.ServletHandler.doScope(ServletHandler.java:512)
        at org.eclipse.jetty.server.session.SessionHandler.doScope(SessionHandler.java:185)
        at org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1112)
        at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:141)
        at org.eclipse.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:213)
        at org.eclipse.jetty.server.handler.HandlerCollection.handle(HandlerCollection.java:119)
        at org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:134)
        at org.eclipse.jetty.rewrite.handler.RewriteHandler.handle(RewriteHandler.java:335)
        at org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:134)
        at org.eclipse.jetty.server.Server.handle(Server.java:534)
        at org.eclipse.jetty.server.HttpChannel.handle(HttpChannel.java:320)
        at org.eclipse.jetty.server.HttpConnection.onFillable(HttpConnection.java:251)
        at org.eclipse.jetty.io.AbstractConnection$ReadCallback.succeeded(AbstractConnection.java:273)
        at org.eclipse.jetty.io.FillInterest.fillable(FillInterest.java:95)
        at org.eclipse.jetty.io.SelectChannelEndPoint$2.run(SelectChannelEndPoint.java:93)
        at org.eclipse.jetty.util.thread.strategy.ExecuteProduceConsume.executeProduceConsume(ExecuteProduceConsume.java:303)
        at org.eclipse.jetty.util.thread.strategy.ExecuteProduceConsume.produceConsume(ExecuteProduceConsume.java:148)
        at org.eclipse.jetty.util.thread.strategy.ExecuteProduceConsume.run(ExecuteProduceConsume.java:136)
        at org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:671)
        at org.eclipse.jetty.util.thread.QueuedThreadPool$2.run(QueuedThreadPool.java:589)
        at java.lang.Thread.run(Thread.java:748)

    Locked ownable synchronizers:
	- None
========================

Re: High CPU and Physical Memory Usage in solr with 4000 user load

Posted by "rubi.hali" <ru...@gmail.com>.
Hi Shawn

We tried one more round of testing after increasing the CPU cores on each
server from 4 to 16, which amounts to 64 cores in total across the 4 slaves.
But CPU usage was still high, so we took thread dumps on one of our slaves
and found that threads were blocked. I have attached them as well.

As far as index size is concerned, we have only one index of about 1.5 GB,
amounting to 4 lakh (400,000) docs on each server. The query load over a
span of 10 minutes was distributed across the slaves as:
slave 1 - 1258
slave 2 - 512
slave 3 - 256
slave 4 - 1384

We are using Linux OS. 

As threads are blocked, our CPU is reaching 90% even when heap or OS
memory is not getting used at all.

One of the BLOCKED thread snippets is below.
Most of the threads are blocked either on the Jetty connector or on
FieldCacheImpl for docValues.  Slave1ThreadDump5.txt
<http://lucene.472066.n3.nabble.com/file/t493418/Slave1ThreadDump5.txt>
Slave1ThreadDump4.txt
<http://lucene.472066.n3.nabble.com/file/t493418/Slave1ThreadDump4.txt>

"qtp1205044462-26-acceptor-1@706f9d47-ServerConnector@7ea87d3a{HTTP/1.1,[http/1.1]}{0.0.0.0:8983}"
#26 prio=5 os_prio=0 tid=0x00007f2530501800 nid=0x4f9c waiting for monitor
entry [0x00007f2505eb8000]
   java.lang.Thread.State: BLOCKED (on object monitor)
        at sun.nio.ch.ServerSocketChannelImpl.accept(ServerSocketChannelImpl.java:233)
        - waiting to lock <0x000000064067fc68> (a java.lang.Object)
        at org.eclipse.jetty.server.ServerConnector.accept(ServerConnector.java:373)
        at org.eclipse.jetty.server.AbstractConnector$Acceptor.run(AbstractConnector.java:601)
        at org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:671)
        at org.eclipse.jetty.util.thread.QueuedThreadPool$2.run(QueuedThreadPool.java:589)
        at java.lang.Thread.run(Thread.java:748)

   Locked ownable synchronizers:
	- None

--
Sent from: http://lucene.472066.n3.nabble.com/Solr-User-f472068.html

Re: High CPU and Physical Memory Usage in solr with 4000 user load

Posted by Shawn Heisey <ap...@elyograg.org>.
On 2/8/2018 12:41 AM, rubi.hali wrote:
> We are using Solr 6.6.2 with one master and 4 slaves, each with 4 CPU cores
> and 16GB RAM. We were doing load testing with 4000 users and 800-odd search
> keywords, which resulted in 95% CPU usage in less than 3 minutes and
> affected our query responses. There was also a spike in physical memory
> which did not go down even after we stopped sending load.

How many queries per second is that sending?  Thousands of queries per
second may need more than five servers.

> Our JVM heap for Solr is 8GB, which still leaves 8GB for the OS.
>
> We have 4 lakh (400,000) documents in Solr.

How big (disk space) is all the index data being handled by one of your
Solr servers?  Does all that index data add up to 400000 documents, or
is that just the document count for one index out of multiple?

> Our Cache Configurations done in SOLR are
>
> <documentCache class="solr.LRUCache"
>                    size="50000"
>                    initialSize="30000"
>                    autowarmCount="0"/>

This is fairly large.  May not be a problem, but that will depend on how
big each document is.

>  <filterCache class="solr.FastLRUCache"
>                  size="4096"
>                  initialSize="512"
>                  autowarmCount="20"/>

This is quite large for a filterCache.  It probably needs to be
smaller.  Be careful with autowarming on the filterCache -- sometimes
filters can execute very slowly.  I had to reduce my autowarmCount on
the filterCache to *four* in order for commits to be fast enough.

> <queryResultCache class="solr.LRUCache"
>                      size="256"
>                      initialSize="256"
>                      autowarmCount="0"/>

I would probably use a non-zero autowarmCount here.  But don't make it
TOO large.
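
As a rough sketch only (the numbers below are guesses and would need tuning
against your actual hit ratios, which you can watch in the cache statistics
in the admin UI), something like this might be a saner starting point:

 <filterCache class="solr.FastLRUCache"
                 size="512"
                 initialSize="512"
                 autowarmCount="4"/>

 <queryResultCache class="solr.LRUCache"
                      size="256"
                      initialSize="256"
                      autowarmCount="32"/>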

> We have enabled autocommit 
>      <autoCommit> 	   
>        <maxDocs>${solr.autoCommit.maxDocs:25000}</maxDocs>
> 	   <maxTime>${solr.autoCommit.maxTime:60000}</maxTime> 
>        <openSearcher>true</openSearcher> (This is true only in case of
> slaves) 
>      </autoCommit>

I would recommend that you remove maxDocs and let it just be
time-based.  Also, *all* servers should have openSearcher set to false
on autoCommit.

>  We are also doing softcommit
>  <autoSoftCommit> 
>        <maxTime>${solr.autoSoftCommit.maxTime:300000}</maxTime> 
>  </autoSoftCommit>

This is good.  You could probably lower it to two minutes instead of five.
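
Putting the commit suggestions together, the relevant part of solrconfig.xml
might look roughly like this (the values are examples; keep your own property
names and defaults if you prefer):

     <autoCommit>
       <maxTime>${solr.autoCommit.maxTime:60000}</maxTime>
       <openSearcher>false</openSearcher>
     </autoCommit>

     <autoSoftCommit>
       <maxTime>${solr.autoSoftCommit.maxTime:120000}</maxTime>
     </autoSoftCommit>

With openSearcher set to false, it is the soft commit that controls when
changes become visible to queries.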

> Our queries use grouping, so the query result cache doesn't get used.
> But even so, under heavy load we are seeing this behaviour, which results
> in high response times.
>
> Please suggest if there is any configuration mismatch or OS issue we
> should resolve to bring down our high response times.

If your query load is going to be high, you probably need to add more
servers.

To check whether things are optimized well on each server:  What OS is
it running on?  I would like to get some more information from your
install, but the exact method for obtaining that information will vary
depending on the OS.

Can you also provide a large solr_gc.log file so that can be analyzed?

Thanks,
Shawn