Posted to solr-user@lucene.apache.org by heaven <ah...@gmail.com> on 2014/12/18 23:36:59 UTC

Endless 100% CPU usage on searcherExecutor thread

Hi,

We have 2 shards, each one has 2 replicas and each Solr instance has a
single thread that constantly uses 100% of CPU:
<http://lucene.472066.n3.nabble.com/file/n4175088/Screenshot_896.png> 

After a restart it runs normally for some time (approximately until Solr
comes close to the Xmx limit), then the mentioned thread starts consuming one
full CPU core. With 4 Solr instances, that means 4 CPU cores lost.

We do not commit manually, and search is not used very intensively.

{code}
<autoCommit>
  <maxDocs>25000</maxDocs>
  <maxTime>300000</maxTime>
  <openSearcher>false</openSearcher> 
</autoCommit>

<autoSoftCommit>
  <maxTime>15000</maxTime>
</autoSoftCommit>
{code}

So I was wondering whether this is expected behavior, or whether
something is wrong with our configuration or with Solr.

Thanks,
Alex




Re: Endless 100% CPU usage on searcherExecutor thread

Posted by heaven <ah...@gmail.com>.
In general we do not have overly complex filters, but I decreased the
filterCache autowarm count to 256 and will see how it performs over a month
or so before making any further changes.
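
For reference, the filterCache element now looks roughly like this (a sketch; the sizes are what the admin page reports for us):

{code}
<filterCache class="solr.FastLRUCache"
             size="4096"
             initialSize="512"
             autowarmCount="256"/>
{code}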

It also seems that adding more shards could improve the situation. We have
16 CPU cores and SSD RAID 10, so I think it should be possible to increase
the number of shards from 2 to 5 or even 8.
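
If we go that route, I understand the Collections API can split an existing
shard in place, something like the following (hypothetical collection name;
each SPLITSHARD call turns one shard into two, so reaching 8 would take a
few rounds of splits):

{code}
curl "http://localhost:8983/solr/admin/collections?action=SPLITSHARD&collection=mycollection&shard=shard1"
{code}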




Re: Endless 100% CPU usage on searcherExecutor thread

Posted by Shawn Heisey <ap...@elyograg.org>.

Warming the filter cache *can* be very slow.  It all depends on exactly
what your filters are.  I had to reduce the autowarmCount on my
filterCache to *four* because if it was any higher, a commit would take
up to a minute.  We have some really complex filters.
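
For illustration, that works out to a cache element along these lines (a sketch; the size value here is made up, the tiny autowarmCount is the point):

{code}
<filterCache class="solr.FastLRUCache"
             size="512"
             initialSize="512"
             autowarmCount="4"/>
{code}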

Thanks,
Shawn


Re: Endless 100% CPU usage on searcherExecutor thread

Posted by heaven <ah...@gmail.com>.
We do not use dates here, at least not often. Usually it's something like
type:Profile (we use it from the Rails application, so type describes
model names), opted_in:true, etc. Solr hasn't been running for long, though,
so this may not reflect the real state.
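
A typical internal request looks something like this (hypothetical host and collection name):

{code}
curl "http://localhost:8983/solr/mycollection/select?q=*:*&fq=type:Profile&fq=opted_in:true"
{code}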

Currently the filter cache shows a hit ratio of 1, and the query result
cache 0.84. I also increased the cache settings to autowarm: 512,
initial: 1024, and size: 4096, though the full size is never actually
reached because of the commits.

Best,
Alex




Re: Endless 100% CPU usage on searcherExecutor thread

Posted by Erick Erickson <er...@gmail.com>.
Milliseconds. The thing to track here is your
cumulative_hitratio.

0.7 isn't bad, but it's not great either. I'd be really
curious what kinds of fq clauses you're entering;
anything that mentions NOW is potentially a
waste unless you round with "date math"....
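
For example (a sketch):

{code}
fq=date:[* TO NOW]       <- NOW has millisecond precision, so this entry is never reused from the filterCache
fq=date:[* TO NOW/DAY]   <- rounded down with date math, identical all day and therefore cacheable
{code}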

FWIW,
Erick

On Mon, Dec 22, 2014 at 3:52 AM, heaven <ah...@gmail.com> wrote:
> Is warmupTime in milliseconds or seconds?

Re: Endless 100% CPU usage on searcherExecutor thread

Posted by heaven <ah...@gmail.com>.
It is getting better now with smaller caches like this:
filterCache
class:org.apache.solr.search.FastLRUCache
version:1.0
description:Concurrent LRU Cache(maxSize=4096, initialSize=512,
minSize=3686, acceptableSize=3891, cleanupThread=false, autowarmCount=256,
regenerator=org.apache.solr.search.SolrIndexSearcher$2@4668b788)
src:null
stats:
lookups:34
hits:33
hitratio:0.97
inserts:1
evictions:0
size:282
warmupTime:1879
cumulative_lookups:51190
cumulative_hits:35938
cumulative_hitratio:0.7
cumulative_inserts:15252
cumulative_evictions:0

Is warmupTime in milliseconds or seconds?




Re: Endless 100% CPU usage on searcherExecutor thread

Posted by Erick Erickson <er...@gmail.com>.
50K is still very, very large. You say you have 50M docs/node. Each
filterCache entry will be on the order of 6 MB (roughly maxDoc/8 bytes).
Times 50,000 entries (possible if you ever turn indexing off), that's
about 300 GB of memory for your filter cache alone. There are OOMs out
there with your name on them, just waiting to happen at 3:00 AM after
you've been at a party....
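
The rough arithmetic (assuming maxDoc of about 50M per node):

{code}
50,000,000 docs / 8 bits per doc  =  6,250,000 bytes  (~6 MB per filterCache entry)
~6 MB x 50,000 entries            =  ~300 GB for the filterCache alone
{code}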

The only thing I suspect is saving you is that your soft commit interval is
short enough that you don't have a chance to build up that many cache
entries. Check the size on the Solr admin page to see if I'm on the
right track here....

And your excessive autowarm settings are very likely to be the source
of your CPU utilization, at least that's what I'd investigate first.

Best,
Erick


Re: Endless 100% CPU usage on searcherExecutor thread

Posted by heaven <ah...@gmail.com>.
Okay, thanks for the suggestion, I will try decreasing the caches gradually.
Each node has nearly 50,000,000 docs, so perhaps we need more shards...

We had smaller caches before, but that was leading to bad feedback from our
users. Besides our application users, we also use Solr internally for data
analysis (very basic, simple searches for lists of keywords to determine a
doc's category, but we run a lot of such queries).

Previously it was possible to point those internal queries to one node
(replica) and queries received from users to another, so the caches did
not interfere with each other. I am not sure how to do this now with
SolrCloud; it seems it doesn't matter which node we send requests to,
SolrCloud decides which nodes will process them. Am I wrong?
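
One thing that might still work is pinning a request to specific cores with
an explicit shards parameter, e.g. (hypothetical hosts and core names):

{code}
curl "http://host1:8983/solr/mycollection/select?q=*:*&shards=host1:8983/solr/mycollection_shard1_replica1,host2:8983/solr/mycollection_shard2_replica1"
{code}

I don't know whether that is the intended approach in SolrCloud, though.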

Best,
Alex




Re: Endless 100% CPU usage on searcherExecutor thread

Posted by Erick Erickson <er...@gmail.com>.
As Shalin points out, these cache sizes are waaaay outside the norm.

For filterCache, each entry is roughly maxDoc/8 bytes. You haven't told
us how many docs are on the node, but you can find maxDoc on
the admin page. What I _have_ seen is a similar situation, and
if you ever stop indexing you'll get OOM errors.

Here's the scenario:
Every time you commit, the cache is thrown away and up to
32,000 autowarm queries are fired. So entries 32,001 -> 512,000
are given back to the OS. You may only get another 10,000 filter
queries before the next commit, so this cache is capped.
But if you ever stop indexing (and thus committing), you'll keep adding
to the cache and blow memory.

The filterCache and queryResultCache are maps. The key is
the query (or filter query) and the value is some representation
of the matching documents. You are set up to execute 48,000
autowarm queries (32,000 filter plus 16,000 query result) every
time you commit while indexing, which in your case is every 15
seconds (the soft commit interval). The only thing that's saving
you, I suspect, is that your cache isn't actually being filled up to
anywhere near even 32,000. But here's another
prediction: if you keep running this with varying queries, you'll
get slower and slower and slower. You'll see WARN messages
in your log about "max warming searchers exceeded". And
eventually you'll blow up memory. You can simulate this by
continually submitting fq clauses with bare NOW clauses, like
fq=date:[* TO NOW]....
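
A sketch of that simulation (hypothetical host and collection; every request
carries a distinct NOW, so each one inserts a brand-new filterCache entry):

{code}
while true; do
  curl -s "http://localhost:8983/solr/mycollection/select?q=*:*&fq=date:%5B*%20TO%20NOW%5D" > /dev/null
done
{code}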

Really, start with your caches closer to 512 and an autowarm of
16 or so. Look at the admin page for your hit ratios and adjust.
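
In solrconfig.xml terms, something like this as a starting point (not tuned values):

{code}
<filterCache class="solr.FastLRUCache"
             size="512"
             initialSize="512"
             autowarmCount="16"/>

<queryResultCache class="solr.LRUCache"
                  size="512"
                  initialSize="512"
                  autowarmCount="16"/>
{code}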

Best,
Erick


Re: Endless 100% CPU usage on searcherExecutor thread

Posted by heaven <ah...@gmail.com>.
Thanks, halved the caches, increased the heap size to 16G,
configured Huge Pages, and added these options:
-XX:+UseConcMarkSweepGC
-XX:+UseLargePages
-XX:+CMSParallelRemarkEnabled
-XX:+ParallelRefProcEnabled
-XX:+AggressiveOpts
-XX:CMSInitiatingOccupancyFraction=75

Best,
Alex




Re: Endless 100% CPU usage on searcherExecutor thread

Posted by Shalin Shekhar Mangar <sh...@gmail.com>.
Those are huge cache sizes. My guess is that the searcherExecutor thread is
spending too much time doing warming. Garbage collection may also be a
factor, as other people have pointed out.



-- 
Regards,
Shalin Shekhar Mangar.

Re: Endless 100% CPU usage on searcherExecutor thread

Posted by heaven <ah...@gmail.com>.
I have the following settings in my solrconfig.xml:

<filterCache class="solr.FastLRUCache"
                 size="512000"
                 initialSize="64000"
                 autowarmCount="32000"/>

<queryResultCache class="solr.LRUCache"
                 size="256000"
                 initialSize="32000"
                 autowarmCount="16000"/>

<documentCache class="solr.LRUCache"
                 size="128000"
                 initialSize="16000"
                 autowarmCount="8000"/>

What is the best way to calculate the optimal cache/heap sizes? I understand
there's no common formula and all docs differ in size, but -Xmx is
already 12G.

Thanks,
Alex




Re: Endless 100% CPU usage on searcherExecutor thread

Posted by Erick Erickson <er...@gmail.com>.
Right, I've seen situations where, as Solr uses a high percentage of the
available memory, Java spends more and more time in GC cycles. Say
you've allocated 8G to the heap. Say further that the "steady state" for
Solr needs 7.5G (numbers made up...). Now the GC algorithm only has
0.5G to play with, and it spends a lot of time compacting that 0.5G.

But, if you increase the heap to, say, 12G, the background GC threads
have much greater opportunities to collect unused memory without
interrupting normal Solr processing....


So as Michael says, bumping the -Xmx up may alleviate the problem
entirely.
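
With the stock Jetty start script that usually just means raising -Xmx,
e.g. (illustrative numbers):

{code}
java -Xms12g -Xmx12g -jar start.jar
{code}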

Best,
Erick


Re: Endless 100% CPU usage on searcherExecutor thread

Posted by Michael Della Bitta <mi...@appinions.com>.
I've been experiencing this problem. Running VisualVM on my instances
shows that they spend a lot of time creating WeakReferences
(org.apache.lucene.util.WeakIdentityMap$IdentityWeakReference, that is).
I think what's happening here is that the heap's not big enough for Lucene's
caches and it ends up thrashing.

You might try bumping up your heap some to see if that helps. It's made 
a difference for me, but mostly in delaying the onset and limiting the 
occurrence of this. Likely I just need an even larger heap.

Michael

