Posted to solr-user@lucene.apache.org by richardg <ri...@dvdempire.com> on 2012/11/15 16:43:57 UTC

High Slave CPU Intermittently After Replication

Here is our setup:

Solr 4.0
Master replicates to three slaves after optimize
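
For readers unfamiliar with this kind of setup: replication after optimize
is normally declared on the master in solrconfig.xml, roughly as sketched
below.  This is a generic illustration, not the poster's actual
configuration; the confFiles value in particular is an assumption.

```xml
<!-- Master side of Solr's ReplicationHandler (sketch, not the poster's real config) -->
<requestHandler name="/replication" class="solr.ReplicationHandler">
  <lst name="master">
    <!-- Make a new index version available to slaves only after an optimize -->
    <str name="replicateAfter">optimize</str>
    <!-- Assumed: configuration files to ship along with the index -->
    <str name="confFiles">schema.xml,stopwords.txt</str>
  </lst>
</requestHandler>
```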

We have a problem where, every so often after replication, the CPU load on the
slave servers maxes out and requests slow to a crawl.

We do a dataimport every 10 minutes and, depending on the number of updates
since the last optimize, we run an update command with either
optimize=true&maxSegments=4 or just optimize=true (when there have been more
than 1500 updates since the last full optimize).  We had this issue more often
until we put the optimize update statements into our process.
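
The two optimize variants described above can also be sent as XML update
messages posted to the /update handler, each as its own request.  A sketch
(the attribute is spelled maxSegments; the logic that picks one or the other
based on the 1500-update threshold would live in the poster's own import
script, which is not shown in the thread):

```xml
<!-- Request 1: partial optimize, merge down to at most 4 segments -->
<optimize maxSegments="4"/>

<!-- Request 2: full optimize, merge down to a single segment -->
<optimize/>
```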

Everything had been running great for a week or so until today, when after
replication everything maxed out on all three slaves.  It isn't that things
get progressively worse; the issue occurs right after the replication.  The
only way to recover from it is to do an optimize=true update, and once it
replicates out things return to normal, with hardly any load on the
slaves at all.

There isn't any way to predict this issue, and so far I haven't seen anything
in the logs that would offer any clues.  Any ideas?





--
View this message in context: http://lucene.472066.n3.nabble.com/High-Slave-CPU-Intermittently-After-Replication-tp4020520.html
Sent from the Solr - User mailing list archive at Nabble.com.

Re: High Slave CPU Intermittently After Replication

Posted by richardg <ri...@dvdempire.com>.

Thanks for the tips.  In the meantime I increased the size of the filter
cache.  We are working w/ the Web team to pass better-designed queries, and
then I will adjust the filter cache size accordingly.



--
View this message in context: http://lucene.472066.n3.nabble.com/High-Slave-CPU-Intermittently-After-Replication-tp4020520p4023102.html

Re: High Slave CPU Intermittently After Replication

Posted by Otis Gospodnetic <ot...@gmail.com>.

I think Shawn may be right.
Filter caches are often pretty small because the number of distinct filter
queries in a system is typically very small.
For example, I just peeked at stats for search-lucene.com and the current
numbers for the filter cache are:

evictions : 0
size : 156
warmupTime : 3141

As you can see, just 156 in size.  But that is because this filter cache is
really only used for faceting, and there is a limited number of facets
offered.
So what do your filter queries look like that you need a filter cache with
size 1024?  (And it looks like that's still not quite enough.  If you hover
over the eviction line in those SPM graphs you will see whether you have
evictions during normal times.)

Otis
--
SOLR Performance Monitoring - http://sematext.com/spm/index.html
Search Analytics - http://sematext.com/search-analytics/index.html

On Mon, Nov 26, 2012 at 4:41 PM, richardg <ri...@dvdempire.com> wrote:

> We started having high load again today a few times today, each time
> looking
> at SPM monitor our filter cache starts having high lookups and low hit
> rate,
> this is our filtercache setting:
>
> <filterCache class="solr.FastLRUCache"
>                  size="1024"
>                  initialSize="512"
>                  autowarmCount="512"/>
>
> would it be possibly too slow?
>
> here is the graph when it happens:
>
> <http://lucene.472066.n3.nabble.com/file/n4022464/filter_cache.png>
>
>
>
> --
> View this message in context:
> http://lucene.472066.n3.nabble.com/High-Slave-CPU-Intermittently-After-Replication-tp4020520p4022464.html
> Sent from the Solr - User mailing list archive at Nabble.com.
>

Re: High Slave CPU Intermittently After Replication

Posted by Shawn Heisey <so...@elyograg.org>.

On 11/26/2012 2:41 PM, richardg wrote:
> We started having high load again today a few times today, each time looking
> at SPM monitor our filter cache starts having high lookups and low hit rate,
> this is our filtercache setting:
>
> <filterCache class="solr.FastLRUCache"
>                   size="1024"
>                   initialSize="512"
>                   autowarmCount="512"/>
>
> would it be possibly too slow?
>
> here is the graph when it happens:
>
> <http://lucene.472066.n3.nabble.com/file/n4022464/filter_cache.png>

Based on my own experiences, I would bet that you have some very very 
complex or large filter queries, and when they are executed against your 
brand new index searcher by your autowarmCount of 512, they go very slowly.

I started out with an autowarm count of 32 against a 256-entry
filterCache.  I ultimately had to drop my autowarm count to FOUR in
order to make sure that it took less than 30 seconds for autowarm to
complete.  I was initially seeing autowarm times measured in minutes.

If this is what is happening to you, the advice is simple: Either 
drastically lower your filterCache autowarmCount, or fix your app so 
that your filter queries are smaller and less complex.
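
As a sketch of the first option, here is the filterCache definition quoted
above with only the autowarmCount lowered.  The value 4 is just the number
reported for the index described in this message; the right value for
another index is an assumption to check against the cache's warmupTime
in the admin stats.

```xml
<!-- Same cache as in the question; only autowarmCount is reduced (512 to 4) -->
<filterCache class="solr.FastLRUCache"
             size="1024"
             initialSize="512"
             autowarmCount="4"/>
```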

Thanks,
Shawn

Re: High Slave CPU Intermittently After Replication

Posted by richardg <ri...@dvdempire.com>.

We started having high load again a few times today.  Each time, looking
at the SPM monitor, our filter cache starts showing high lookups and a low
hit rate.  This is our filterCache setting:

<filterCache class="solr.FastLRUCache"
                 size="1024"
                 initialSize="512"
                 autowarmCount="512"/>

would it be possibly too slow?

here is the graph when it happens:

<http://lucene.472066.n3.nabble.com/file/n4022464/filter_cache.png> 



--
View this message in context: http://lucene.472066.n3.nabble.com/High-Slave-CPU-Intermittently-After-Replication-tp4020520p4022464.html

Re: High Slave CPU Intermittently After Replication

Posted by richardg <ri...@dvdempire.com>.

After making our caches a little bigger and doing some autowarming, things
seem to be a lot better.  I'm glad it isn't a hardware issue.  We are using
faceting, so I will see about making our caches more efficient.

Thanks so much everyone for your help



--
View this message in context: http://lucene.472066.n3.nabble.com/High-Slave-CPU-Intermittently-After-Replication-tp4020520p4021603.html

Re: High Slave CPU Intermittently After Replication

Posted by Otis Gospodnetic <ot...@gmail.com>.

Hi,

Thanks for the charts, very helpful - they show disk IO is not a problem,
JVM/GC is not a problem, index seems very optimized, the CPU is not a
problem, etc.
I see a big load spike, but I don't see waits (see CPU chart) and I don't
see major disk activity right when load spikes, so there is no disk
reading.  I suspect what is/was happening was that a cold searcher was
being exposed to queries, so search requests piled up a little, causing
high load.  That's probably why you also see the Query Component graph going up
- requests are probably sitting in the Query Component, waiting for
Solr/Lucene to do their thing after a new searcher is opened.  Are you using
sorting or facets?  If so, make sure your warmup queries use them, too.
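
A minimal sketch of what such warmup queries could look like in the <query>
section of solrconfig.xml.  The field names (price, category) are
hypothetical placeholders, not taken from this thread; substitute the sort
and facet fields the application really uses.

```xml
<query>
  <!-- Run these queries against each new searcher (e.g. after replication)
       before it is registered to serve requests -->
  <listener event="newSearcher" class="solr.QuerySenderListener">
    <arr name="queries">
      <!-- Hypothetical sort field: loads the sort data for "price" -->
      <lst>
        <str name="q">*:*</str>
        <str name="sort">price asc</str>
      </lst>
      <!-- Hypothetical facet field: warms faceting on "category" -->
      <lst>
        <str name="q">*:*</str>
        <str name="facet">true</str>
        <str name="facet.field">category</str>
      </lst>
    </arr>
  </listener>
</query>
```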


Otis
--
Performance Monitoring - http://sematext.com/spm/index.html
Search Analytics - http://sematext.com/search-analytics/index.html




On Tue, Nov 20, 2012 at 11:54 AM, richardg <ri...@dvdempire.com> wrote:

> <http://lucene.472066.n3.nabble.com/file/n4021363/CPU.png>
> <http://lucene.472066.n3.nabble.com/file/n4021363/LOAD_SWAP.png>
> <http://lucene.472066.n3.nabble.com/file/n4021363/DISK.png>
> <http://lucene.472066.n3.nabble.com/file/n4021363/JVM.png>
> <http://lucene.472066.n3.nabble.com/file/n4021363/GC.png>
> <http://lucene.472066.n3.nabble.com/file/n4021363/Index.png>
> <http://lucene.472066.n3.nabble.com/file/n4021363/Latency.png>
> <http://lucene.472066.n3.nabble.com/file/n4021363/SolrComponents.png>
>
> This is from the last time we had issues around 12:00PM yesterday, I have
> since added in some cache warming and haven't had the issue since but have
> gone some time before w/out any issue before.
>
>
>
> --
> View this message in context:
> http://lucene.472066.n3.nabble.com/High-Slave-CPU-Intermittently-After-Replication-tp4020520p4021363.html
> Sent from the Solr - User mailing list archive at Nabble.com.
>

Re: High Slave CPU Intermittently After Replication

Posted by richardg <ri...@dvdempire.com>.

<http://lucene.472066.n3.nabble.com/file/n4021363/CPU.png> 
<http://lucene.472066.n3.nabble.com/file/n4021363/LOAD_SWAP.png> 
<http://lucene.472066.n3.nabble.com/file/n4021363/DISK.png> 
<http://lucene.472066.n3.nabble.com/file/n4021363/JVM.png> 
<http://lucene.472066.n3.nabble.com/file/n4021363/GC.png> 
<http://lucene.472066.n3.nabble.com/file/n4021363/Index.png> 
<http://lucene.472066.n3.nabble.com/file/n4021363/Latency.png> 
<http://lucene.472066.n3.nabble.com/file/n4021363/SolrComponents.png> 

This is from the last time we had issues, around 12:00 PM yesterday.  I have
since added in some cache warming and haven't had the issue since, but we
have gone some time w/out any issue before.



--
View this message in context: http://lucene.472066.n3.nabble.com/High-Slave-CPU-Intermittently-After-Replication-tp4020520p4021363.html

Re: High Slave CPU Intermittently After Replication

Posted by Otis Gospodnetic <ot...@gmail.com>.

I'd love to figure this one out.  Can you get your system to run as it did
before and reproduce the "maxes out" situation after replication?
Can you share some SPM screenshots, starting with system ones (CPU, load,
disk IO, and swap), then JVM/GC ones, and then some Solr-specific ones,
like the index one, query latency, and Solr Components?  Basically, look for
any unusual spikes or dips around replication time, which is when I think you
said you saw issues.

Otis
--
Performance Monitoring - http://sematext.com/spm/index.html
Search Analytics - http://sematext.com/search-analytics/index.html




On Fri, Nov 16, 2012 at 10:23 AM, richardg <ri...@dvdempire.com> wrote:

> We tried using MergeFactor setting but out CPU Load/Slow Query time issues
> were more widespread, optimizing the index always alleviated the issue that
> is why we are using it now.  Our index is 2 GB when optimized and would
> balloon to over 4 GB so we thought the issue was it was getting too big.
>
> I notice a small spike in CPU load after every replication but then after a
> couple of seconds load returns to normal (which is less that 25%) it is
> just
> sometimes (once in the last week) that it would spike and stay high (10
> minutes) until I optimized the index.  Before I would optimize the index
> after every commit the issue would occur more often.
>
> We would like to not optimize and use the built in Merging but we had
> before
> and the issue would occur more often.  We were thinking of trying a
> mergefactor of 2 again but I'm afraid this issues will return.
>
> I installed SPM and am monitoring it to see if it tells me anything, I can
> post the results on Monday and hopefully it will tell us something.
>
> At this time we aren't warming and caches, we weren't sure if this was an
> issue because our slowdowns weren't happening every time. Also, we are
> using
> the join functionality of Solr 4 if that helps.
>
> Thanks for your help
>
>
>
> --
> View this message in context:
> http://lucene.472066.n3.nabble.com/High-Slave-CPU-Intermittently-After-Replication-tp4020520p4020743.html
> Sent from the Solr - User mailing list archive at Nabble.com.
>

Re: High Slave CPU Intermittently After Replication

Posted by richardg <ri...@dvdempire.com>.

Each of our 3 slaves is a machine w/ two 3 GHz dual-core CPUs and 8 GB RAM.

This is our JVM memory setup:

JAVA_OPTS="-Xms2048m -Xmx4096m -XX:PermSize=64M -XX:MaxPermSize=256M"

Thanks



--
View this message in context: http://lucene.472066.n3.nabble.com/High-Slave-CPU-Intermittently-After-Replication-tp4020520p4021049.html

Re: High Slave CPU Intermittently After Replication

Posted by Erick Erickson <er...@gmail.com>.

That's very strange. How much memory are you giving the JVM? And how much
memory is on your machine?

If your index is cutting in half on optimize, then it sounds like you're
re-indexing everything. Optimize will squeeze out all the data left around
by document deletes or updates, so the only reason I can imagine that your
index drops by 50% is if you've replaced every document that was there
originally. And I'd also guess that you don't have enough activity to
trigger merges often enough to squeeze out the deleted documents' data.

But this sounds ever so much like you're running with not much memory and
are getting into heavy swapping or something like that as your index crosses
some threshold.

But that's just a guess.

Best
Erick

On Fri, Nov 16, 2012 at 10:23 AM, richardg <ri...@dvdempire.com> wrote:

> We tried using MergeFactor setting but out CPU Load/Slow Query time issues
> were more widespread, optimizing the index always alleviated the issue that
> is why we are using it now.  Our index is 2 GB when optimized and would
> balloon to over 4 GB so we thought the issue was it was getting too big.
>
> I notice a small spike in CPU load after every replication but then after a
> couple of seconds load returns to normal (which is less that 25%) it is
> just
> sometimes (once in the last week) that it would spike and stay high (10
> minutes) until I optimized the index.  Before I would optimize the index
> after every commit the issue would occur more often.
>
> We would like to not optimize and use the built in Merging but we had
> before
> and the issue would occur more often.  We were thinking of trying a
> mergefactor of 2 again but I'm afraid this issues will return.
>
> I installed SPM and am monitoring it to see if it tells me anything, I can
> post the results on Monday and hopefully it will tell us something.
>
> At this time we aren't warming and caches, we weren't sure if this was an
> issue because our slowdowns weren't happening every time. Also, we are
> using
> the join functionality of Solr 4 if that helps.
>
> Thanks for your help
>
>
>
> --
> View this message in context:
> http://lucene.472066.n3.nabble.com/High-Slave-CPU-Intermittently-After-Replication-tp4020520p4020743.html
> Sent from the Solr - User mailing list archive at Nabble.com.
>

Re: High Slave CPU Intermittently After Replication

Posted by richardg <ri...@dvdempire.com>.

We tried using the mergeFactor setting, but our CPU load/slow query time issues
were more widespread.  Optimizing the index always alleviated the issue, which
is why we are using it now.  Our index is 2 GB when optimized and would
balloon to over 4 GB, so we thought the issue was that it was getting too big.

I notice a small spike in CPU load after every replication, but then after a
couple of seconds load returns to normal (which is less than 25%).  It is just
sometimes (once in the last week) that it would spike and stay high (10
minutes) until I optimized the index.  Before I started optimizing the index
after every commit, the issue would occur more often.

We would like to not optimize and use the built-in merging, but we did that
before and the issue would occur more often.  We were thinking of trying a
mergeFactor of 2 again, but I'm afraid the issues will return.

I installed SPM and am monitoring it to see if it tells me anything, I can
post the results on Monday and hopefully it will tell us something.

At this time we aren't warming any caches; we weren't sure if this was an
issue because our slowdowns weren't happening every time.  Also, we are using
the join functionality of Solr 4, if that helps.

Thanks for your help 



--
View this message in context: http://lucene.472066.n3.nabble.com/High-Slave-CPU-Intermittently-After-Replication-tp4020520p4020743.html

Re: High Slave CPU Intermittently After Replication

Posted by Upayavira <uv...@odoko.co.uk>.

One question is, why optimise? The newer TieredMergePolicy, as I
understand it, takes away much of the need for optimising an index.
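
For reference, a sketch of how TieredMergePolicy is typically declared in
the <indexConfig> section of a Solr 4.x solrconfig.xml.  The two values
shown are the usual defaults and are illustrative only, not tuning advice
for this particular index.

```xml
<indexConfig>
  <mergePolicy class="org.apache.lucene.index.TieredMergePolicy">
    <!-- Maximum number of segments merged in one normal merge -->
    <int name="maxMergeAtOnce">10</int>
    <!-- Segments allowed per tier before a merge is triggered -->
    <int name="segmentsPerTier">10</int>
  </mergePolicy>
</indexConfig>
```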

As to maxing out: after a replication, your caches need warming. Watch how
often you replicate, and check on the admin UI how long it takes to warm
caches. You may be maxing out memory by having multiple warming
searchers.
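
The number of searchers allowed to warm at the same time is capped by
maxWarmingSearchers in solrconfig.xml.  A sketch; 2 is the value shipped in
the example configuration, not a recommendation for this setup:

```xml
<query>
  <!-- Upper bound on searchers warming concurrently; a commit that would
       exceed it is rejected with an error instead of opening another one -->
  <maxWarmingSearchers>2</maxWarmingSearchers>
</query>
```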

Upayavira

On Thu, Nov 15, 2012, at 03:43 PM, richardg wrote:
> Here is our setup:
> 
> Solr 4.0
> Master replicates to three slaves after optimize
> 
> We have a problem were every so often after replication the CPU load on
> the
> Slave servers maxes out and request come to a crawl.  
> 
> We do a dataimport every 10 minutes and depending on the number of
> updates
> since the last optimize we run an update command with either
> optimize=true&maxsegements=4 or just optimize=true (more than 1500
> updates
> since last full optimize).   We had this issue more often until we put
> the
> optimize updates statements into our process.
> 
> Everything had been running great for a week or so until today after
> replication everything maxed out on all three slaves, it isn't that
> things
> get progressively worse, right after the replication the issue occurs. 
> The
> only way to recover from it is to do an optimize=true update and once it
> replicates out things return to normal where there isn't much load on the
> slaves at all.
> 
> There isn't anyway to predict this issue and so far I haven't seen
> anything
> in the logs that would offer any clues.  Any ideas?
> 
> 
> 
> 
> 
> --
> View this message in context:
> http://lucene.472066.n3.nabble.com/High-Slave-CPU-Intermittently-After-Replication-tp4020520.html
> Sent from the Solr - User mailing list archive at Nabble.com.

Re: High Slave CPU Intermittently After Replication

Posted by Otis Gospodnetic <ot...@gmail.com>.

Hi,

When you say it maxes out, what exactly does that mean?  What is your
monitoring tool saying? If you don't have one, see SPM for Solr. It would
be helpful to see performance before, during, and after replication without
optimize.  Disk IO, CPU, load, warmup times, and cache situation would all
tell us a lot.  I bet we can avoid optimization.

Otis
--
Performance Monitoring - http://sematext.com/spm
On Nov 15, 2012 10:44 AM, "richardg" <ri...@dvdempire.com> wrote:

> Here is our setup:
>
> Solr 4.0
> Master replicates to three slaves after optimize
>
> We have a problem were every so often after replication the CPU load on the
> Slave servers maxes out and request come to a crawl.
>
> We do a dataimport every 10 minutes and depending on the number of updates
> since the last optimize we run an update command with either
> optimize=true&maxsegements=4 or just optimize=true (more than 1500 updates
> since last full optimize).   We had this issue more often until we put the
> optimize updates statements into our process.
>
> Everything had been running great for a week or so until today after
> replication everything maxed out on all three slaves, it isn't that things
> get progressively worse, right after the replication the issue occurs.  The
> only way to recover from it is to do an optimize=true update and once it
> replicates out things return to normal where there isn't much load on the
> slaves at all.
>
> There isn't anyway to predict this issue and so far I haven't seen anything
> in the logs that would offer any clues.  Any ideas?
>
>
>
>
>
> --
> View this message in context:
> http://lucene.472066.n3.nabble.com/High-Slave-CPU-Intermittently-After-Replication-tp4020520.html
> Sent from the Solr - User mailing list archive at Nabble.com.
>