You are viewing a plain text version of this content. The canonical link for it is here.
Posted to solr-user@lucene.apache.org by Ryan Gardner <ry...@gmail.com> on 2012/12/01 23:05:09 UTC

A (seemingly) unavoidable bump in qtimes shortly after replication ends

I'm working on performance tuning / testing a solr 4 cluster - the slaves
are handling all the queries and they are currently replicating from the
master every minute.

Right now for my load test, I'm playing back random queries form a set of
about 5M queries harvested from our existing production servers. Whenever
the replication ends and the new searcher gets used there is a jump in
QTimes.

The QTimes jump briefly but ONLY after replication. I tried adding in some
warming queries to create facets etc - and I analyzed the queries and they
all seem relatively similar (use a large number of facets etc) - but even
when I have converted those exact queries over to the format that is in the
logs, it doesn't seem to eliminate the "post-repl bump" of qtimes jumping
from single digits to 500 - 800ms.

Here's an example of what i'm talking about.

-- replication ends (message about deleting temporary index in the logs)
date  |  qtime
21:50:45.159 | 0 |
21:50:45.170 | 1 |
21:50:45.174 | 0 |
21:50:45.182 | 2 |
21:50:45.185 | 78 |
21:50:45.194 | 1 |
21:50:45.201 | 69 |
21:50:45.206 | 0 |
21:50:45.211 | 0 |
21:50:45.219 | 0 |
21:50:45.286 | 1 |
21:50:45.288 | 0 |
21:50:45.301 | 30 |
21:50:45.317 | 15 |
21:50:45.327 | 2 |
21:50:45.327 | 1 |
21:50:45.334 | 0 |
21:50:45.337 | 1 |
21:50:45.345 | 1 |
21:50:45.347 | 2 |
21:50:45.392 | 19 |
21:50:45.415 | 47 |
21:50:45.428 | 1 |
21:50:45.438 | 2 |
21:50:45.453 | 1 |
21:50:45.468 | 3 |
21:50:45.507 | 4 |
21:50:45.551 | 1 |
21:50:45.617 | 92 |
21:50:45.617 | 251 |
21:50:45.619 | 457 |
21:50:45.632 | 1 |
21:50:45.731 | 500 |
21:50:45.731 | 437 |
21:50:45.731 | 526 |
21:50:45.731 | 514 |
21:50:45.731 | 354 |
21:50:45.731 | 531 |
21:50:45.731 | 525 |
21:50:45.731 | 525 |
21:50:45.732 | 502 |
21:50:45.732 | 452 |
21:50:45.732 | 278 |
21:50:45.732 | 527 |
21:50:45.732 | 270 |
21:50:45.732 | 576 |
21:50:45.733 | 221 |
21:50:45.733 | 225 |
21:50:45.734 | 265 |
21:50:45.735 | 370 |
21:50:45.737 | 551 |
21:50:45.737 | 517 |
21:50:45.738 | 440 |
21:50:45.738 | 477 |
21:50:45.738 | 43 |
21:50:45.738 | 299 |
21:50:45.739 | 541 |
21:50:45.825 | 1 |
21:50:45.838 | 5 |
21:50:45.848 | 0 |
21:50:45.852 | 0 |
21:50:45.859 | 19 |
21:50:45.875 | 6 |
21:50:45.876 | 7 |
21:50:45.881 | 0 |
21:50:45.883 | 14 |
21:50:45.886 | 1 |
21:50:45.890 | 4 |
21:50:45.891 | 3 |
21:50:45.894 | 1 |
21:50:45.902 | 1 |
21:50:45.906 | 4 |
21:50:45.908 | 3 |
21:50:45.918 | 18 |
21:50:45.921 | 10 |

The DocumentCache in our case is not very useful because of our 1 minute
replication pattern. I have it sized to 1024 elements now. When I've tried
to increase the size, it caused our GC pause times to skyrocket. (Currently
it's tuned so it has a 250ms GC pause roughly every 16 seconds - and I've
verified that the above QTime bump is not due to GC activity)

Is there something I can do to help with these? (short of increasing the
replication time to a longer interval to mitigate the impact of these
bumps)?

I guess the real question I have is "Why do queries get faster after a
second or so after replication? How can I try to get it to do that as part
of the newSearcher warming??" - Like I said, I've copied some of those slow
queries and put them into the newSearcher warming section to see if "well,
maybe running through a few dozen of these searches that's what makes it
get faster" - but that hasn't helped.

Our index file is stored on disk but the OS basically has it all cached in
RAM (I tested moving it to tmpfs but saw no improvement in speed - so I
went back to putting it on disk) - and the CPU is not anywhere near taxed
(it's got 24 cores on this machine)

So far the performance of SOLR has been stellar - and once we finish tuning
this we'll write up how we've tuned it / how we're using it to share out
more widely with anyone who cares - but the one perplexing thing is this
bump in bad times after replication.

Any thoughts would be appreciated.

Ryan

Re: A (seemingly) unavoidable bump in qtimes shortly after replication ends

Posted by Otis Gospodnetic <ot...@gmail.com>.
Hi,

A bit hard to tell without seeing all system and Solr metrics, but the
first thing I'd look at is disk IO.  Yes, you may warm up the index, but
you still have some new data on disk after replication, so if a user query
hits data in newly copied blocks, it will *have to* be read from disk.  See
SPM for Solr - it monitors this sort of stuff.

Otis
--
SOLR Performance Monitoring - http://sematext.com/spm/index.html
Search Analytics - http://sematext.com/search-analytics/index.html




On Sat, Dec 1, 2012 at 5:05 PM, Ryan Gardner <ry...@gmail.com> wrote:

> I'm working on performance tuning / testing a solr 4 cluster - the slaves
> are handling all the queries and they are currently replicating from the
> master every minute.
>
> Right now for my load test, I'm playing back random queries form a set of
> about 5M queries harvested from our existing production servers. Whenever
> the replication ends and the new searcher gets used there is a jump in
> QTimes.
>
> The QTimes jump briefly but ONLY after replication. I tried adding in some
> warming queries to create facets etc - and I analyzed the queries and they
> all seem relatively similar (use a large number of facets etc) - but even
> when I have converted those exact queries over to the format that is in the
> logs, it doesn't seem to eliminate the "post-repl bump" of qtimes jumping
> from single digits to 500 - 800ms.
>
> Here's an example of what i'm talking about.
>
> -- replication ends (message about deleting temporary index in the logs)
> date  |  qtime
> 21:50:45.159 | 0 |
> 21:50:45.170 | 1 |
> 21:50:45.174 | 0 |
> 21:50:45.182 | 2 |
> 21:50:45.185 | 78 |
> 21:50:45.194 | 1 |
> 21:50:45.201 | 69 |
> 21:50:45.206 | 0 |
> 21:50:45.211 | 0 |
> 21:50:45.219 | 0 |
> 21:50:45.286 | 1 |
> 21:50:45.288 | 0 |
> 21:50:45.301 | 30 |
> 21:50:45.317 | 15 |
> 21:50:45.327 | 2 |
> 21:50:45.327 | 1 |
> 21:50:45.334 | 0 |
> 21:50:45.337 | 1 |
> 21:50:45.345 | 1 |
> 21:50:45.347 | 2 |
> 21:50:45.392 | 19 |
> 21:50:45.415 | 47 |
> 21:50:45.428 | 1 |
> 21:50:45.438 | 2 |
> 21:50:45.453 | 1 |
> 21:50:45.468 | 3 |
> 21:50:45.507 | 4 |
> 21:50:45.551 | 1 |
> 21:50:45.617 | 92 |
> 21:50:45.617 | 251 |
> 21:50:45.619 | 457 |
> 21:50:45.632 | 1 |
> 21:50:45.731 | 500 |
> 21:50:45.731 | 437 |
> 21:50:45.731 | 526 |
> 21:50:45.731 | 514 |
> 21:50:45.731 | 354 |
> 21:50:45.731 | 531 |
> 21:50:45.731 | 525 |
> 21:50:45.731 | 525 |
> 21:50:45.732 | 502 |
> 21:50:45.732 | 452 |
> 21:50:45.732 | 278 |
> 21:50:45.732 | 527 |
> 21:50:45.732 | 270 |
> 21:50:45.732 | 576 |
> 21:50:45.733 | 221 |
> 21:50:45.733 | 225 |
> 21:50:45.734 | 265 |
> 21:50:45.735 | 370 |
> 21:50:45.737 | 551 |
> 21:50:45.737 | 517 |
> 21:50:45.738 | 440 |
> 21:50:45.738 | 477 |
> 21:50:45.738 | 43 |
> 21:50:45.738 | 299 |
> 21:50:45.739 | 541 |
> 21:50:45.825 | 1 |
> 21:50:45.838 | 5 |
> 21:50:45.848 | 0 |
> 21:50:45.852 | 0 |
> 21:50:45.859 | 19 |
> 21:50:45.875 | 6 |
> 21:50:45.876 | 7 |
> 21:50:45.881 | 0 |
> 21:50:45.883 | 14 |
> 21:50:45.886 | 1 |
> 21:50:45.890 | 4 |
> 21:50:45.891 | 3 |
> 21:50:45.894 | 1 |
> 21:50:45.902 | 1 |
> 21:50:45.906 | 4 |
> 21:50:45.908 | 3 |
> 21:50:45.918 | 18 |
> 21:50:45.921 | 10 |
>
> The DocumentCache in our case is not very useful because of our 1 minute
> replication pattern. I have it sized to 1024 elements now. When I've tried
> to increase the size, it caused our GC pause times to skyrocket. (Currently
> it's tuned so it has a 250ms GC pause roughly every 16 seconds - and I've
> verified that the above QTime bump is not due to GC activity)
>
> Is there something I can do to help with these? (short of increasing the
> replication time to a longer interval to mitigate the impact of these
> bumps)?
>
> I guess the real question I have is "Why do queries get faster after a
> second or so after replication? How can I try to get it to do that as part
> of the newSearcher warming??" - Like I said, I've copied some of those slow
> queries and put them into the newSearcher warming section to see if "well,
> maybe running through a few dozen of these searches that's what makes it
> get faster" - but that hasn't helped.
>
> Our index file is stored on disk but the OS basically has it all cached in
> RAM (I tested moving it to tmpfs but saw no improvement in speed - so I
> went back to putting it on disk) - and the CPU is not anywhere near taxed
> (it's got 24 cores on this machine)
>
> So far the performance of SOLR has been stellar - and once we finish tuning
> this we'll write up how we've tuned it / how we're using it to share out
> more widely with anyone who cares - but the one perplexing thing is this
> bump in bad times after replication.
>
> Any thoughts would be appreciated.
>
> Ryan
>