Posted to solr-user@lucene.apache.org by Allan Carroll <al...@hangtime.com> on 2014/02/18 02:12:10 UTC

Slow 95th-percentile

Hi all,

I'm having trouble getting consistent performance out of my Solr setup. Average select latency is great, but the 95th percentile is dismal (10x the average). It's probably something slightly misconfigured. I've seen it hold nice, low-variance latencies for a few hours here and there, but can't figure out what's different during those times.
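As a quick illustration of how a latency tail like this hides behind a healthy average (synthetic numbers, not from this setup): if 95% of requests take 20ms and the other 5% take 250ms, the mean stays low while the 95th percentile blows up.

```python
import statistics

# Synthetic latency sample (ms): 95 fast requests, 5 slow outliers
latencies = [20] * 95 + [250] * 5

avg = statistics.mean(latencies)                  # 31.5 ms
p95 = statistics.quantiles(latencies, n=100)[94]  # 95th-percentile cut point

print(f"avg={avg}ms p95={p95}ms")
```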


* I’m running 4.1.0 using SolrCloud. 3 replicas of 1 shard on 3 EC2 boxes (8proc, 30GB RAM, SSDs). Load peaks around 30 selects per second and about 150 updates per second. 

* The index has about 11GB of data in 14M docs, plus another 10MB of data in 3K docs. It stays around 30 segments.

* Soft commits after 10 seconds, hard commits after 120 seconds. Though, turning off the update traffic doesn't seem to have any effect on the select latencies.

* I think GC latency is low. Running 3GB heaps with 1G new size. GC time is around 3ms per second.
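
The commit cadence described above would normally come from the updateHandler section of solrconfig.xml; a minimal sketch using the intervals from this post (the openSearcher=false pairing on the hard commit is the usual recommendation, not necessarily what this setup had):

```xml
<updateHandler class="solr.DirectUpdateHandler2">
  <!-- Hard commit: flush the transaction log to disk every 120s,
       without opening a new (expensive) searcher -->
  <autoCommit>
    <maxTime>120000</maxTime>
    <openSearcher>false</openSearcher>
  </autoCommit>
  <!-- Soft commit: make new documents searchable every 10s -->
  <autoSoftCommit>
    <maxTime>10000</maxTime>
  </autoSoftCommit>
</updateHandler>
```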
 

Here’s a typical select query:

fl=*,sortScore:textScore&sort=textScore desc&start=0&q=text:(("soccer" OR "MLS" OR "premier league" OR "FIFA" OR "world cup") OR ("sorority" OR "fraternity" OR "greek life" OR "dorm" OR "campus"))&wt=json&fq=startTime:[1392656400000 TO 1392717540000]&fq={!frange l=2 u=3}timeflag(startTime)&fq={!frange l=1392656400000 u=1392695940000 cache=false}timefix(startTime,-21600000)&fq=privacy:OPEN&defType=edismax&rows=131


Anyone have any suggestions on where to look next? Or, if you know someone in the bay area that would consult for an hour or two and help me track it down, that’d be great too.

Thanks!

-Allan

Re: Slow 95th-percentile

Posted by Allan Carroll <al...@hangtime.com>.
Thanks, Chris. Adding autowarming to the filter cache made another big improvement.

Between increasing the soft commit interval to 60s, fixing the q=* query, and autowarming the filter caches, my 95th-percentile latencies are down to a very acceptable range — almost an order of magnitude improvement. :-)
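
(For anyone finding this thread later: that filter-cache autowarming lives in solrconfig.xml. A minimal sketch, with illustrative sizes rather than this setup's actual values:)

```xml
<!-- autowarmCount replays the N most recently used filter queries
     against each newly opened searcher -->
<filterCache class="solr.FastLRUCache"
             size="512"
             initialSize="512"
             autowarmCount="64"/>
```

With frequent soft commits, keep size and autowarmCount modest, as Shawn notes later in the thread: the warming work is repeated on every new searcher.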

-Allan

On February 18, 2014 at 5:32:51 PM, Chris Hostetter (hossman_lucene@fucit.org) wrote:


: Slowing the soft commits to every 100 seconds helped. The main culprit  
: was a bad query that was coming through every few seconds. Something  
: about the empty fq param and the q=* slowed everything else down.  
:  
: INFO: [event] webapp=/solr path=/select  
: params={start=0&q=*&wt=javabin&fq=&fq=startTime:1392836400003&version=2}  
: hits=1894 status=0 QTime=6943  

1) if you are using Solr 4.1 or earlier, then q=* is an expensive &  
useless query that doesn't mean what you think it does...  

https://issues.apache.org/jira/browse/SOLR-2996  

2) an empty "fq" doesn't cost anything -- if you use debugQuery=true you  
should see that it's not even included in "parsed_filter_queries" because  
it's totally ignored.  

3) if that "startTime" value changes at some fixed and regular
interval, that could explain some anomalies if it's normally the
same and cached, but changes once a day/hour/minute or whatever and is a
bit slow to cache.


bottom line: a softCommit is going to re-open a searcher, which is going
to wipe your caches. If you don't have any (auto)warming configured, that
means any "fq"s or "q"s that you run regularly are going to pay the
price of being "slow" the first time they are run after a new searcher
is opened.

If your priority is low response time, you really want to open new  
searchers as infrequently as your SLA for visibility allows, and use  
(auto)warming for those common queries.  



-Hoss  
http://www.lucidworks.com/  
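
The explicit (auto)warming Hoss describes can also be configured with a newSearcher listener in solrconfig.xml; a minimal sketch (the query values below are placeholders, not taken from this thread):

```xml
<listener event="newSearcher" class="solr.QuerySenderListener">
  <arr name="queries">
    <!-- Replay a known-common query so the first real request after
         a (soft) commit doesn't pay the full warm-up cost -->
    <lst>
      <str name="q">text:soccer</str>
      <str name="fq">privacy:OPEN</str>
    </lst>
  </arr>
</listener>
```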


Re: Slow 95th-percentile

Posted by Allan Carroll <al...@hangtime.com>.
Slowing the soft commits to every 100 seconds helped. The main culprit was a bad query that was coming through every few seconds. Something about the empty fq param and the q=* slowed everything else down.

INFO: [event] webapp=/solr path=/select params={start=0&q=*&wt=javabin&fq=&fq=startTime:1392836400003&version=2} hits=1894 status=0 QTime=6943

Thanks for all your help. 

-Allan

On February 18, 2014 at 12:24:37 PM, Shawn Heisey (solr@elyograg.org) wrote:

On 2/18/2014 11:51 AM, Allan Carroll wrote:  
> I was thinking GC too, but it doesn't feel like it is. Running jstat -gcutil only shows a 10-50ms ParNew collection every 10 or 15 seconds and almost no full CMS collections. Any other places to look for GC activity that I might be missing?
>
> I did a little investigation this morning and found that if I run a query once a second, every 10th query is slow. Looks suspiciously like the soft commits are causing the slowdowns. I could space them further apart. Anything else I can look at to make those commits less costly?

It does indeed sound like the 10 second soft commit is the problem. The  
"opening a new searcher" part of a commit tends to be fairly expensive.  
The impact is even greater when combined with flushing data to disk,  
which is why soft commits can be faster than hard commits ... but  
building a new searcher is not cheap even then.  

Do you have autoCommit configured, with openSearcher=false? If not, you  
should.  

If you are using Solr caches, reducing (or eliminating) the  
autowarmCount values on each cache (particularly the filterCache) can  
make commits happen quite a lot faster. With a commit potentially  
happening every ten seconds, you might want to configure those caches so  
they are pretty small. Frequent commits mean that the caches are  
frequently invalidated. If commit frequency is high and autowarmCount  
values are low, a large cache is just a waste of memory. The cache  
config was the main thing I was interested in seeing when I asked for  
solrconfig.xml.  

You have a lot of GC tuning going on, which is good - untuned GC and  
Solr do NOT get along. I'll just show you what I use and let you make  
your own decision.  

http://wiki.apache.org/solr/ShawnHeisey#GC_Tuning  

Thanks,  
Shawn  


Re: Slow 95th-percentile

Posted by Allan Carroll <al...@hangtime.com>.
Thanks for the suggestions.

I was thinking GC too, but it doesn't feel like it is. Running jstat -gcutil only shows a 10-50ms ParNew collection every 10 or 15 seconds and almost no full CMS collections. Any other places to look for GC activity that I might be missing?

I did a little investigation this morning and found that if I run a query once a second, every 10th query is slow. Looks suspiciously like the soft commits are causing the slowdowns. I could space them further apart. Anything else I can look at to make those commits less costly?
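That every-10th-query pattern can be checked mechanically from a trace of per-second latencies; a sketch (hypothetical helper, synthetic data) that looks for a regular spacing between latency spikes:

```python
import statistics

def find_slow_period(latencies_ms, threshold_factor=5.0):
    """Return the regular spacing (in samples) between slow outliers,
    or None if the slow samples aren't evenly spaced."""
    median = statistics.median(latencies_ms)
    slow = [i for i, ms in enumerate(latencies_ms)
            if ms > threshold_factor * median]
    if len(slow) < 2:
        return None
    gaps = {b - a for a, b in zip(slow, slow[1:])}
    return gaps.pop() if len(gaps) == 1 else None

# Synthetic trace: one sample per second, every 10th query slow,
# mimicking a searcher reopen on each 10-second soft commit
trace = [30 if i % 10 else 400 for i in range(60)]
print(find_slow_period(trace))  # → 10
```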



Here are the java options:

-server -XX:+AggressiveOpts -XX:+UseCompressedOops -Xmx3G -Xms3G -Xss256k -XX:MaxPermSize=128m -XX:PermSize=96m -XX:NewSize=1024m -XX:MaxNewSize=1024m -XX:MaxTenuringThreshold=1 -XX:SurvivorRatio=6 -XX:+UseParNewGC -XX:+UseConcMarkSweepGC -XX:+CMSParallelRemarkEnabled -XX:CMSInitiatingOccupancyFraction=75 -XX:+UseCMSInitiatingOccupancyOnly -Xloggc:/var/log/tomcat7/gc-tomcat.log -verbose:gc -XX:GCLogFileSize=10M -XX:+UseGCLogFileRotation -XX:NumberOfGCLogFiles=10 -XX:+PrintGCDetails -XX:+PrintHeapAtGC -XX:+PrintGCTimeStamps -XX:+PrintClassHistogram -XX:+PrintTenuringDistribution -XX:-PrintGCApplicationStoppedTime -DzkHost=xx.xx.xx.xx:2181,xx.xx.xx.xx:2181,xx.xx.xx.xx:2181/solr -Dcom.sun.management.jmxremote -Djava.util.logging.manager=org.apache.juli.ClassLoaderLogManager -Djava.endorsed.dirs=/usr/share/tomcat7/endorsed

I'm using Tomcat, though I've heard that Jetty can be a better choice.

I’ve also attached my solrconfig.

-Allan

On February 17, 2014 at 6:06:03 PM, Shawn Heisey (solr@elyograg.org) wrote:

On 2/17/2014 6:12 PM, Allan Carroll wrote:  
> I'm having trouble getting my Solr setup to get consistent performance. Average select latency is great, but 95% is dismal (10x average). It's probably something slightly misconfigured. I’ve seen it have nice, low variance latencies for a few hours here and there, but can’t figure out what’s different during those times.  
>  
>  
> * I’m running 4.1.0 using SolrCloud. 3 replicas of 1 shard on 3 EC2 boxes (8proc, 30GB RAM, SSDs). Load peaks around 30 selects per second and about 150 updates per second.  
>  
> * The index has about 11GB of data in 14M docs, the other 10MB of data in 3K docs. Stays around 30 segments.  
>  
> * Soft commits after 10 seconds, hard commits after 120 seconds. Though, turning off the update traffic doesn't seem to have any effect on the select latencies.
>  
> * I think GC latency is low. Running 3GB heaps with 1G new size. GC time is around 3ms per second.  
>  
>  
> Here’s a typical select query:  
>  
> fl=*,sortScore:textScore&sort=textScore desc&start=0&q=text:(("soccer" OR "MLS" OR "premier league" OR "FIFA" OR "world cup") OR ("sorority" OR "fraternity" OR "greek life" OR "dorm" OR "campus"))&wt=json&fq=startTime:[1392656400000 TO 1392717540000]&fq={!frange l=2 u=3}timeflag(startTime)&fq={!frange l=1392656400000 u=1392695940000 cache=false}timefix(startTime,-21600000)&fq=privacy:OPEN&defType=edismax&rows=131  

The first thing to say is that it's fairly normal for the 95th and 99th  
percentile values to be quite a lot higher than the median and average  
values. I don't have actual values so I don't know if it's bad or not.  

You're good on the most important performance-related resource, which is  
memory for the OS disk cache. The only thing that stands out as a  
possible problem from what I know so far is garbage collection. It  
might be a case of full garbage collections happening too frequently, or  
it might be a case of garbage collection pauses taking too long. It  
might even be a combination of both.  

To fix frequent full collections, increase the heap size. To fix the  
other problem, use the CMS collector and tune it.  

Two bits of information will help with recommendations: Your java  
startup options, and your solrconfig.xml.  

You're using an option in your query that I've never seen before. I  
don't know if frange is slow or not.  

One last thing that might cause problems is super-frequent commits.  

I could also be completely wrong!  

Thanks,  
Shawn  

