Posted to solr-user@lucene.apache.org by so...@isshomefront.com on 2012/06/29 01:56:25 UTC

Strange "spikes" in query response times...any ideas where else to look?

Greetings all,

We are working on building up a large Solr index for over 300 million  
records...and this is our first look at Solr. We are currently running  
a set of unique search queries against a single server (so no  
replication, no indexing going on at the same time, and no distributed  
search) with a set number of records (in our case, 10 million records  
in the index) for about 30 minutes, with nearly all of our searches  
being unique (I say "nearly" because our set of queries is unique, but  
I have not yet confirmed that JMeter is selecting these queries with  
no replacement).

We are striving for a 2 second response time on average, and
indeed we are pretty darned close. In fact, if you look at the average
response time, we are well under 2 seconds per query.
Unfortunately, about once every 6 minutes or so (it is not a regular
event exactly six minutes apart; it is "about" six minutes but it
fluctuates) we get a single query that returns in something like
15 to 20 seconds.

We have been trying to identify what is causing this "spike" every so  
often and we are completely baffled. What we have done thus far:

1) Looked through the SAR logs and have not seen anything that  
correlates to this issue
2) Tracked the JVM statistics...especially the garbage  
collections...no correlations there either
3) Examined the queries...no pattern obvious there
4) Played with the JVM memory settings (heap settings, cache settings,  
and any other settings we could find)
5) Changed hardware (a brand new 4 processor, 8 gig RAM server with a
fresh install of Redhat 5.7 enterprise; a large instance of
AWS EC2; and a fresh instance of a VMWare based virtual machine
from our own data center), and still nothing is giving us a clue as to
what is causing these "spikes"
6) No correlation found between the number of hits returned and the spikes


Our data is very simple and so are the queries. The schema consists of  
40 fields, most of which are "string" fields, 2 of which are  
"location" fields, and a small handful of which are integer fields.  
All fields are indexed and all fields are stored.

Our queries are also rather simple. Many of the queries are a simple  
one-field search. The most complex query we have is a 3-field search.  
Again, no correlation has been established between the query and these  
spikes. Also, about 60% of our queries return zero hits (on the
assumption that we want to make Solr search its entire index every so
often). 60% is more than we intended and we will fix that soon...but
that is what is currently happening. Again, no correlation found
between spikes and 0-hit queries.

For some time we were testing with 100 million records in the index  
and the aggregate data looked quite good. Most queries were returning  
in under 2 seconds. Unfortunately, it was when we looked at the  
individual data points that we found spikes every 6-8 minutes or so  
hitting sometimes as high as 150 seconds!
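
To illustrate why the aggregate numbers can look fine while individual data points do not, here is a minimal sketch. The sample data is made up for illustration (999 fast queries and one 150-second spike) and is not taken from the test runs described in this thread; the function name summarize is likewise hypothetical.

```python
import statistics

def summarize(latencies_s):
    """Return mean, median, 99th percentile and max of response times (seconds)."""
    ordered = sorted(latencies_s)
    # Nearest-rank percentile: smallest value with at least 99% of samples at or below it.
    rank = max(1, -(-99 * len(ordered) // 100))  # integer ceil(0.99 * n)
    return {
        "mean": statistics.mean(ordered),
        "median": statistics.median(ordered),
        "p99": ordered[rank - 1],
        "max": ordered[-1],
    }

# Made-up sample: 999 queries at 0.5 s and a single 150 s spike.
sample = [0.5] * 999 + [150.0]
stats = summarize(sample)
print(stats)
```

With data shaped like this the mean stays well under 2 seconds and even the 99th percentile looks healthy, so only the maximum (or a plot of every data point) exposes the spike.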

We have been testing with 100 million records in the index, 50 million
records in the index, 25 million, 20 million, 15 million, and 10
million records. As I indicated at the start, we are now at 10
million records with 15-20 second spikes.

As we have decreased the number of records in the index, the size (but
not the frequency) of the spikes has been dropping.

My question is: Is this type of behavior normal for Solr when it is
being overstressed? I've read of lots of people with far more
complicated schemas running MORE than 10 million records in an index
who never once complained about these spikes. Since I am new at this,
I am not sure what Solr's "failure mode" looks like when it has too
many records to search.

I am hoping someone looking at this note can at least give me another  
direction to look. 10 million records searched in less than 2 seconds  
most of the time is great...but those 10 and 20 second spikes are not
going to go over well with our customers...and I somehow think there  
is more we should be able to do here.

Thanks.

Peter S. Lee
ProQuest


Re: Strange "spikes" in query response times...any ideas where else to look?

Posted by so...@isshomefront.com.
Otis,

Thanks for the response. We'll check out that tool and see how it goes.

Regarding JMeter...you are exactly correct in that I was assuming 1  
thread = 1 query per second. I thought we had set up some sort of  
throttling mechanism to ensure that...and clearly I was mistaken. By  
the math we are getting A LOT more qps...and in a preliminary look  
those spikes look like they just might be correlated to high qps. We  
are pursuing this line and my gut tells me this *is* the problem.

Thanks for the info on the tool (which we will look at) and for the  
heads-up on the qps.
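
One way to check the suspected correlation between spikes and high qps is sketched below. It assumes the test results can be exported as (start time in epoch seconds, elapsed seconds) pairs; the sample values and the function name qps_at_spikes are hypothetical and not taken from the actual test.

```python
from collections import Counter

def qps_at_spikes(samples, spike_threshold_s=10.0):
    """samples: iterable of (start_epoch_seconds, elapsed_seconds).

    Returns (second, requests_started_in_that_second, elapsed) for every
    request at or above the spike threshold.
    """
    samples = list(samples)
    # Count how many requests started in each one-second bucket.
    per_second = Counter(int(start) for start, _ in samples)
    return [
        (int(start), per_second[int(start)], elapsed)
        for start, elapsed in samples
        if elapsed >= spike_threshold_s
    ]

# Made-up data: three requests start in second 100, one of them is slow.
samples = [(100.1, 0.2), (100.5, 0.3), (100.9, 15.0), (101.2, 0.2)]
spikes = qps_at_spikes(samples)
print(spikes)
```

Comparing the per-second counts at the spikes against the typical per-second count for the whole run would show whether the slow queries really cluster around bursts of load.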


Peter Lee
ProQuest




Re: Strange "spikes" in query response times...any ideas where else to look?

Posted by Otis Gospodnetic <ot...@yahoo.com>.
Peter,

These could be JVM, or it could be index reopening and warmup queries, or .... 
Grab SPM for Solr - http://sematext.com/spm - in 24-48h we'll release an agent that tracks and graphs errors and timings of each Solr search component, which may reveal interesting stuff.  In the meantime, look at the IO graph as well as the cache graphs.  That's where I'd first look for signs.

Re users/threads question - if I understand correctly, this is the problem: "JMeter is set up to run 15 threads from a single test machine...but I noticed that the JMeter report is showing close to 47 queries per second".  It sounds like you're equating # of threads to QPS, which isn't right.  Imagine you had 10 threads and each query took 0.1 seconds (processed by a single CPU core) and the server had 10 CPU cores.  That would mean that 1 thread could run 10 queries per second utilizing just 1 CPU core. And 10 threads would utilize all 10 CPU cores and would give you 10x higher throughput - 10x10=100 QPS.

So if you need to simulate just 2-5 QPS, just lower the number of threads.  What that number should be depends on query complexity and hw resources (cores or IO).
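
The arithmetic above can be sketched as follows. This is a rough model, assuming each load-test thread sends its next query as soon as the previous one returns (plus an optional pause) and that the server is not saturated; the function names are hypothetical.

```python
def closed_loop_qps(threads, avg_response_s, think_time_s=0.0):
    """Throughput when every thread loops: send query, wait for reply, pause."""
    return threads / (avg_response_s + think_time_s)

def implied_response_time(threads, observed_qps, think_time_s=0.0):
    """Average response time implied by an observed throughput."""
    return threads / observed_qps - think_time_s

def think_time_for_target(threads, target_qps, avg_response_s):
    """Pause each thread needs between queries to hold a target throughput."""
    return max(0.0, threads / target_qps - avg_response_s)

# The example from the message: 10 threads, 0.1 s per query.
print(closed_loop_qps(10, 0.1))

# The numbers reported in the thread: 15 threads, about 47 QPS observed.
avg = implied_response_time(15, 47)
print(round(avg, 3))

# Pause per thread needed to hold 15 threads to 5 QPS at that response time.
print(round(think_time_for_target(15, 5, avg), 2))
```

Under this model, 15 threads producing about 47 QPS implies an average response time of roughly 0.32 seconds, and holding the same 15 threads to 5 QPS would need a pause of roughly 2.7 seconds per thread between queries. JMeter also ships a Constant Throughput Timer that can pace samples toward a target rate, which may be simpler than tuning thread counts by hand.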

Otis
----
Performance Monitoring for Solr / ElasticSearch / HBase - http://sematext.com/spm 




RE: Strange "spikes" in query response times...any ideas where else to look?

Posted by so...@isshomefront.com.
Michael,

Thank you for responding...and for the excellent questions.

1) We have never seen this response time spike with a user-interactive  
search. However, in the span of about 40 minutes, which included about  
82,000 queries, we only saw a handful of near-equally distributed  
"spikes". We have tried sending queries from the admin tool while the  
test was running, but given those odds, I'm not surprised we've never  
"hit" on one of those few spikes we are seeing in the test results.

2) Good point and I should have mentioned this. We are using multiple  
methods to track these response times.
   a) Looking at the catalina.out file and plotting the response times  
recorded there (I think this is logging the QTime as seen by Solr).
   b) Looking at what JMeter is reporting as response times. In  
general, these are very close if not identical to what is being seen  
in the catalina.out file. I have not run a line-by-line comparison,
but putting the query response graphs next to each other shows them to  
be nearly (or possibly exactly) the same. Nothing looked out of the  
ordinary.

3) We are using multiple threads. Before your email I was looking at  
the results, doing some math, and double checking the reports from  
JMeter. I did notice that our throughput is much higher than we meant  
for it to be. JMeter is set up to run 15 threads from a single test  
machine...but I noticed that the JMeter report is showing close to 47  
queries per second. We are only targeting TWO to FIVE queries per
second. Working out how to control this more effectively is next on
our list of things to look at. We do have three separate machines set up
for JMeter testing and we are investigating to see if perhaps all  
three of these machines are inadvertently being launched during the  
test at one time and overwhelming the server. This *might* be one  
facet of the problem. Agreed on that.
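
For item 2a, pulling the slow requests out of the log could look roughly like the sketch below. It assumes the request log lines end in a QTime=<milliseconds> field, as Solr's request logging usually does; the two sample lines are made up, and the exact line layout in a given catalina.out may differ.

```python
import re

# Assumed log field: "QTime=<milliseconds>" somewhere on the request line.
QTIME_RE = re.compile(r"\bQTime=(\d+)\b")

def slow_requests(lines, threshold_ms=10_000):
    """Yield (line_number, qtime_ms, line) for lines at or above the threshold."""
    for number, line in enumerate(lines, start=1):
        match = QTIME_RE.search(line)
        if match and int(match.group(1)) >= threshold_ms:
            yield number, int(match.group(1)), line.rstrip("\n")

# Made-up sample lines standing in for catalina.out.
sample_log = [
    "INFO: [] webapp=/solr path=/select params={q=title:foo} hits=0 status=0 QTime=12\n",
    "INFO: [] webapp=/solr path=/select params={q=author:bar} hits=3 status=0 QTime=15230\n",
]
found = list(slow_requests(sample_log))
print(found)

# Against a real file (path is an assumption):
# with open("catalina.out") as log:
#     for number, qtime_ms, line in slow_requests(log):
#         print(number, qtime_ms, line)
```

Listing the slow lines with their line numbers makes it easy to look at what else was logged immediately before and after each spike.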

Even as we investigate this last item regarding the number of  
users/threads, I wouldn't mind any other thoughts you or anyone else  
had to offer. We are checking on this user/threads issue and for the  
sake of anyone else who finds this discussion useful I'll note what we
find.

Thanks again.

Peter S. Lee
ProQuest





RE: Strange "spikes" in query response times...any ideas where else to look?

Posted by Michael Ryan <mr...@moreover.com>.
A few questions...

1) Do you only see these spikes when running JMeter? I.e., do you ever see a spike when you manually run a query?

2) How are you measuring the response time? In my experience there are three different ways to measure query speed. Usually all of them will be approximately equal, but in some situations they can be quite different, and this difference can be a clue as to where the bottleneck is:
  a) The response time as seen by the end user (in this case, JMeter)
  b) The response time as seen by the container (for example, in Jetty you can get this by enabling logLatency in jetty.xml)
  c) The "QTime" as returned in the Solr response

3) Are you running multiple queries concurrently, or are you just using a single thread in JMeter?
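
To make the comparison in question 2 concrete, here is a minimal sketch (not something from the original test setup) that times one request from the client side and sets it against the "QTime" in the Solr response header. It assumes the query asks for wt=json; the function names split_latency and timed_query are hypothetical, and any URL passed in is a placeholder for your own server.

```python
import json
import time
import urllib.request


def split_latency(client_ms, solr_body):
    """Compare a client-side timing with the QTime Solr reports.

    solr_body is the raw JSON text of a Solr response (wt=json).
    """
    qtime_ms = json.loads(solr_body)["responseHeader"]["QTime"]
    return {
        "client_ms": client_ms,
        "qtime_ms": qtime_ms,
        # Time not accounted for by QTime: container, network,
        # response writing, client-side parsing, and so on.
        "overhead_ms": client_ms - qtime_ms,
    }


def timed_query(url):
    """Run one query and time it from the client's point of view."""
    start = time.monotonic()
    with urllib.request.urlopen(url) as resp:
        body = resp.read().decode("utf-8")
    client_ms = (time.monotonic() - start) * 1000.0
    return split_latency(client_ms, body)
```

If a spike shows a large client_ms but a small qtime_ms, the time is probably being lost outside the search itself (request queueing in the container, the network, or writing the response); if qtime_ms is also large, the search is the more likely place to look.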

-Michael

-----Original Message-----
From: solr@isshomefront.com [mailto:solr@isshomefront.com] 
Sent: Thursday, June 28, 2012 7:56 PM
To: solr-user@lucene.apache.org
Subject: Strange "spikes" in query response times...any ideas where else to look?

Greetings all,

We are working on building up a large Solr index for over 300 million  
records...and this is our first look at Solr. We are currently running  
a set of unique search queries against a single server (so no  
replication, no indexing going on at the same time, and no distributed  
search) with a set number of records (in our case, 10 million records  
in the index) for about 30 minutes, with nearly all of our searches  
being unique (I say "nearly" because our set of queries is unique, but  
I have not yet confirmed that JMeter is selecting these queries with  
no replacement).

We are striving for a 2 second response time on the average, and  
indeed we are pretty darned close. In fact, if you look at the average
response time, we are well under the 2 seconds per query.
Unfortunately, we are seeing that about once every 6 minutes or so
(and it is not a regular event...exactly six minutes apart...it is
"about" six minutes but it fluctuates) we get a single query that
returns in something like 15 to 20 seconds.

We have been trying to identify what is causing this "spike" every so  
often and we are completely baffled. What we have done thus far:

1) Looked through the SAR logs and have not seen anything that  
correlates to this issue
2) Tracked the JVM statistics...especially the garbage  
collections...no correlations there either
3) Examined the queries...no pattern obvious there
4) Played with the JVM memory settings (heap settings, cache settings,  
and any other settings we could find)
5) Changed hardware (brand new 4 processor, 8 gig RAM server with a
fresh install of Redhat 5.7 enterprise, tried on a large instance of
AWS EC2, tried on a fresh instance of a VMWare based virtual machine
from our own data center) and still nothing is giving us a clue as to
what is causing these "spikes"
6) No correlation found between the number of hits returned and the spikes


Our data is very simple and so are the queries. The schema consists of  
40 fields, most of which are "string" fields, 2 of which are  
"location" fields, and a small handful of which are integer fields.  
All fields are indexed and all fields are stored.

Our queries are also rather simple. Many of the queries are a simple  
one-field search. The most complex query we have is a 3-field search.  
Again, no correlation has been established between the query and these  
spikes. Also, about 60% of our queries return zero hits (on the
assumption that we want to make Solr search its entire index every so
often). 60% is more than we intended and we will fix that soon...but
that is what is currently happening. Again, no correlation found
between spikes and 0-hit returned queries.

For some time we were testing with 100 million records in the index  
and the aggregate data looked quite good. Most queries were returning  
in under 2 seconds. Unfortunately, it was when we looked at the  
individual data points that we found spikes every 6-8 minutes or so  
hitting sometimes as high as 150 seconds!

We have been testing with 100 million records in the index, 50 million  
records in the index, 25 million, 20 million, 15 million, and 10  
million records. As I indicated at the start, we are now at 10
million records with 15-20 second spikes.

As we have decreased the number of records in the index, the size (but
not the frequency) of the spikes has been dropping.

My question is: Is this type of behavior normal for Solr when it is  
being overstressed? I've read of lots of people with far more  
complicated schemas running MORE than 10 million records in an index  
and they never once complained about these spikes. Since I am new at this,
I am not sure what Solr's "failure mode" looks like when it has too  
many records to search.

I am hoping someone looking at this note can at least give me another  
direction to look. 10 million records searched in less than 2 seconds  
most of the time is great...but those 10 and 20 second spikes are not
going to go over well with our customers...and I somehow think there  
is more we should be able to do here.

Thanks.

Peter S. Lee
ProQuest