Posted to solr-user@lucene.apache.org by "Jani, Vrushank" <vr...@truelocal.com.au> on 2015/05/19 09:51:40 UTC

Issue serving concurrent requests to SOLR on PROD


Hello,

We have production SOLR deployed on AWS. We currently have 4 live SOLR servers running on m3.xlarge EC2 instances behind an ELB (Elastic Load Balancer). We run Apache SOLR in a Tomcat container which sits behind Apache httpd. Apache httpd uses the prefork MPM, and requests flow from the ELB to the Apache httpd server to Tomcat (via AJP).

For the last few days, we have been seeing an increase in requests, around 20000 requests per minute hitting the LB. As a result, we see the ELB Surge Queue Length continuously sitting around 100.
Surge Queue Length: the total number of requests pending submission to the instances, queued by the load balancer.

This is causing latency and timeouts in client applications. Our first reaction was that we don't have enough max connections set in either httpd or Tomcat. However, what we saw is that the servers are very lightly loaded, with very low CPU and memory utilisation. The Apache prefork settings are as below on each server, with keep-alive turned off.

<IfModule prefork.c>
StartServers 8
MinSpareServers 5
MaxSpareServers 20
ServerLimit 256
MaxClients 256
MaxRequestsPerChild 4000
</IfModule>


Tomcat's server.xml has the following settings.

<Connector port="8080" protocol="AJP/1.3" address="127.0.0.1" maxThreads="500" connectionTimeout="60000"/>

For httpd – we see that there are lots of TIME_WAIT connections on the Apache port (around 7000+), but ESTABLISHED connections are around 20.
For Tomcat – we see about 60 ESTABLISHED connections on the Tomcat AJP port.
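
For reference, this is roughly how we tally those connection states; it is just a sketch, assuming Linux netstat, with httpd on port 80 and the Tomcat AJP connector on 8080 as in the server.xml above.

import subprocess
from collections import Counter

# Rough sketch only: tally TCP connection states per local port, the way the
# TIME_WAIT / ESTABLISHED numbers above were gathered. Assumes Linux netstat,
# httpd on port 80 and the Tomcat AJP connector on 8080.
out = subprocess.run(["netstat", "-ant"], capture_output=True, text=True).stdout

counts = Counter()
for line in out.splitlines():
    parts = line.split()
    if len(parts) < 6 or not parts[0].startswith("tcp"):
        continue
    local_addr, state = parts[3], parts[5]
    port = local_addr.rsplit(":", 1)[-1]
    if port in ("80", "8080"):
        counts[(port, state)] += 1

for (port, state), n in sorted(counts.items()):
    print(f"port {port}  {state:<12} {n}")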

So the servers and connections don't look fully utilised to capacity, and there is no visible stress anywhere. However, we still see requests queuing up on the LB because they cannot be served by the underlying servers.

Can you please help me resolve this issue? Can you see any apparent problem here? Am I missing any configuration or settings for SOLR?

Your help will be truly appreciated.

Regards
VJ






Vrushank Jani | Senior Java Developer
T 02 8312 1625 | E vrushank.jani@truelocal.com.au



Re: Issue serving concurrent requests to SOLR on PROD

Posted by Shawn Heisey <ap...@elyograg.org>.
On 5/19/2015 1:51 AM, Jani, Vrushank wrote:
> We run Apache SOLR in a Tomcat container which sits behind Apache httpd. Apache httpd uses the prefork MPM, and requests flow from the ELB to the Apache httpd server to Tomcat (via AJP).
> [...]
> <Connector port="8080" protocol="AJP/1.3" address="127.0.0.1" maxThreads="500" connectionTimeout="60000"/>
> [...]
> Can you please help me resolve this issue? Can you see any apparent problem here? Am I missing any configuration or settings for SOLR?

I'm curious about why you have Apache sitting in front of Tomcat.  About
the only reason I can think of to require that step is that you are
using it to require authentication or to deny access to things like the
admin UI.   If you are not doing anything in Apache other than proxying
the traffic, then drop the middleman and use the container directly with
its own HTTP connector.  Or even better, use the Jetty included with Solr.

You should set maxThreads to 10000 in your Tomcat configuration,
effectively removing the limit.  Solr is a multi-threaded Java servlet,
with background threads as well as request-based threads.  Tomcat
requires threads for handling connections, but Solr also requires
threads for its own operation.  The maxThreads limit counts *all* of
those threads, not just the Tomcat threads.
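
If you do keep Tomcat, in server.xml terms that would look something like the sketch below; the port and timeout values are placeholders, not recommendations.

<!-- Sketch only: a plain HTTP connector with the thread limit effectively
     removed, replacing the AJP connector behind httpd. Port and timeout
     values here are placeholders. -->
<Connector port="8983" protocol="HTTP/1.1"
           maxThreads="10000"
           connectionTimeout="60000"/>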

Thanks,
Shawn


Re: Issue serving concurrent requests to SOLR on PROD

Posted by Erick Erickson <er...@gmail.com>.
Just to pile on:

How's your CPU utilization? That's the first place to look. The very first
question to answer is: "Is Solr the bottleneck, or the rest of the
infrastructure?" One _very_ quick measure is CPU utilization. If it's
running along at 100%, then you need to improve your queries or add more
Solr nodes. If it's not, then you have more detective work to do, because
it could be a lot of things: I/O contention, your container backing up
queries, etc.

Look at the admin UI, the stats section for the core in question. There
you'll see query times at various percentiles (95th, 99th, etc.). Or
analyze the QTimes in the Solr logs. That should give you a sense of how
fast Solr is serving queries.
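
A quick-and-dirty sketch of that log analysis, assuming the usual
"QTime=<millis>" entries in the Solr request log and the log path as the
first argument:

import re
import sys

# Rough sketch: pull QTime values out of a Solr log and print a few
# percentiles, to get a sense of how fast queries are being served.
pattern = re.compile(r"QTime=(\d+)")
qtimes = []
with open(sys.argv[1]) as f:
    for line in f:
        m = pattern.search(line)
        if m:
            qtimes.append(int(m.group(1)))

if not qtimes:
    sys.exit("no QTime entries found")

qtimes.sort()
for pct in (50, 95, 99):
    idx = min(len(qtimes) - 1, len(qtimes) * pct // 100)
    print(f"p{pct}: {qtimes[idx]} ms  (n={len(qtimes)})")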

All that said, you are running about 80 QPS per node, which is humming
right along; my first suspicion is that you need more Solr instances, but
check first.

Best,
Erick


Re: Issue serving concurrent requests to SOLR on PROD

Posted by Luis Cappa Banda <lu...@gmail.com>.
Hi there,

Unfortunately I don't agree with Shawn when he suggests updating the
server.xml configuration to maxThreads=10000. If Tomcat (due to the
concurrent overload you're suffering, the type of queries you're handling,
etc.) cannot keep up with the requested queries, what can happen is that
Tomcat's internal request queue fills up and an OutOfMemoryError may appear
to say hello to you.

Solr is multithreaded and so is Tomcat, but those Tomcat threads are
managed by an internal thread pool with a queue. What Tomcat does is
dispatch requests, as fast as it can, to the web applications deployed in
it (in this case, Solr). If Tomcat receives more requests than it can
answer, its internal queue starts to fill.
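
To make that pool and its queue explicit, server.xml can declare a shared
Executor with a bounded queue and point the connector at it. This is just a
sketch; the name and sizes below are placeholders, not recommendations.

<!-- Sketch only: an explicit, bounded thread pool so the queueing is
     visible and capped instead of growing without limit. -->
<Executor name="solrPool" namePrefix="solr-exec-"
          maxThreads="500" minSpareThreads="20" maxQueueSize="100"/>
<Connector port="8080" protocol="AJP/1.3" address="127.0.0.1"
           executor="solrPool" connectionTimeout="60000"/>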

Those client-side timeouts you described seem to be due to the Tomcat
thread pool and its queue starting to fill up. You can check this by
monitoring Tomcat's memory and thread usage; I'm sure you'll see both grow
in correlation with the number of concurrent requests received. Then, once
memory usage flattens into a more or less horizontal line, those timeouts
will start to appear on the client side.

Basically I think the possible scenarios are:

   - Queries are slow. You should check and try to improve them, because
   maybe they are badly formed and are destroying your performance. Also,
   check your index configuration (number of segments, etc.).
   - Queries are OK, but you receive more queries than you can handle.
   Your configuration and everything else is fine, but you are trying to
   consume more requests than you can dispatch and answer.

If you cannot improve your queries, or your queries are OK but you receive
more requests than you can handle, the only solution is to scale
horizontally and start up new Tomcat + Solr nodes, going from 4 to N.


Best,


- Luis Cappa


Re: Issue serving concurrent requests to SOLR on PROD

Posted by Michael Della Bitta <md...@gmail.com>.
Are you sure the requests are getting queued because the LB is detecting 
that Solr won't handle them?

The reason I'm asking is that I know ELB doesn't handle bursts well. 
The load balancer needs to "warm up," which essentially means it might 
be underpowered at the beginning of a burst. It will spool up more 
resources if the average load over the last minute is high. But for that 
minute it will definitely not be able to handle a burst.

If you're testing infrastructure using a benchmarking tool that doesn't 
slowly ramp up traffic, you're definitely encountering this problem.
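
For example, something as simple as the ramp-up sketch below avoids hitting 
a cold ELB at full speed from the start; the URL and rates are placeholders 
for your own test.

import threading
import time
import urllib.request

# Sketch only: a toy load generator that ramps traffic up gradually instead
# of starting at full blast, so the ELB has time to warm up.
URL = "http://your-elb-hostname/solr/collection1/select?q=*:*"

def fire():
    try:
        urllib.request.urlopen(URL, timeout=10).read()
    except Exception as e:
        print("error:", e)

for minute in range(10):
    rate = 50 * (minute + 1)      # requests per minute, growing each minute
    interval = 60.0 / rate
    end = time.time() + 60
    while time.time() < end:
        threading.Thread(target=fire, daemon=True).start()
        time.sleep(interval)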

Michael
