Posted to solr-user@lucene.apache.org by Jared Rodriguez <jr...@kitedesk.com> on 2014/02/13 21:38:43 UTC

SolrJ Socket Leak

I am using Solr/SolrJ 4.6.1 along with Apache HttpClient 4.3.2 as part
of a web application which connects to the Solr server via SolrJ
using CloudSolrServer.  The web application is wired up with Guice, and
there is a single instance of the CloudSolrServer class used by all inbound
requests.  All of this is running on Amazon.

Basically, everything looks and runs fine for a while, but even with
moderate concurrency, SolrJ starts leaving sockets open.  We are handling
only about 250 connections to the web app per minute, and each of these
issues 3-7 requests to Solr.  Over a 30-minute period of this kind of
use, we end up with many thousands of lingering sockets.  I can see these
when running netstat:

tcp        0      0 ip-10-80-14-26.ec2.in:41098 ip-10-99-145-47.ec2.i:glrpc
TIME_WAIT

All go to the same target host, which is my Solr server. There are no other
pieces of infrastructure on that box, just Solr.  Eventually, the server
just dies, as no further sockets can be opened and the opened ones are not
reused.

The Solr server itself is unfazed and running like a champ.  Average time
per request of 0.126, as seen in the query handler stats in the Solr
admin UI.

Apache HttpClient had a bunch of leaks in version 4.2.x that they
cleaned up and refactored in 4.3.x, which is why I upgraded.  Currently,
SolrJ makes use of the old leaky 4.2 classes for establishing connections
and using a connection pool.

http://www.apache.org/dist/httpcomponents/httpclient/RELEASE_NOTES-4.3.x.txt



-- 
Jared Rodriguez

Re: SolrJ Socket Leak

Posted by Jared Rodriguez <jr...@kitedesk.com>.
Kiran & Shawn,

Thank you both for the info; you are both absolutely correct.  The issue
was not that sockets were leaked, but that TIME_WAIT delay is a killer.  I
ended up fixing the problem by changing the system property
"http.maxConnections", which Apache HttpClient uses internally to set up
the PoolingClientConnectionManager.  Previously, this had no value and
was defaulting to 5.  That meant that any time there were more than 50
(maxConnections * maxPerRoute) concurrent connections to the Solr server,
non-reusable connections were being opened and closed, and thus sitting in
that idle state: too many sockets.

The fix was simply tuning the pool and setting "http.maxConnections" to a
higher value representing the number of concurrent users that I expect.
Problem fixed, and a modest speed improvement simply from higher socket
reuse.
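For anyone hitting the same thing, here is a minimal sketch of that change.
The property name is the real JDK/HttpClient networking property; the value
128 is just an example, and it must be set before the first connection
manager is created, since the pool reads it only once:

```java
public class PoolTuning {
    public static void main(String[] args) {
        // Must run before Apache HttpClient builds its connection pool;
        // useSystemProperties() reads "http.maxConnections" once at build time.
        System.setProperty("http.maxConnections", "128");

        // Verify the value is visible to the JVM.
        System.out.println(System.getProperty("http.maxConnections"));
    }
}
```

Equivalently, pass -Dhttp.maxConnections=128 on the JVM command line so the
value is in place before any client code runs.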

Thank you both for the help!

Jared






-- 
Jared Rodriguez

Re: SolrJ Socket Leak

Posted by Kiran Chitturi <ki...@lucidworks.com>.
Jared,

I faced a similar issue when using CloudSolrServer with Solr. As Shawn
pointed out, the 'TIME_WAIT' status happens when the connection is closed
by the HTTP client. The HTTP client closes a connection whenever it thinks
the connection is stale
(https://hc.apache.org/httpcomponents-client-ga/tutorial/html/connmgmt.html#d5e405).
Even the docs point out that stale connection checking cannot be
completely reliable.

I see two ways to get around this:

	1. Enable 'SO_REUSEADDR'
	2. Disable stale connection checks.

Also, by default, when we create a CloudSolrServer (CSS) it does not
explicitly configure any HTTP client parameters
(https://github.com/apache/lucene-solr/blob/trunk/solr/solrj/src/java/org/apache/solr/client/solrj/impl/CloudSolrServer.java#L124).
In this case, the default configuration parameters (max connections, max
connections per host) are used for an HTTP connection. You can explicitly
configure these params when creating the CSS using HttpClientUtil:

	ModifiableSolrParams params = new ModifiableSolrParams();
	params.set(HttpClientUtil.PROP_MAX_CONNECTIONS, 128);
	params.set(HttpClientUtil.PROP_MAX_CONNECTIONS_PER_HOST, 32);
	params.set(HttpClientUtil.PROP_FOLLOW_REDIRECTS, false);
	params.set(HttpClientUtil.PROP_CONNECTION_TIMEOUT, 30000);

	final HttpClient client = HttpClientUtil.createClient(params);
	LBHttpSolrServer lb = new LBHttpSolrServer(client);
	CloudSolrServer server = new CloudSolrServer(zkConnect, lb);


Currently, I am using HTTP client 4.3.2 and building the client when
creating the CSS. I also use the 'SO_REUSEADDR' option, and I haven't seen
'TIME_WAIT' after this (maybe because of better handling of stale
connections in 4.3.2, or because the 'SO_REUSEADDR' option is enabled). My
current HTTP client code looks like this (works only with HTTP client
4.3.2):

	HttpClientBuilder httpBuilder = HttpClientBuilder.create();

	Builder socketConfig = SocketConfig.custom();
	socketConfig.setSoReuseAddress(true);
	socketConfig.setSoTimeout(10000);
	httpBuilder.setDefaultSocketConfig(socketConfig.build());
	httpBuilder.setMaxConnTotal(300);
	httpBuilder.setMaxConnPerRoute(100);

	httpBuilder.disableRedirectHandling();
	httpBuilder.useSystemProperties();

	CloseableHttpClient httpClient = httpBuilder.build();
	LBHttpSolrServer lb = new LBHttpSolrServer(httpClient, parser);
	CloudSolrServer server = new CloudSolrServer(zkConnect, lb);


There should be a way to configure socket reuse with 4.2.x too. You can
try different configurations. I am surprised you have 'TIME_WAIT'
connections even after 30 minutes, because a 'TIME_WAIT' connection should
be closed by the OS within about two minutes by default, I think.


HTH,

-- 
Kiran Chitturi,




Re: SolrJ Socket Leak

Posted by Shawn Heisey <so...@elyograg.org>.
On 2/14/2014 2:45 AM, Jared Rodriguez wrote:
> Thanks for the info, I will look into the open file count and try to
> provide more info on how this is occurring.
> 
> Just to make sure that our scenarios were the same, in your tests did you
> simulate many concurrent inbound connections to your web app, with each
> connection sharing the same instance of HttpSolrServer for queries?

I've bumped the max open file limit (in /etc/security/limits.conf on
CentOS) to a soft/hard limit of 49151/65535.  I've also bumped the
process limits to 4096/6144.  These are specific to the user that runs
Solr and other related programs.
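For reference, entries along these lines in /etc/security/limits.conf (the
user name "solr" here is a placeholder for whatever account runs Solr):

```
solr  soft  nofile  49151
solr  hard  nofile  65535
solr  soft  nproc   4096
solr  hard  nproc   6144
```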

My SolrJ program is not actually a web application.  It is my indexing
application, a standalone java program.  We do use SolrJ in our web
application, but that's handled by someone else.  I do know that it uses
a single HttpSolrServer instance across the entire webapp.

When this specific copy of the indexing application (for my dev server)
starts up, it creates 15 HttpSolrServer instances that are used for the
life of the application.  The application will run for weeks or months
at a time and has never had a problem with leaks.

One of these instances points at the /solr URL, which I use for
CoreAdminRequest queries.  Each of the other 14 point at one of the Solr
cores.  My production copy, which has a config file to update two copies
of the index on four servers, creates 32 instances -- four of them for
CoreAdmin requests and 28 of them for cores.

Updates are run once a minute.  One cycle will typically involve several
Solr requests.  Sometimes they are queries, but most of the time they
are update requests.

The application uses database connection pooling (Apache Commons code)
to talk to a MySQL server, pulls in data for indexing, and then sends
requests to Solr.  Most of the time, it only goes to one HttpSolrServer
instance, the core where all new data lives.  Occasionally it will talk
to up to seven of the 14 HttpSolrServer instances -- the ones pointing
at the "live" cores.

When a full rebuild is underway, it starts the dataimport handler on the
seven build cores.  As part of the once-a-minute update cycle, it also
gathers status information on those dataimports.  When the rebuild
finishes, it runs an update on those cores and then does CoreAdmin SWAP
requests to switch to the new index.

I did run a rebuild, and I let the normal indexing run for a really long
time, so I could be sure that it was using all HttpSolrServer instances.
It never had more than a few dozen connections listed in the netstat
output.

Thanks,
Shawn


Re: SolrJ Socket Leak

Posted by Jared Rodriguez <jr...@kitedesk.com>.
Thanks for the info, I will look into the open file count and try to
provide more info on how this is occurring.

Just to make sure that our scenarios were the same, in your tests did you
simulate many concurrent inbound connections to your web app, with each
connection sharing the same instance of HttpSolrServer for queries?




-- 
Jared Rodriguez

Re: SolrJ Socket Leak

Posted by Shawn Heisey <so...@elyograg.org>.
On 2/13/2014 3:17 PM, Jared Rodriguez wrote:
> I just regressed to Solrj 4.6.1 with http client 4.2.6 and am trying to
> reproduce the problem.  Using YourKit to profile and even just manually
> simulating a few users at once, I see the same problem of open sockets.  6
> sockets opened to the solr server and 2 of them still open after all is
> done and there is no server activity.  Although this could be sockets kept
> in a connection pool.

I did two separate upgrade steps, SolrJ 4.5.1 to 4.6.1, and HttpClient 
4.3.1 to 4.3.2, and I'm not seeing any evidence of connection leaks.


On your connections, if they are in TIME_WAIT, I'm pretty sure that 
means that the program is done with them because it's closed the 
connection and it's the operating system that is in charge.  See the 
answer with the green checkmark here:

http://superuser.com/questions/173535/what-are-close-wait-and-time-wait-states

I think the default timeout for WAIT states on a modern Linux system is 
60 seconds, not four minutes as described on that answer.

With your connection rate and the default 60 second timeout for WAIT 
states, another resource that might be in short supply is file descriptors.
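As a rough back-of-the-envelope sketch of why TIME_WAIT piles up at that
rate (the 5 requests-per-hit figure is just the midpoint of the 3-7 range
from the original post):

```java
public class TimeWaitEstimate {
    public static void main(String[] args) {
        int webRequestsPerMinute = 250;    // from the original post
        int solrRequestsPerWebRequest = 5; // midpoint of the 3-7 range
        int timeWaitSeconds = 60;          // typical Linux default

        // With no connection reuse, every Solr request leaves one socket
        // lingering in TIME_WAIT for timeWaitSeconds after it closes.
        int closedPerMinute = webRequestsPerMinute * solrRequestsPerWebRequest;
        int steadyStateTimeWait = closedPerMinute * timeWaitSeconds / 60;

        System.out.println(closedPerMinute);     // sockets closed per minute
        System.out.println(steadyStateTimeWait); // sockets in TIME_WAIT at any moment
    }
}
```

So even at steady state you would expect on the order of a thousand sockets
sitting in TIME_WAIT, and over 30 minutes tens of thousands of distinct
sockets pass through that state, consistent with the netstat output.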

Thanks,
Shawn


Re: SolrJ Socket Leak

Posted by Jared Rodriguez <jr...@kitedesk.com>.
Thanks Shawn,

I just regressed to SolrJ 4.6.1 with HttpClient 4.2.6 and am trying to
reproduce the problem.  Using YourKit to profile, and even just manually
simulating a few users at once, I see the same problem of open sockets: 6
sockets opened to the Solr server, and 2 of them still open after all is
done and there is no server activity.  Although these could be sockets kept
in a connection pool.



-- 
Jared Rodriguez

Re: SolrJ Socket Leak

Posted by Shawn Heisey <so...@elyograg.org>.
On 2/13/2014 1:38 PM, Jared Rodriguez wrote:

This is something that I can look into.

I have a SolrJ program with SolrJ 4.5.1 and HttpClient 4.3.1 that does 
not leak anything.  I thought it was migrated already to SolrJ 4.6.1, 
but now that I know it's not, I will upgrade SolrJ first and then 
HttpClient, and see whether I have the same problem with either upgrade.

I am using HttpSolrServer, not CloudSolrServer, because the Solr servers 
are not running SolrCloud.  CloudSolrServer ultimately uses 
HttpSolrServer for its communication, so my initial thought is that this 
is not important, but we'll see.

In version 4.7, Solr will include HttpClient 4.3.1.  See SOLR-5590.

https://issues.apache.org/jira/browse/SOLR-5590

A question for committers with a lot of experience: Do we have any tests 
that check for connection leaks?

Thanks,
Shawn