Posted to httpclient-users@hc.apache.org by vigna <vi...@di.unimi.it> on 2012/12/23 18:37:36 UTC

AbstractNIOConnPool memory leak?

We are using DefaultHttpAsyncClient to download pages over 100 simultaneous
connections at a very high rate (hundreds per second). We are experiencing a
severe memory leak: the routeToPool map in AbstractNIOConnPool quickly
becomes huge and exhausts memory, apparently keeping some state for all
past connections (which after a few hours number in the millions).

We are using a standard execute() call with an AsyncByteConsumer and a
FutureCallback, so we assumed all resource handling would have been done
automatically. Is there anything we're doing wrong?
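
For reference, the call pattern is essentially the following (a trimmed
sketch: the URL and the Boolean result type are placeholders):

    import java.nio.ByteBuffer;
    import java.util.concurrent.Future;

    import org.apache.http.HttpResponse;
    import org.apache.http.concurrent.FutureCallback;
    import org.apache.http.impl.nio.client.DefaultHttpAsyncClient;
    import org.apache.http.nio.IOControl;
    import org.apache.http.nio.client.methods.AsyncByteConsumer;
    import org.apache.http.nio.client.methods.HttpAsyncMethods;
    import org.apache.http.protocol.HttpContext;

    public class FetchSketch {
        public static void main(String[] args) throws Exception {
            DefaultHttpAsyncClient client = new DefaultHttpAsyncClient();
            client.start();
            Future<Boolean> future = client.execute(
                    HttpAsyncMethods.createGet("http://example.com/"),
                    new AsyncByteConsumer<Boolean>() {
                        @Override
                        protected void onResponseReceived(HttpResponse response) {
                            // status line and headers arrive here
                        }
                        @Override
                        protected void onByteReceived(ByteBuffer buf, IOControl ioctrl) {
                            // consume the chunk; the reactor recycles the buffer
                        }
                        @Override
                        protected Boolean buildResult(HttpContext context) {
                            return Boolean.TRUE; // page fully downloaded
                        }
                    },
                    new FutureCallback<Boolean>() {
                        public void completed(Boolean result) { /* count the page */ }
                        public void failed(Exception ex) { /* log it, move on */ }
                        public void cancelled() { /* nothing to do */ }
                    });
            future.get();
            client.shutdown();
        }
    }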






Re: When less is more; Re: AbstractNIOConnPool memory leak?

Posted by Oleg Kalnichevski <ol...@apache.org>.
On Sun, 2013-01-06 at 23:14 -0800, vigna wrote:
> > Try reducing the number of concurrent connections from 20k to, say, 2k 
> > and you may be surprised to find out that a smaller number of 
> > connections can actually chew through the same workload faster. If the 
> 
> Well... no. :) We have an experimental setup with a local proxy generating a
> "fake web" that we use to check the speed of the pipeline independently of
> the network conditions.
> 
> With 1000 parallel DefaultHttpClient instances (different instances, not one
> instance with pooling) we download >10000 pages/s.
> 
> With 1000 parallel requests on a DefaultHttpAsyncClient we download >500
> pages/s, but as soon as we try to increase the number of parallel requests
> the speed drops to 100 pages/s, which makes the client useless for us at the
> moment.
> 
> Of course this is somewhat artificial—you don't actually download at
> 100MB/s. But the fact that with 2000 parallel requests you actually go
> *slower* is a problem.
> 

I am sorry but I fail to see how that all proves your point (or
disproves mine). It even sounds completely unrelated to what I was
trying to tell you. Well, then, let us just agree to disagree.

Oleg





Re: When less is more; Re: AbstractNIOConnPool memory leak?

Posted by vigna <vi...@di.unimi.it>.
> Try reducing the number of concurrent connections from 20k to, say, 2k 
> and you may be surprised to find out that a smaller number of 
> connections can actually chew through the same workload faster. If the 

Well... no. :) We have an experimental setup with a local proxy generating a
"fake web" that we use to check the speed of the pipeline independently of
the network conditions.

With 1000 parallel DefaultHttpClient instances (different instances, not one
instance with pooling) we download >10000 pages/s.

With 1000 parallel requests on a DefaultHttpAsyncClient we download >500
pages/s, but as soon as we try to increase the number of parallel requests
the speed drops to 100 pages/s, which makes the client useless for us at the
moment.

Of course this is somewhat artificial—you don't actually download at
100MB/s. But the fact that with 2000 parallel requests you actually go
*slower* is a problem.






When less is more; Re: AbstractNIOConnPool memory leak?

Posted by Oleg Kalnichevski <ol...@apache.org>.
On Sat, 2013-01-05 at 15:58 -0800, vigna wrote:
> Oh, well, I'm sorry, I'm not really a network person :). I meant that we want
> to keep 20K connections busy and transferring data while respecting
> politeness, not to keep them open in the TCP sense. My fault.
> 
> 

Try reducing the number of concurrent connections from 20k to, say, 2k
and you may be surprised to find out that a smaller number of
connections can actually chew through the same workload faster. If the
JVM spends less time switching between contexts (be it thread context
switching or switching channels in an I/O selector) it is more likely to
spend more time actually doing something useful like reading and
processing data. So, is it _really_ necessary to keep 20k
connections open at the same time?

Oleg





Re: AbstractNIOConnPool memory leak?

Posted by vigna <vi...@di.unimi.it>.
Oh, well, I'm sorry, I'm not really a network person :). I meant that we want
to keep 20K connections busy and transferring data while respecting
politeness, not to keep them open in the TCP sense. My fault.






Re: AbstractNIOConnPool memory leak?

Posted by Oleg Kalnichevski <ol...@apache.org>.
On Sat, 2013-01-05 at 15:40 -0800, vigna wrote:
> It seems like we are talking about different things. The thousands of open
> connections move between *hundreds of thousands* of servers. You do not keep
> connections open—and anyway people often set Apache httpd's timeout for
> reusing connections below a reasonable politeness threshold (e.g., 5s). We
> close connections immediately and move to a new server.
> 
> 

I am not sure. You said previously that you wanted to keep tens of thousands
of concurrently open connections for the sake of 'politeness'. To me
that pretty much implies that the I/O reactor has to select through tens
of thousands of connections, most of which are not utilized. This certainly
carries a significant cost in terms of performance. (Unless I am missing
something.)

Oleg

PS: I'll be off-line shortly and will pick up this thread tomorrow
evening.




Re: AbstractNIOConnPool memory leak?

Posted by vigna <vi...@di.unimi.it>.
It seems like we are talking about different things. The thousands of open
connections move between *hundreds of thousands* of servers. You do not keep
connections open—and anyway people often set Apache httpd's timeout for
reusing connections below a reasonable politeness threshold (e.g., 5s). We
close connections immediately and move to a new server.






Re: AbstractNIOConnPool memory leak?

Posted by Ken Krugler <kk...@transpac.com>.
On Jan 5, 2013, at 3:36pm, Oleg Kalnichevski wrote:

> On Sat, 2013-01-05 at 22:11 +0000, sebb wrote:
>> On 5 January 2013 21:33, vigna <vi...@di.unimi.it> wrote:
>>>> But why would you want a web crawler to have 10-20K simultaneously
>>>> opened connections in the first place?
>>> 
>>> (I thought I answered this, but it's not on the archive. Boh.)
>>> 
>>> Having a few thousand connections open is the only way to retrieve data
>>> respecting politeness (e.g., not banging the same site too often).
>> 
>> Huh?
>> There are surely other ways to achieve that goal.
>> 
> 
> I could not agree more. I personally think that closing idle connections
> and letting the server reclaim the resources associated with them
> (potentially enabling the server to serve other clients) would be more
> 'polite'. It is cheaper for both the client and the server to close
> connections more frequently than keeping them alive just in case.

Just to clarify, for our web crawl we were using a connection pool and letting idle connections be reclaimed.

But we were also doing small batches of URLs (e.g. 5 at a time) when hitting the same server, keeping the connection open. This was an attempt to balance the cost to the target server of establishing a new connection against being polite. For typical web sites this feels like a win, but low-traffic sites with complex pages generated by JSP code (for example) could be unhappy. I know that Heritrix uses a strategy of varying its crawl delay based on the response time of the server, which could be a better approach to constraining the number of keep-alive requests.
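
Schematically, that Heritrix policy amounts to something like this (the
factor and bounds are invented for illustration):

    // Heritrix-style adaptive politeness, as I understand it: the delay
    // before re-hitting a host scales with that host's last observed
    // response time. The factor and bounds are invented for illustration.
    static long crawlDelayMs(long lastResponseMs) {
        final long minDelayMs = 1000;   // never hit a host more than 1/s
        final long maxDelayMs = 30000;  // but do not stall the queue forever
        final double delayFactor = 5.0; // wait ~5x the observed response time
        return Math.min(maxDelayMs,
                Math.max(minDelayMs, (long) (lastResponseMs * delayFactor)));
    }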

-- Ken

--------------------------
Ken Krugler
+1 530-210-6378
http://www.scaleunlimited.com
custom big data solutions & training
Hadoop, Cascading, Mahout & Solr






Re: AbstractNIOConnPool memory leak?

Posted by Oleg Kalnichevski <ol...@apache.org>.
On Sat, 2013-01-05 at 22:11 +0000, sebb wrote:
> On 5 January 2013 21:33, vigna <vi...@di.unimi.it> wrote:
> >> But why would you want a web crawler to have 10-20K simultaneously
> >> opened connections in the first place?
> >
> > (I thought I answered this, but it's not on the archive. Boh.)
> >
> > Having a few thousand connections open is the only way to retrieve data
> > respecting politeness (e.g., not banging the same site too often).
> 
> Huh?
> There are surely other ways to achieve that goal.
> 

I could not agree more. I personally think that closing idle connections
and letting the server reclaim the resources associated with them
(potentially enabling the server to serve other clients) would be more
'polite'. It is cheaper for both the client and the server to close
connections more frequently than keeping them alive just in case.

Oleg 




Re: HttpClient performance with multiple threads; Re: AbstractNIOConnPool memory leak?

Posted by Ken Krugler <kk...@transpac.com>.
Hi Oleg,

Thanks for the responses. I've filed a Bixo issue to try using the new minimal version of HttpClient, and also the unlimited connection manager.

I'll try to test using an existing crawl workflow that hits the top-level pages for 60K domains, though that's not exactly the same as a large-scale crawl.

-- Ken


On Jan 7, 2013, at 2:39am, Oleg Kalnichevski wrote:

> On Sun, 2013-01-06 at 15:48 -0800, Ken Krugler wrote:
>> Hi Oleg,
>> 
>> [snip]
>> 
>>> Ken,
>>> 
>>> You might want to have a look at the latest code in SVN trunk (to be
>>> released as 4.3). Several classes such as the scheme registry that
>>> previously had to be synchronized in order to ensure thread safety have
>>> been replaced with immutable equivalents. There is also now a way to
>>> create HttpClient in a minimal configuration without authentication,
>>> state management (cookies), proxy support and other non-essential
>>> functions.
>> 
>> That sounds interesting - any hints as to how to create this minimal HttpClient?
>> 
> 
> The new API is not yet final and not properly documented. Presently this
> can be done with HttpClients#createMinimal
> 
> 
>>> These functions are not merely disabled but physically
>>> removed from the processing pipeline, which should result in somewhat
>>> better performance in high thread contention scenarios, as the only
>>> synchronization point involved in request execution would be the lock of
>>> the connection pool. Minimal HttpClient may be particularly useful for
>>> anonymous web crawling when authentication and state management are not
>>> required.
>>> 
>>> 
>>>> 3. Global lock on connection pool
>>>> 
>>>> Oleg had written:
>>>> 
>>>>> Yes, your observation is correct. The problem is that the connection
>>>>> pool is guarded by a global lock. Naturally if you have 400 threads
>>>>> trying to obtain a connection at about the same time all of them end up
>>>>> contending for one lock. The problem is that I can't think of a
>>>>> different way to ensure the max limits (per route and total) are
>>>>> guaranteed not to be exceeded. If anyone can think of a better algorithm
>>>>> please do let me know. What might be a possibility is creating a more
>>>>> lenient and less prone to lock contention issues implementation that may
>>>>> under stress occasionally allocate a few more connections than the max
>>>>> limits.
>>>> 
>>>> I don't know if this has been resolved. My work-around from a few years ago was to rely on having multiple Hadoop reducers running on the server (each in their own JVM), where I could then limit each JVM to at most 300 connections.
>>>> 
>>> 
>>> I experimented with the idea of lock-less (unlimited) connection manager
>>> but in my tests it did not perform any better than the standard
>>> connection manager.
>> 
>> Previously I'd asked:
>> 
>>> Would it work to go for finer-grained locking, by using atomic counters to track & enforce limits on per route/total connections?
>> 
>> Any thoughts on that approach? E.g. have a map from route to atomic counter, and a single atomic counter for total connections?
>> 
> 
> This may be worthwhile to try. However, in theory this should not
> perform any better than the approach I took with my experiments. The
> main problem is, though, that I do not have a good test framework that
> emulates an environment a web crawler is expected to operate in (and
> have no justification for building one in my spare time). So, this kind
> of effort ideally should be led by an external contributor.
> 
> Oleg

--------------------------
Ken Krugler
+1 530-210-6378
http://www.scaleunlimited.com
custom big data solutions & training
Hadoop, Cascading, Cassandra & Solr













Re: HttpClient performance with multiple threads; Re: AbstractNIOConnPool memory leak?

Posted by Oleg Kalnichevski <ol...@apache.org>.
On Sun, 2013-01-06 at 15:48 -0800, Ken Krugler wrote:
> Hi Oleg,
> 
> [snip]
> 
> > Ken,
> > 
> > You might want to have a look at the latest code in SVN trunk (to be
> > released as 4.3). Several classes such as the scheme registry that
> > previously had to be synchronized in order to ensure thread safety have
> > been replaced with immutable equivalents. There is also now a way to
> > create HttpClient in a minimal configuration without authentication,
> > state management (cookies), proxy support and other non-essential
> > functions.
> 
> That sounds interesting - any hints as to how to create this minimal HttpClient?
> 

The new API is not yet final and not properly documented. Presently this
can be done with HttpClients#createMinimal
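
For example (the names may still change before release):

    import org.apache.http.client.methods.CloseableHttpResponse;
    import org.apache.http.client.methods.HttpGet;
    import org.apache.http.impl.client.CloseableHttpClient;
    import org.apache.http.impl.client.HttpClients;

    // Bare-bones client: no auth, no cookies, no proxy support in the pipeline
    CloseableHttpClient client = HttpClients.createMinimal();
    CloseableHttpResponse response = client.execute(new HttpGet("http://example.com/"));
    try {
        // consume the response entity here
    } finally {
        response.close();
    }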


> > These functions are not merely disabled but physically
> > removed from the processing pipeline, which should result in somewhat
> > better performance in high thread contention scenarios, as the only
> > synchronization point involved in request execution would be the lock of
> > the connection pool. Minimal HttpClient may be particularly useful for
> > anonymous web crawling when authentication and state management are not
> > required.
> > 
> > 
> >> 3. Global lock on connection pool
> >> 
> >> Oleg had written:
> >> 
> >>> Yes, your observation is correct. The problem is that the connection
> >>> pool is guarded by a global lock. Naturally if you have 400 threads
> >>> trying to obtain a connection at about the same time all of them end up
> >>> contending for one lock. The problem is that I can't think of a
> >>> different way to ensure the max limits (per route and total) are
> >>> guaranteed not to be exceeded. If anyone can think of a better algorithm
> >>> please do let me know. What might be a possibility is creating a more
> >>> lenient and less prone to lock contention issues implementation that may
> >>> under stress occasionally allocate a few more connections than the max
> >>> limits.
> >> 
> >> I don't know if this has been resolved. My work-around from a few years ago was to rely on having multiple Hadoop reducers running on the server (each in their own JVM), where I could then limit each JVM to at most 300 connections.
> >> 
> > 
> > I experimented with the idea of lock-less (unlimited) connection manager
> > but in my tests it did not perform any better than the standard
> > connection manager.
> 
> Previously I'd asked:
> 
> > Would it work to go for finer-grained locking, by using atomic counters to track & enforce limits on per route/total connections?
> 
> Any thoughts on that approach? E.g. have a map from route to atomic counter, and a single atomic counter for total connections?
> 

This may be worthwhile to try. However, in theory this should not
perform any better than the approach I took with my experiments. The
main problem is, though, that I do not have a good test framework that
emulates an environment a web crawler is expected to operate in (and
have no justification for building one in my spare time). So, this kind
of effort ideally should be led by an external contributor.

Oleg





Re: HttpClient performance with multiple threads; Re: AbstractNIOConnPool memory leak?

Posted by Ken Krugler <kk...@transpac.com>.
Hi Oleg,

[snip]

> Ken,
> 
> You might want to have a look at the latest code in SVN trunk (to be
> released as 4.3). Several classes such as the scheme registry that
> previously had to be synchronized in order to ensure thread safety have
> been replaced with immutable equivalents. There is also now a way to
> create HttpClient in a minimal configuration without authentication,
> state management (cookies), proxy support and other non-essential
> functions.

That sounds interesting - any hints as to how to create this minimal HttpClient?

> These functions are not merely disabled but physically
> removed from the processing pipeline, which should result in somewhat
> better performance in high thread contention scenarios, as the only
> synchronization point involved in request execution would be the lock of
> the connection pool. Minimal HttpClient may be particularly useful for
> anonymous web crawling when authentication and state management are not
> required.
> 
> 
>> 3. Global lock on connection pool
>> 
>> Oleg had written:
>> 
>>> Yes, your observation is correct. The problem is that the connection
>>> pool is guarded by a global lock. Naturally if you have 400 threads
>>> trying to obtain a connection at about the same time all of them end up
>>> contending for one lock. The problem is that I can't think of a
>>> different way to ensure the max limits (per route and total) are
>>> guaranteed not to be exceeded. If anyone can think of a better algorithm
>>> please do let me know. What might be a possibility is creating a more
>>> lenient and less prone to lock contention issues implementation that may
>>> under stress occasionally allocate a few more connections than the max
>>> limits.
>> 
>> I don't know if this has been resolved. My work-around from a few years ago was to rely on having multiple Hadoop reducers running on the server (each in their own JVM), where I could then limit each JVM to at most 300 connections.
>> 
> 
> I experimented with the idea of lock-less (unlimited) connection manager
> but in my tests it did not perform any better than the standard
> connection manager.

Previously I'd asked:

> Would it work to go for finer-grained locking, by using atomic counters to track & enforce limits on per route/total connections?

Any thoughts on that approach? E.g. have a map from route to atomic counter, and a single atomic counter for total connections?
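
Roughly what I have in mind (names invented, and deliberately lenient: it
can briefly overshoot the limits under contention):

    import java.util.concurrent.ConcurrentHashMap;
    import java.util.concurrent.atomic.AtomicInteger;

    import org.apache.http.conn.routing.HttpRoute;

    // Sketch only: one counter per route plus a global one. The
    // check-then-increment is racy, so a few extra connections can slip
    // through under load - the trade-off Oleg described - but no thread
    // ever blocks on a global lock.
    class LenientLimiter {

        private final ConcurrentHashMap<HttpRoute, AtomicInteger> perRoute =
                new ConcurrentHashMap<HttpRoute, AtomicInteger>();
        private final AtomicInteger total = new AtomicInteger();
        private final int maxPerRoute;
        private final int maxTotal;

        LenientLimiter(int maxPerRoute, int maxTotal) {
            this.maxPerRoute = maxPerRoute;
            this.maxTotal = maxTotal;
        }

        boolean tryAcquire(HttpRoute route) {
            AtomicInteger count = perRoute.get(route);
            if (count == null) {
                AtomicInteger fresh = new AtomicInteger();
                count = perRoute.putIfAbsent(route, fresh);
                if (count == null) {
                    count = fresh;
                }
            }
            if (count.get() >= maxPerRoute || total.get() >= maxTotal) {
                return false; // soft limit: lock-free, occasionally overshoots
            }
            count.incrementAndGet();
            total.incrementAndGet();
            return true;
        }

        void release(HttpRoute route) {
            AtomicInteger count = perRoute.get(route);
            if (count != null) {
                count.decrementAndGet();
            }
            total.decrementAndGet();
        }
    }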

Thanks,

-- Ken

--------------------------
Ken Krugler
+1 530-210-6378
http://www.scaleunlimited.com
custom big data solutions & training
Hadoop, Cascading, Cassandra & Solr






HttpClient performance with multiple threads; Re: AbstractNIOConnPool memory leak?

Posted by Oleg Kalnichevski <ol...@apache.org>.
On Sat, 2013-01-05 at 15:56 -0800, Ken Krugler wrote:
> On Jan 5, 2013, at 3:31pm, vigna wrote:
> 
> > On 5 Jan 2013, at 3:10 PM, Ken Krugler <kk...@transpac.com> wrote:
> > 
> >> So on a large box (e.g. 24 more powerful cores) I could see using upward
> >> of 10K threads being the 
> >> optimal number.
> > 
> > We are working to make 20-30K connections work on 64 cores.
> > 
> >> Just FYI about two years ago we were using big servers with lots of
> >> threads during a large-scale web 
> >> crawl, and we did run into interesting bottlenecks in HttpClient 4.0.1 (?)
> >> with lots of simultaneous 
> >> threads. I haven't had to revisit those issues with a recent release, so
> >> maybe those have been resolved.
> > 
> > 
> > Can you elaborate on that? I guess it would be priceless knowledge :).
> 
> 1. CookieStore access
> 
> > For example, during a Bixo crawl with 300 threads, I was doing regular thread dumps and inspecting the results. A very high percentage (typically > 1/3) were blocked while waiting to get access to the cookie store. By default there's only one of these per HttpClient.
> > 
> > This one was fairly easy to work around, by creating a cookie store in the local context for each request:
> > 
> >            CookieStore cookieStore = new BasicCookieStore();
> >            localContext.setAttribute(ClientContext.COOKIE_STORE, cookieStore);
> 
> 2. Scheme registry
> 
> > But I've run into a few other synchronized method/data bottlenecks, which I'm still working through. For example, at irregular intervals the bulk of my fetcher threads are blocked on getting the scheme registry
> 
> I believe this one has been fixed via the patch for https://issues.apache.org/jira/browse/HTTPCLIENT-903, and is in the current release of HttpClient.
> 

Ken,

You might want to have a look at the latest code in SVN trunk (to be
released as 4.3). Several classes such as the scheme registry that
previously had to be synchronized in order to ensure thread safety have
been replaced with immutable equivalents. There is also now a way to
create HttpClient in a minimal configuration without authentication,
state management (cookies), proxy support and other non-essential
functions. These functions are not merely disabled but physically
removed from the processing pipeline, which should result in somewhat
better performance in high thread contention scenarios, as the only
synchronization point involved in request execution would be the lock of
the connection pool. Minimal HttpClient may be particularly useful for
anonymous web crawling when authentication and state management are not
required.


> 3. Global lock on connection pool
> 
> Oleg had written:
> 
> > Yes, your observation is correct. The problem is that the connection
> > pool is guarded by a global lock. Naturally if you have 400 threads
> > trying to obtain a connection at about the same time all of them end up
> > contending for one lock. The problem is that I can't think of a
> > different way to ensure the max limits (per route and total) are
> > guaranteed not to be exceeded. If anyone can think of a better algorithm
> > please do let me know. What might be a possibility is creating a more
> > lenient and less prone to lock contention issues implementation that may
> > under stress occasionally allocate a few more connections than the max
> > limits.
> 
> I don't know if this has been resolved. My work-around from a few years ago was to rely on having multiple Hadoop reducers running on the server (each in their own JVM), where I could then limit each JVM to at most 300 connections.
> 

I experimented with the idea of lock-less (unlimited) connection manager
but in my tests it did not perform any better than the standard
connection manager.

I am attaching the source code of my experimental connection manager.
Feel free to improve on it and see if it produces better results for your
particular application.

Oleg

> HTH,
> 
> -- Ken
> 
> --------------------------
> Ken Krugler
> +1 530-210-6378
> http://www.scaleunlimited.com
> custom big data solutions & training
> Hadoop, Cascading, Cassandra & Solr
> 
> 
> 
> 
> 


Re: AbstractNIOConnPool memory leak?

Posted by Ken Krugler <kk...@transpac.com>.
On Jan 5, 2013, at 3:31pm, vigna wrote:

> On 5 Jan 2013, at 3:10 PM, Ken Krugler <kk...@transpac.com> wrote:
> 
>> So on a large box (e.g. 24 more powerful cores) I could see using upward
>> of 10K threads being the 
>> optimal number.
> 
> We are working to make 20-30K connections work on 64 cores.
> 
>> Just FYI about two years ago we were using big servers with lots of
>> threads during a large-scale web 
>> crawl, and we did run into interesting bottlenecks in HttpClient 4.0.1 (?)
>> with lots of simultaneous 
>> threads. I haven't had to revisit those issues with a recent release, so
>> maybe those have been resolved.
> 
> 
> Can you elaborate on that? I guess it would be priceless knowledge :).

1. CookieStore access

> For example, during a Bixo crawl with 300 threads, I was doing regular thread dumps and inspecting the results. A very high percentage (typically > 1/3) were blocked while waiting to get access to the cookie store. By default there's only one of these per HttpClient.
> 
> This one was fairly easy to work around, by creating a cookie store in the local context for each request:
> 
>            CookieStore cookieStore = new BasicCookieStore();
>            localContext.setAttribute(ClientContext.COOKIE_STORE, cookieStore);

2. Scheme registry

> But I've run into a few other synchronized method/data bottlenecks, which I'm still working through. For example, at irregular intervals the bulk of my fetcher threads are blocked on getting the scheme registry

I believe this one has been fixed via the patch for https://issues.apache.org/jira/browse/HTTPCLIENT-903, and is in the current release of HttpClient.

3. Global lock on connection pool

Oleg had written:

> Yes, your observation is correct. The problem is that the connection
> pool is guarded by a global lock. Naturally if you have 400 threads
> trying to obtain a connection at about the same time all of them end up
> contending for one lock. The problem is that I can't think of a
> different way to ensure the max limits (per route and total) are
> guaranteed not to be exceeded. If anyone can think of a better algorithm
> please do let me know. What might be a possibility is creating a more
> lenient and less prone to lock contention issues implementation that may
> under stress occasionally allocate a few more connections than the max
> limits.

I don't know if this has been resolved. My work-around from a few years ago was to rely on having multiple Hadoop reducers running on the server (each in their own JVM), where I could then limit each JVM to at most 300 connections.

HTH,

-- Ken

--------------------------
Ken Krugler
+1 530-210-6378
http://www.scaleunlimited.com
custom big data solutions & training
Hadoop, Cascading, Cassandra & Solr






Re: AbstractNIOConnPool memory leak?

Posted by vigna <vi...@di.unimi.it>.
On 5 Jan 2013, at 3:10 PM, Ken Krugler <kk...@transpac.com> wrote:

> So on a large box (e.g. 24 more powerful cores) I could see using upward
> of 10K threads being the 
> optimal number.

We are working to make 20-30K connections work on 64 cores.

> Just FYI about two years ago we were using big servers with lots of
> threads during a large-scale web 
> crawl, and we did run into interesting bottlenecks in HttpClient 4.0.1 (?)
> with lots of simultaneous 
> threads. I haven't had to revisit those issues with a recent release, so
> maybe those have been resolved.


Can you elaborate on that? I guess it would be priceless knowledge :).

Ciao,

					seba







Re: AbstractNIOConnPool memory leak?

Posted by Ken Krugler <kk...@transpac.com>.
On Jan 5, 2013, at 2:11pm, sebb wrote:

> On 5 January 2013 21:33, vigna <vi...@di.unimi.it> wrote:
>>> But why would you want a web crawler to have 10-20K simultaneously
>>> opened connections in the first place?
>> 
>> (I thought I answered this, but it's not on the archive. Boh.)
>> 
>> Having a few thousand connections open is the only way to retrieve data
>> respecting politeness (e.g., not banging the same site too often).
> 
> Huh?
> There are surely other ways to achieve that goal.

For a beefy server, having a few thousand open connections (one per domain or IP address) is a standard solution for web crawling.

There may well be better solutions, but from personal experience even on a "small" server (e.g. Amazon m1.large, so roughly 4 wimpy cores) you can effectively use 500+ threads.

So on a large box (e.g. 24 more powerful cores) I could see using upward of 10K threads being the optimal number.

This assumes you've got a big pipe, and a pretty good DNS system/cache.

This also assumes that you're not trying to parse the downloaded files at the same time, as otherwise available CPU will be the limiting factor.

Just FYI about two years ago we were using big servers with lots of threads during a large-scale web crawl, and we did run into interesting bottlenecks in HttpClient 4.0.1 (?) with lots of simultaneous threads. I haven't had to revisit those issues with a recent release, so maybe those have been resolved.

-- Ken

--------------------------------------------
http://about.me/kkrugler
+1 530-210-6378







Re: AbstractNIOConnPool memory leak?

Posted by sebb <se...@gmail.com>.
On 5 January 2013 21:33, vigna <vi...@di.unimi.it> wrote:
>> But why would you want a web crawler to have 10-20K simultaneously
>> opened connections in the first place?
>
> (I thought I answered this, but it's not on the archive. Boh.)
>
> Having a few thousand connections open is the only way to retrieve data
> respecting politeness (e.g., not banging the same site too often).

Huh?
There are surely other ways to achieve that goal.

> I have another question:

Please start a new thread for a new question.



Re: AbstractNIOConnPool memory leak?

Posted by vigna <vi...@di.unimi.it>.
> But why would you want a web crawler to have 10-20K simultaneously 
> opened connections in the first place? 

(I thought I answered this, but it's not on the archive. Boh.)

Having a few thousand connections open is the only way to retrieve data
respecting politeness (e.g., not banging the same site too often).

I have another question: are there any suggested parameters for the
asynchronous client in the case of several thousand parallel requests (e.g.,
for the IOReactor)? We are experimenting both with DefaultHttpClient and
DefaultHttpAsyncClient, and with the same configuration (e.g., 4000 threads
using DefaultHttpClient or 64 threads pushing 4000 async requests into a
default DefaultHttpAsyncClient) we see completely different behaviours. The
sync client fetches more than 10000 pages/s; the async client fetches
50 pages/s. Should we increase the number of threads or the I/O interval of
the IOReactor?
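
To be concrete, these are the knobs I mean (a sketch against the 4.2-era
NIO API as far as I understand it; the values mirror the setup above):

    import org.apache.http.impl.nio.client.DefaultHttpAsyncClient;
    import org.apache.http.impl.nio.conn.PoolingClientAsyncConnectionManager;
    import org.apache.http.impl.nio.reactor.DefaultConnectingIOReactor;
    import org.apache.http.impl.nio.reactor.IOReactorConfig;
    import org.apache.http.nio.reactor.ConnectingIOReactor;

    // The tuning knobs in question (values mirror the setup described above)
    IOReactorConfig config = new IOReactorConfig();
    config.setIoThreadCount(64);   // I/O dispatcher threads
    config.setSelectInterval(100); // ms between selector wake-ups

    ConnectingIOReactor ioReactor = new DefaultConnectingIOReactor(config);
    PoolingClientAsyncConnectionManager connMgr =
            new PoolingClientAsyncConnectionManager(ioReactor);
    connMgr.setMaxTotal(4000);        // 4000 requests in flight overall
    connMgr.setDefaultMaxPerRoute(1); // politeness: one connection per site

    DefaultHttpAsyncClient client = new DefaultHttpAsyncClient(connMgr);
    client.start();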







Re: AbstractNIOConnPool memory leak?

Posted by Oleg Kalnichevski <ol...@apache.org>.
On Thu, 2012-12-27 at 12:30 -0800, Sebastiano Vigna wrote:
> On 26 December 2012 13:10, Oleg Kalnichevski <ol...@apache.org> wrote:
> 
> >
> > Just out of curiosity, why are you using an asynchronous HTTP client for a
> > web crawler? I personally would consider a blocking HTTP client a much
> > better choice for a heavy duty web crawler.
> >
> 
> Well... how would you manage 10-20K simultaneously opened connections with
> a synchronous client?

But why would you want a web crawler to have 10-20K simultaneously
opened connections in the first place?

Oleg

>  We tried with that number of threads each with a
> DefaultHttpClient, but the problem is that as soon as there is any
> contention (even on logging) it slows everything down terribly.





Re: AbstractNIOConnPool memory leak?

Posted by Sebastiano Vigna <vi...@di.unimi.it>.
On 26 December 2012 13:10, Oleg Kalnichevski <ol...@apache.org> wrote:

>
> Just out of curiosity, why are you using an asynchronous HTTP client for a
> web crawler? I personally would consider a blocking HTTP client a much
> better choice for a heavy duty web crawler.
>

Well... how would you manage 10-20K simultaneously opened connections with
a synchronous client? We tried with that number of threads each with a
DefaultHttpClient, but the problem is that as soon as there is any
contention (even on logging) it slows everything down terribly.

Re: AbstractNIOConnPool memory leak?

Posted by Oleg Kalnichevski <ol...@apache.org>.
On Tue, 2012-12-25 at 08:53 -0800, vigna wrote:
> Well, if you do a world-wide crawl with pages on the order of billions it
> happens. This particular problem arose with a proxy simulating 100,000,000
> sites during a crawl.
> 
> I agree that it is an event that can happen only with very specific
> applications, like high-performance crawlers, but it is not impossible.
> 
> 

Assuming that the crawler traverses various hosts more or less
sequentially, a very simple fix to the problem would be to remove per-route
pools once they become empty, in order to prevent the map from growing
beyond the total max number of concurrent connections.
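
Schematically, the idea would be something like this (not the actual
AbstractNIOConnPool code, and the accessor names are approximate):

    // Idea only: once a per-route pool holds no connections and has no
    // pending lease requests, drop it from the map, so the map tracks
    // only live routes (accessor names are approximate).
    private void purgeIfEmpty(final T route) {
        final RouteSpecificPool<T, C, E> pool = this.routeToPool.get(route);
        if (pool != null
                && pool.getAllocatedCount() == 0
                && pool.getPendingCount() == 0) {
            this.routeToPool.remove(route);
        }
    }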

Just out of curiosity, why are you using an asynchronous HTTP client for a
web crawler? I personally would consider a blocking HTTP client a much
better choice for a heavy duty web crawler.

Oleg





Re: AbstractNIOConnPool memory leak?

Posted by vigna <vi...@di.unimi.it>.
Well, if you do a world-wide crawl with pages on the order of billions it
happens. This particular problem arose with a proxy simulating 100,000,000
sites during a crawl.

I agree that it is an event that can happen only with very specific
applications, like high-performance crawlers, but it is not impossible.






Re: AbstractNIOConnPool memory leak?

Posted by Oleg Kalnichevski <ol...@apache.org>.
On Mon, 2012-12-24 at 21:01 -0800, vigna wrote:
> Well, actually not. I did a careful post-mortem analysis using the Eclipse
> Memory Analyzer, and, simply, the map is huge. There are more than a
> million entries, and each entry uses approximately 256 bytes for connections
> and routes.
> 

A million entries in the pool? This could only happen if there were a
million unique routes tracked by HttpClient. That seems unlikely.

Oleg





Re: AbstractNIOConnPool memory leak?

Posted by vigna <vi...@di.unimi.it>.
Well, actually not. I did a careful post-mortem analysis using the Eclipse
Memory Analyzer, and, simply, the map is huge. There are more than a
million entries, and each entry uses approximately 256 bytes for connections
and routes.

Much of this space is unfortunately overhead from java.util's data structures.
HashMap and HashSet are essentially unusable in any situation in which
memory footprint is relevant. The space used by an entry (48 bytes) is
comparable to the data you're storing.

Replacing HashSet/HashMap with fastutil's
ObjectOpenHashSet/ObjectOpenHashMap or similar structures based on open
hashing would, I believe, halve the memory footprint. They're slightly
slower, but their overhead is an order of magnitude smaller.
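
For the map, the change would be essentially one line, since fastutil's
maps implement java.util.Map:

    import java.util.Map;

    import it.unimi.dsi.fastutil.objects.Object2ObjectOpenHashMap;

    // Same java.util.Map interface, but open addressing underneath:
    // no 48-byte HashMap.Entry object per element
    Map<T, RouteSpecificPool<T, C, E>> routeToPool =
            new Object2ObjectOpenHashMap<T, RouteSpecificPool<T, C, E>>();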






Re: AbstractNIOConnPool memory leak?

Posted by Oleg Kalnichevski <ol...@apache.org>.
On Sun, 2012-12-23 at 18:33 -0800, vigna wrote:
> I'm following up on myself.
> 
> Apparently there was a mistake on our part: we were allocating an
> asynchronous client for each of our 64 worker threads, instead of having a
> single one. In this way the memory allocation per client of the routeToPool
> (~256M) skyrocketed to 16G. We are now using a single async client for the
> whole application (~20,000 simultaneous connections) and everything seems to
> work much better.
> 
> Nonetheless, the routeToPool map will apparently never shrink. For a
> long-term application accessing millions of sites this might be a problem.
> We will see whether it is possible to modify the class to use Google
> Guava's caches for this purpose.
> 
> 

One needs to call #closeExpiredConnections and / or
#closeIdleConnections methods on the connection pool in order to
pro-actively evict expired and / or idle connections from the pool.  
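
E.g. from a monitor thread (a sketch: connMgr stands for whatever
connection manager the client was created with, and the intervals are
arbitrary):

    import java.util.concurrent.TimeUnit;

    // Sketch of a monitor thread doing the eviction periodically; the
    // intervals are arbitrary.
    Thread janitor = new Thread(new Runnable() {
        public void run() {
            try {
                while (!Thread.currentThread().isInterrupted()) {
                    Thread.sleep(5000);
                    connMgr.closeExpiredConnections();
                    connMgr.closeIdleConnections(30, TimeUnit.SECONDS);
                }
            } catch (InterruptedException ex) {
                // shutting down
            }
        }
    });
    janitor.setDaemon(true);
    janitor.start();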

I think the reason for a large memory footprint is not the routeToPool
itself but rather all sorts of stuff still stuck in the I/O session
context from the last request execution. Generally, it is the
responsibility of the caller to remove objects from the local context
upon request completion. However, certain cleanups could be (and should
be) done by the framework. Feel free, though, to raise a JIRA for this
issue and I will make sure it gets looked into.
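
E.g. in the completion callback, along these lines (the attribute names
are examples only):

    // In the FutureCallback, drop per-request state from the local context
    // once the exchange completes; the attribute names are examples only.
    public void completed(final HttpResponse response) {
        localContext.removeAttribute(ExecutionContext.HTTP_REQUEST);
        localContext.removeAttribute(ExecutionContext.HTTP_RESPONSE);
        localContext.removeAttribute(ExecutionContext.HTTP_CONNECTION);
    }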

Oleg





Re: AbstractNIOConnPool memory leak?

Posted by Alexey Panchenko <al...@gmail.com>.
Hi,

The pool size can be limited like this, without using other libraries:

new LinkedHashMap<T, RouteSpecificPool<T, C, E>>(initialCapacity, loadFactor,
        true /* access order: the eldest entry is the least recently used */) {
    @Override
    protected boolean removeEldestEntry(
            Map.Entry<T, RouteSpecificPool<T, C, E>> eldest) {
        if (size() > MAX_SIZE) {
            eldest.getValue().shutdown(); // release the route's connections
            remove(eldest.getKey());
        }
        // always false: when over the limit we removed the entry ourselves,
        // which removeEldestEntry explicitly permits
        return false;
    }
}

Regards,
Alex



On Mon, Dec 24, 2012 at 9:33 AM, vigna <vi...@di.unimi.it> wrote:

> I'm following up on myself.
>
> Apparently there was a mistake on our part: we were allocating an
> asynchronous client for each of our 64 worker threads, instead of having a
> single one. In this way the memory allocation per client of the routeToPool
> (~256M) skyrocketed to 16G. We are now using a single async client for the
> whole application (~20,000 simultaneous connections) and everything seems
> to work much better.
>
> Nonetheless, the routeToPool map will apparently never shrink. For a
> long-term application accessing millions of sites this might be a problem.
> We will see whether it is possible to modify the class to use Google
> Guava's caches for this purpose.

Re: AbstractNIOConnPool memory leak?

Posted by vigna <vi...@di.unimi.it>.
I'm following up on myself.

Apparently there was a mistake on our part: we were allocating an
asynchronous client for each of our 64 worker threads, instead of having a
single one. In this way the memory allocation per client of the routeToPool
(~256M) skyrocketed to 16G. We are now using a single async client for the
whole application (~20,000 simultaneous connections) and everything seems to
work much better.

Nonetheless, the routeToPool map will apparently never shrink. For a
long-term application accessing millions of sites this might be a problem.
We will see whether it is possible to modify the class to use Google
Guava's caches for this purpose.
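
The idea would be something along these lines (the bound and the timeout
are invented):

    import java.util.concurrent.TimeUnit;

    import com.google.common.cache.Cache;
    import com.google.common.cache.CacheBuilder;
    import com.google.common.cache.RemovalListener;
    import com.google.common.cache.RemovalNotification;

    // Size-bounded, idle-expiring replacement for routeToPool; the bound
    // and the timeout are invented for illustration.
    Cache<T, RouteSpecificPool<T, C, E>> routeToPool = CacheBuilder.newBuilder()
            .maximumSize(20000)
            .expireAfterAccess(5, TimeUnit.MINUTES)
            .removalListener(new RemovalListener<T, RouteSpecificPool<T, C, E>>() {
                public void onRemoval(
                        RemovalNotification<T, RouteSpecificPool<T, C, E>> note) {
                    note.getValue().shutdown(); // close the evicted route's connections
                }
            })
            .build();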



