Posted to httpclient-users@hc.apache.org by Oleg Kalnichevski <ol...@apache.org> on 2013/01/06 13:55:56 UTC

HttpClient performance with multiple threads; Re: AbstractNIOConnPool memory leak?

On Sat, 2013-01-05 at 15:56 -0800, Ken Krugler wrote:
> On Jan 5, 2013, at 3:31pm, vigna wrote:
> 
> > On 5 Jan 2013, at 3:10 PM, Ken Krugler <kk...@transpac.com> wrote:
> > 
> >> So on a large box (e.g. 24 more powerful cores) I could see using upward
> >> of 10K threads being the 
> >> optimal number.
> > 
> > We are working to make 20-30K connections work on 64 cores.
> > 
> >> Just FYI about two years ago we were using big servers with lots of
> >> threads during a large-scale web 
> >> crawl, and we did run into interesting bottlenecks in HttpClient 4.0.1 (?)
> >> with lots of simultaneous 
> >> threads. I haven't had to revisit those issues with a recent release, so
> >> maybe those have been resolved.
> > 
> > 
> > Can you elaborate on that? I guess it would be priceless knowledge :).
> 
> 1. CookieStore access
> 
> > For example, during a Bixo crawl with 300 threads, I was doing regular thread dumps and inspecting the results. A very high percentage (typically > 1/3) were blocked while waiting to get access to the cookie store. By default there's only one of these per HttpClient.
> > 
> > This one was fairly easy to work around, by creating a cookie store in the local context for each request:
> > 
> >            CookieStore cookieStore = new BasicCookieStore();
> >            localContext.setAttribute(ClientContext.COOKIE_STORE, cookieStore);
> 
> 2. Scheme registry
> 
> > But I've run into a few other synchronized method/data bottlenecks, which I'm still working through. For example, at irregular intervals the bulk of my fetcher threads are blocked on getting the scheme registry
> 
> I believe this one has been fixed via the patch for https://issues.apache.org/jira/browse/HTTPCLIENT-903, and is in the current release of HttpClient.
> 

Ken,

You might want to have a look at the latest code in SVN trunk (to be
released as 4.3). Several classes such as the scheme registry that
previously had to be synchronized in order to ensure thread safety have
been replaced with immutable equivalents. There is also now a way to
create HttpClient in a minimal configuration without authentication,
state management (cookies), proxy support and other non-essential
functions. These functions are not merely disabled but physically
removed from the processing pipeline, which should result in somewhat
better performance in high thread contention scenarios, as the only
synchronization point involved in request execution would be the lock of
the connection pool. Minimal HttpClient may be particularly useful for
anonymous web crawling when authentication and state management are not
required.


> 3. Global lock on connection pool
> 
> Oleg had written:
> 
> > Yes, your observation is correct. The problem is that the connection
> > pool is guarded by a global lock. Naturally if you have 400 threads
> > trying to obtain a connection at about the same time all of them end up
> > contending for one lock. The problem is that I can't think of a
> > different way to ensure the max limits (per route and total) are
> > guaranteed not to be exceeded. If anyone can think of a better algorithm
> > please do let me know. What might be a possibility is creating a more
> > lenient and less prone to lock contention issues implementation that may
> > under stress occasionally allocate a few more connections than the max
> > limits.
> 
> I don't know if this has been resolved. My work-around from a few years ago was to rely on having multiple Hadoop reducers running on the server (each in their own JVM), where I could then limit each JVM to at most 300 connections.
> 

I experimented with the idea of a lock-less (unlimited) connection manager
but in my tests it did not perform any better than the standard
connection manager.

I am attaching the source code of my experimental connection manager.
Feel free to improve on it and see if it produces better results for your
particular application.
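For readers following along, the lenient idea can be sketched roughly as
follows (a hypothetical simplification for illustration, NOT the attached
experimental code): acquisition optimistically increments an atomic counter
and backs off if the cap was exceeded, so threads never serialize on a
pool-wide lock, at the cost of transiently overshooting the limit.

```java
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical sketch of a lenient, lock-free limit on leased connections.
// Threads increment first and check afterwards, so the count may briefly
// exceed maxTotal under contention before the losers back off.
class LenientLimit {
    private final AtomicInteger leased = new AtomicInteger(0);
    private final int maxTotal;

    LenientLimit(int maxTotal) {
        this.maxTotal = maxTotal;
    }

    /** Try to lease a connection slot; false if the pool is at capacity. */
    boolean tryAcquire() {
        if (leased.incrementAndGet() > maxTotal) {
            leased.decrementAndGet(); // over the cap: undo and report failure
            return false;
        }
        return true;
    }

    /** Return a previously leased slot. */
    void release() {
        leased.decrementAndGet();
    }
}
```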

Oleg

> HTH,
> 
> -- Ken
> 
> --------------------------
> Ken Krugler
> +1 530-210-6378
> http://www.scaleunlimited.com
> custom big data solutions & training
> Hadoop, Cascading, Cassandra & Solr
> 
> 
> 
> 
> 


Re: HttpClient performance with multiple threads; Re: AbstractNIOConnPool memory leak?

Posted by Ken Krugler <kk...@transpac.com>.
Hi Oleg,

Thanks for the responses. I've filed a Bixo issue to try using the new minimal version of HttpClient, and also the unlimited connection manager.

I'll try to test using an existing crawl workflow that hits the top-level pages for 60K domains, though that's not exactly the same as a large-scale crawl.

-- Ken


On Jan 7, 2013, at 2:39am, Oleg Kalnichevski wrote:

> On Sun, 2013-01-06 at 15:48 -0800, Ken Krugler wrote:
>> Hi Oleg,
>> 
>> [snip]
>> 
>>> Ken,
>>> 
>>> You might want to have a look at the latest code in SVN trunk (to be
>>> released as 4.3). Several classes such as the scheme registry that
>>> previously had to be synchronized in order to ensure thread safety have
>>> been replaced with immutable equivalents. There is also now a way to
>>> create HttpClient in a minimal configuration without authentication,
>>> state management (cookies), proxy support and other non-essential
>>> functions.
>> 
>> That sounds interesting - any hints as to how to create this minimal HttpClient?
>> 
> 
> The new API is not yet final and not properly documented. Presently this
> can be done with HttpClients#createMinimal.
> 
> 
>>> These functions are not merely disabled but physically
>>> removed from the processing pipeline, which should result in somewhat
>>> better performance in high thread contention scenarios, as the only
>>> synchronization point involved in request execution would be the lock of
>>> the connection pool. Minimal HttpClient may be particularly useful for
>>> anonymous web crawling when authentication and state management are not
>>> required.
>>> 
>>> 
>>>> 3. Global lock on connection pool
>>>> 
>>>> Oleg had written:
>>>> 
>>>>> Yes, your observation is correct. The problem is that the connection
>>>>> pool is guarded by a global lock. Naturally if you have 400 threads
>>>>> trying to obtain a connection at about the same time all of them end up
>>>>> contending for one lock. The problem is that I can't think of a
>>>>> different way to ensure the max limits (per route and total) are
>>>>> guaranteed not to be exceeded. If anyone can think of a better algorithm
>>>>> please do let me know. What might be a possibility is creating a more
>>>>> lenient and less prone to lock contention issues implementation that may
>>>>> under stress occasionally allocate a few more connections than the max
>>>>> limits.
>>>> 
>>>> I don't know if this has been resolved. My work-around from a few years ago was to rely on having multiple Hadoop reducers running on the server (each in their own JVM), where I could then limit each JVM to at most 300 connections.
>>>> 
>>> 
>>> I experimented with the idea of a lock-less (unlimited) connection manager
>>> but in my tests it did not perform any better than the standard
>>> connection manager.
>> 
>> Previously I'd asked:
>> 
>>> Would it work to go for finer-grained locking, by using atomic counters to track & enforce limits on per route/total connections?
>> 
>> Any thoughts on that approach? E.g. have a map from route to atomic counter, and a single atomic counter for total connections?
>> 
> 
> This may be worthwhile to try. However, in theory this should not
> perform any better than the approach I took with my experiments. The
> main problem, though, is that I do not have a good test framework that
> emulates the environment a web crawler is expected to operate in (and
> have no justification for building one in my spare time). So, this kind
> of effort ideally should be led by an external contributor.
> 
> Oleg
> 
> 
> 
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: httpclient-users-unsubscribe@hc.apache.org
> For additional commands, e-mail: httpclient-users-help@hc.apache.org
> 

--------------------------
Ken Krugler
+1 530-210-6378
http://www.scaleunlimited.com
custom big data solutions & training
Hadoop, Cascading, Cassandra & Solr













Re: HttpClient performance with multiple threads; Re: AbstractNIOConnPool memory leak?

Posted by Oleg Kalnichevski <ol...@apache.org>.
On Sun, 2013-01-06 at 15:48 -0800, Ken Krugler wrote:
> Hi Oleg,
> 
> [snip]
> 
> > Ken,
> > 
> > You might want to have a look at the latest code in SVN trunk (to be
> > released as 4.3). Several classes such as the scheme registry that
> > previously had to be synchronized in order to ensure thread safety have
> > been replaced with immutable equivalents. There is also now a way to
> > create HttpClient in a minimal configuration without authentication,
> > state management (cookies), proxy support and other non-essential
> > functions.
> 
> That sounds interesting - any hints as to how to create this minimal HttpClient?
> 

The new API is not yet final and not properly documented. Presently this
can be done with HttpClients#createMinimal.
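For illustration, usage would look something like this (a sketch against the
4.3 trunk API which, as noted, is not yet final, so the exact signatures may
change):

```java
import java.io.IOException;

import org.apache.http.client.methods.CloseableHttpResponse;
import org.apache.http.client.methods.HttpGet;
import org.apache.http.impl.client.CloseableHttpClient;
import org.apache.http.impl.client.HttpClients;

public class MinimalClientExample {
    public static void main(String[] args) throws IOException {
        // Minimal client: authentication, cookies, proxy support and other
        // non-essential functions are left out of the processing pipeline.
        CloseableHttpClient client = HttpClients.createMinimal();
        try {
            CloseableHttpResponse response =
                    client.execute(new HttpGet("http://example.com/"));
            try {
                System.out.println(response.getStatusLine());
            } finally {
                response.close();
            }
        } finally {
            client.close();
        }
    }
}
```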


> > These functions are not merely disabled but physically
> > removed from the processing pipeline, which should result in somewhat
> > better performance in high thread contention scenarios, as the only
> > synchronization point involved in request execution would be the lock of
> > the connection pool. Minimal HttpClient may be particularly useful for
> > anonymous web crawling when authentication and state management are not
> > required.
> > 
> > 
> >> 3. Global lock on connection pool
> >> 
> >> Oleg had written:
> >> 
> >>> Yes, your observation is correct. The problem is that the connection
> >>> pool is guarded by a global lock. Naturally if you have 400 threads
> >>> trying to obtain a connection at about the same time all of them end up
> >>> contending for one lock. The problem is that I can't think of a
> >>> different way to ensure the max limits (per route and total) are
> >>> guaranteed not to be exceeded. If anyone can think of a better algorithm
> >>> please do let me know. What might be a possibility is creating a more
> >>> lenient and less prone to lock contention issues implementation that may
> >>> under stress occasionally allocate a few more connections than the max
> >>> limits.
> >> 
> >> I don't know if this has been resolved. My work-around from a few years ago was to rely on having multiple Hadoop reducers running on the server (each in their own JVM), where I could then limit each JVM to at most 300 connections.
> >> 
> > 
> > I experimented with the idea of a lock-less (unlimited) connection manager
> > but in my tests it did not perform any better than the standard
> > connection manager.
> 
> Previously I'd asked:
> 
> > Would it work to go for finer-grained locking, by using atomic counters to track & enforce limits on per route/total connections?
> 
> Any thoughts on that approach? E.g. have a map from route to atomic counter, and a single atomic counter for total connections?
> 

This may be worthwhile to try. However, in theory this should not
perform any better than the approach I took with my experiments. The
main problem, though, is that I do not have a good test framework that
emulates the environment a web crawler is expected to operate in (and
have no justification for building one in my spare time). So, this kind
of effort ideally should be led by an external contributor.
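As a rough sketch of the atomic-counter idea being discussed (hypothetical
code, not an existing HttpClient class): a global counter plus a map of
per-route counters, each enforced with a compare-and-set loop rather than one
global lock.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical sketch: enforce per-route and total connection limits with
// atomic counters instead of a single pool-wide lock.
class CountingLimiter {
    private final Map<String, AtomicInteger> perRoute = new ConcurrentHashMap<>();
    private final AtomicInteger total = new AtomicInteger(0);
    private final int maxPerRoute;
    private final int maxTotal;

    CountingLimiter(int maxPerRoute, int maxTotal) {
        this.maxPerRoute = maxPerRoute;
        this.maxTotal = maxTotal;
    }

    /** Try to lease a connection for the route; false if a limit is hit. */
    boolean tryAcquire(String route) {
        AtomicInteger routeCount =
                perRoute.computeIfAbsent(route, r -> new AtomicInteger(0));
        if (!incrementBelow(routeCount, maxPerRoute)) {
            return false; // per-route limit reached
        }
        if (!incrementBelow(total, maxTotal)) {
            routeCount.decrementAndGet(); // roll back the route count
            return false; // total limit reached
        }
        return true;
    }

    /** Return a previously leased connection for the route. */
    void release(String route) {
        perRoute.get(route).decrementAndGet();
        total.decrementAndGet();
    }

    // CAS loop: increment the counter only while it stays below the limit.
    private static boolean incrementBelow(AtomicInteger counter, int limit) {
        for (;;) {
            int current = counter.get();
            if (current >= limit) {
                return false;
            }
            if (counter.compareAndSet(current, current + 1)) {
                return true;
            }
        }
    }
}
```

Note that, unlike the strict pool lock, this sketch cannot make a thread wait
for a freed connection; callers would have to retry or fail fast.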

Oleg



---------------------------------------------------------------------
To unsubscribe, e-mail: httpclient-users-unsubscribe@hc.apache.org
For additional commands, e-mail: httpclient-users-help@hc.apache.org


Re: HttpClient performance with multiple threads; Re: AbstractNIOConnPool memory leak?

Posted by Ken Krugler <kk...@transpac.com>.
Hi Oleg,

[snip]

> Ken,
> 
> You might want to have a look at the latest code in SVN trunk (to be
> released as 4.3). Several classes such as the scheme registry that
> previously had to be synchronized in order to ensure thread safety have
> been replaced with immutable equivalents. There is also now a way to
> create HttpClient in a minimal configuration without authentication,
> state management (cookies), proxy support and other non-essential
> functions.

That sounds interesting - any hints as to how to create this minimal HttpClient?

> These functions are not merely disabled but physically
> removed from the processing pipeline, which should result in somewhat
> better performance in high thread contention scenarios, as the only
> synchronization point involved in request execution would be the lock of
> the connection pool. Minimal HttpClient may be particularly useful for
> anonymous web crawling when authentication and state management are not
> required.
> 
> 
>> 3. Global lock on connection pool
>> 
>> Oleg had written:
>> 
>>> Yes, your observation is correct. The problem is that the connection
>>> pool is guarded by a global lock. Naturally if you have 400 threads
>>> trying to obtain a connection at about the same time all of them end up
>>> contending for one lock. The problem is that I can't think of a
>>> different way to ensure the max limits (per route and total) are
>>> guaranteed not to be exceeded. If anyone can think of a better algorithm
>>> please do let me know. What might be a possibility is creating a more
>>> lenient and less prone to lock contention issues implementation that may
>>> under stress occasionally allocate a few more connections than the max
>>> limits.
>> 
>> I don't know if this has been resolved. My work-around from a few years ago was to rely on having multiple Hadoop reducers running on the server (each in their own JVM), where I could then limit each JVM to at most 300 connections.
>> 
> 
> I experimented with the idea of a lock-less (unlimited) connection manager
> but in my tests it did not perform any better than the standard
> connection manager.

Previously I'd asked:

> Would it work to go for finer-grained locking, by using atomic counters to track & enforce limits on per route/total connections?

Any thoughts on that approach? E.g. have a map from route to atomic counter, and a single atomic counter for total connections?

Thanks,

-- Ken

--------------------------
Ken Krugler
+1 530-210-6378
http://www.scaleunlimited.com
custom big data solutions & training
Hadoop, Cascading, Cassandra & Solr