Posted to httpclient-users@hc.apache.org by Dvora <ba...@gmail.com> on 2012/01/23 20:36:08 UTC

Understanding how ThreadSafeClientConnManager parameters affect number of tcp connections

Hi,

I would like to write a high-performance web crawler using httpclient 4.1.2.
In order to bring the machine to its highest throughput, each crawling thread
creates a DefaultHttpClient with a pool configured as follows (based on one
of the examples):

static
	{
		cm = new ThreadSafeClientConnManager();
		cm.setMaxTotal( 50000 );
		cm.setDefaultMaxPerRoute( Integer.MAX_VALUE );

		HttpClient client = new DefaultHttpClient();

		params = client.getParams();

		HttpClientParams.setRedirecting( params, false );
		HttpClientParams.setAuthenticating( params, true );

		HttpConnectionParams.setSoTimeout( params, 30000 );
		HttpConnectionParams.setConnectionTimeout( params, 30000 );

		IdleConnectionEvictor connEvictor = new IdleConnectionEvictor( cm );

		connEvictor.start();
	}

When running the application with lots of crawling threads, netstat shows
only 2k tcp connections in status ESTABLISHED. Is this expected considering
maxTotal = 50000? Are there other bottlenecks (OS level, etc.) preventing the
application from reaching more than 2k tcp connections?

Thanks.


-- 
View this message in context: http://old.nabble.com/Understanding-how-ThreadSafeClientConnManager-parameters-affect-number-of-tcp-connections-tp33190497p33190497.html
Sent from the HttpClient-User mailing list archive at Nabble.com.


---------------------------------------------------------------------
To unsubscribe, e-mail: httpclient-users-unsubscribe@hc.apache.org
For additional commands, e-mail: httpclient-users-help@hc.apache.org


Re: Understanding how ThreadSafeClientConnManager parameters affect number of tcp connections

Posted by Dvora <ba...@gmail.com>.
Of course, forgot to mention that detail… :-)

The client instantiation is done per thread as follows:

HttpClient client = new DefaultHttpClient( cm, params );

Where cm and params were initialized in the static block.



dcheckoway wrote:
> 
> You may want to pass cm to the DefaultHttpClient constructor...

-- 
View this message in context: http://old.nabble.com/Understanding-how-ThreadSafeClientConnManager-parameters-affect-number-of-tcp-connections-tp33190497p33193107.html
Sent from the HttpClient-User mailing list archive at Nabble.com.




Re: Understanding how ThreadSafeClientConnManager parameters affect number of tcp connections

Posted by Dan Checkoway <dc...@gmail.com>.
You may want to pass cm to the DefaultHttpClient constructor...


Re: Understanding how ThreadSafeClientConnManager parameters affect number of tcp connections

Posted by Oleg Kalnichevski <ol...@apache.org>.
On Tue, 2012-01-24 at 23:15 -0800, Dvora wrote:
> 
> 
> olegk wrote:
> > 
> > 
> > How exactly did you measure that?
> > 
> > 
> 
> I'm watching Cacti graphs of the traffic on eth0; as I said, the inbound
> traffic never exceeds 2Mb/sec. I'll try to find out what the limiting
> factors are.
> 
> Thanks.
> 

This number is meaningless on its own because it cannot tell two cases
apart: either the local host is not fast enough to saturate the bandwidth,
or the local host cannot generate enough requests to saturate it because
the remote hosts do not deliver data fast enough. I am quite sure it is
the latter.

Oleg  
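
A rough way to reason about the point above: in steady state, the inbound
rate is approximately (requests in flight) x (average response size) /
(average response time). The thread gives no response sizes or latencies,
so the numbers in main() are made up, and the class and method names are
hypothetical. It is back-of-the-envelope arithmetic, not a measurement.

```java
public class ThroughputEstimate {

    /** Approximate inbound bytes/sec if `concurrent` requests are always in flight. */
    public static double bytesPerSecond(int concurrent,
                                        double avgResponseBytes,
                                        double avgResponseSeconds) {
        return concurrent * avgResponseBytes / avgResponseSeconds;
    }

    /** Approximate number of in-flight requests needed to reach a target inbound rate. */
    public static double requiredConcurrency(double targetBytesPerSecond,
                                             double avgResponseBytes,
                                             double avgResponseSeconds) {
        return targetBytesPerSecond * avgResponseSeconds / avgResponseBytes;
    }

    public static void main(String[] args) {
        // Hypothetical workload: 50 kB pages that take 2 s each to arrive.
        double rate = bytesPerSecond(100, 50_000, 2.0);
        double needed = requiredConcurrency(10_000_000, 50_000, 2.0);
        System.out.println("100 in-flight requests give about " + rate + " bytes/sec");
        System.out.println("10 MB/sec needs about " + needed + " in-flight requests");
    }
}
```

If the remote hosts are slow (large average response time), the achievable
rate drops in proportion no matter how fast the local host is.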




Re: Understanding how ThreadSafeClientConnManager parameters affect number of tcp connections

Posted by Dvora <ba...@gmail.com>.


olegk wrote:
> 
> 
> How exactly did you measure that?
> 
> 

I'm watching Cacti graphs of the traffic on eth0; as I said, the inbound
traffic never exceeds 2Mb/sec. I'll try to find out what the limiting
factors are.

Thanks.

-- 
View this message in context: http://old.nabble.com/Understanding-how-ThreadSafeClientConnManager-parameters-affect-number-of-tcp-connections-tp33190497p33199521.html
Sent from the HttpClient-User mailing list archive at Nabble.com.




Re: Understanding how ThreadSafeClientConnManager parameters affect number of tcp connections

Posted by Oleg Kalnichevski <ol...@apache.org>.
On Tue, 2012-01-24 at 12:28 -0800, Dvora wrote:
> Hmm, any idea why?
> 
> Anyway, if I may use this thread, can you suggest an optimal architecture
> for crawling using httpclient? 

I am not really qualified to make such recommendations, as I have personally
never used HttpClient for web crawling. However, as far as I know, there
are several open-source web crawler implementations based on HttpClient
which you might consider using instead of writing your own from
scratch.

> What is the best way (besides using lots of
> worker threads, which I do now) to download the maximum number of web
> pages in the minimum time and make better use of the bandwidth (right now
> it never exceeds 2Mb/sec)?
> 

How exactly did you measure that?

When running against a local web service, HttpClient can generate the
highest requests-per-second rate of all the HTTP clients benchmarked
[1]. That makes me doubt that HttpClient is the bottleneck.

Oleg

[1]
http://wiki.apache.org/HttpComponents/HttpClient3vsHttpClient4vsHttpCore
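
The idea of measuring against a local web service can be tried with nothing
but the JDK. The sketch below starts an in-process server on the loopback
interface and times sequential requests. It is a baseline harness only: it
uses HttpURLConnection as a stand-in client (an assumption made so the
example has no dependencies), it is not the benchmark from [1], and the
class name LocalBaseline is hypothetical.

```java
import com.sun.net.httpserver.HttpServer;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.io.UncheckedIOException;
import java.net.HttpURLConnection;
import java.net.InetSocketAddress;
import java.net.URL;
import java.nio.charset.StandardCharsets;

public class LocalBaseline {

    /** Fetches a tiny page from an in-process server `requests` times; returns requests/sec. */
    public static double requestsPerSecond(int requests) {
        byte[] body = "hello".getBytes(StandardCharsets.US_ASCII);
        try {
            // Port 0 lets the OS pick any free port on the loopback interface.
            HttpServer server = HttpServer.create(new InetSocketAddress("127.0.0.1", 0), 0);
            server.createContext("/", exchange -> {
                exchange.sendResponseHeaders(200, body.length);
                try (OutputStream out = exchange.getResponseBody()) {
                    out.write(body);
                }
            });
            server.start();
            try {
                URL url = new URL("http://127.0.0.1:" + server.getAddress().getPort() + "/");
                byte[] buffer = new byte[4096];
                long start = System.nanoTime();
                for (int i = 0; i < requests; i++) {
                    HttpURLConnection conn = (HttpURLConnection) url.openConnection();
                    if (conn.getResponseCode() != 200) {
                        throw new IOException("unexpected status " + conn.getResponseCode());
                    }
                    try (InputStream in = conn.getInputStream()) {
                        while (in.read(buffer) != -1) {
                            // drain the body so the connection can be reused
                        }
                    }
                }
                double seconds = (System.nanoTime() - start) / 1e9;
                return requests / seconds;
            } finally {
                server.stop(0);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println("local baseline: " + requestsPerSecond(1000) + " requests/sec");
    }
}
```

If the local rate is far above what the crawler achieves against real sites,
the client library is unlikely to be the limiting factor. To benchmark
HttpClient itself, the fetch loop would be replaced with calls to it.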



Re: Understanding how ThreadSafeClientConnManager parameters affect number of tcp connections

Posted by Dvora <ba...@gmail.com>.
Hmm, any idea why?

Anyway, if I may use this thread, can you suggest an optimal architecture
for crawling using httpclient? What is the best way (besides using lots of
worker threads, which I do now) to download the maximum number of web pages
in the minimum time and make better use of the bandwidth (right now it
never exceeds 2Mb/sec)?

Thanks.



olegk wrote:
> 
> I personally think this is to be expected. When running performance
> stress tests with 200 threads and 200 max connections limit I frequently
> observe HttpClient utilizing significantly fewer connections (~100)
> never ever reaching the max limit.  
> 
> Oleg   

-- 
View this message in context: http://old.nabble.com/Understanding-how-ThreadSafeClientConnManager-parameters-affect-number-of-tcp-connections-tp33190497p33197498.html
Sent from the HttpClient-User mailing list archive at Nabble.com.




Re: Understanding how ThreadSafeClientConnManager parameters affect number of tcp connections

Posted by Oleg Kalnichevski <ol...@apache.org>.
On Mon, 2012-01-23 at 11:36 -0800, Dvora wrote:
> When running the application with lots of crawling threads, netstat shows
> only 2k tcp connections in status ESTABLISHED. Is this expected considering
> maxTotal = 50000? Are there other bottlenecks (OS level, etc.) preventing
> the application from reaching more than 2k tcp connections?
> 
> Thanks.
> 
> 

I personally think this is to be expected. When running performance
stress tests with 200 threads and a limit of 200 max connections, I
frequently observe HttpClient using significantly fewer connections (~100)
and never reaching the max limit.

Oleg   
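
One plausible contributing factor (not stated in the reply above) is that
with blocking I/O a worker thread holds at most one connection at a time,
and holds none while it is doing other work between requests. The number of
leased connections is therefore bounded by the number of threads, not by
maxTotal. The sketch below is a toy model using only JDK classes, not
HttpClient itself; the class name, the timings, and the numbers are all
hypothetical.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Semaphore;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

public class PoolOccupancyModel {

    /** Sleeps for the given time, restoring the interrupt flag if interrupted. */
    private static void pause(long millis) {
        try {
            Thread.sleep(millis);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }

    /**
     * Runs `threads` workers, each doing `requestsPerThread` simulated requests
     * against a pool capped at `maxTotal`, and returns the largest number of
     * connections that were leased at the same moment.
     */
    public static int peakLeased(int threads, int maxTotal, int requestsPerThread) {
        final Semaphore pool = new Semaphore(maxTotal); // stands in for the pool limit
        final AtomicInteger leased = new AtomicInteger();
        final AtomicInteger peak = new AtomicInteger();

        ExecutorService workers = Executors.newFixedThreadPool(threads);
        for (int t = 0; t < threads; t++) {
            workers.execute(() -> {
                for (int i = 0; i < requestsPerThread; i++) {
                    pool.acquireUninterruptibly();      // lease a connection
                    try {
                        peak.accumulateAndGet(leased.incrementAndGet(), Math::max);
                        pause(1);                       // simulated network wait
                    } finally {
                        leased.decrementAndGet();
                        pool.release();                 // return it to the pool
                    }
                    pause(1);                           // simulated processing, no connection held
                }
            });
        }
        workers.shutdown();
        try {
            workers.awaitTermination(1, TimeUnit.MINUTES);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return peak.get();
    }

    public static void main(String[] args) {
        // 200 threads against a limit of 50000: the peak cannot exceed 200.
        System.out.println("peak leased = " + peakLeased(200, 50000, 20));
    }
}
```

In this model, raising maxTotal beyond the thread count changes nothing;
only more threads, or less time spent outside requests, raises occupancy.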


