Posted to httpclient-users@hc.apache.org by Jens Mueller <su...@googlemail.com> on 2010/01/27 20:42:21 UTC

Best-Practices for Multithreaded use of HttpClient (with Cookies)?

Hello HC Experts,

I would be very grateful for advice regarding my question. I have already
spent a lot of time searching the internet, but I still have not found an
example that answers my questions. There are lots of examples available (also
for the multithreaded use cases), but they only address the use case of making
one(!!) request. I am completely uncertain how to "best" make a series of
requests (to the same webserver).

I need to develop a simple crawler that crawls some websites for specific
information. The basic idea is to download the individual webpages of a website
(for example www.a.com) sequentially, but to run several of these "sequential"
downloaders in threads for different websites (www.b.com and www.c.com) in
parallel.

My current concept/implementation looks like this:

1.  Instantiate a ThreadSafeClientConnManager (with mostly default
parameters). This connection manager will be used/shared by all
DefaultHttpClient instances.
2.  For every webpage (of a website with multiple webpages), I instantiate
a new DefaultHttpClient for every(!!) webpage request and then call its
httpClient.execute(httpGet) method with the instantiated HttpGet(url).

==> I am more and more wondering whether this is the correct usage of
DefaultHttpClient and the .execute() method. Am I doing something wrong
here by instantiating a new DefaultHttpClient for every request of a webpage?
Or should I rather instantiate only one(!!) DefaultHttpClient and then share
it across the sequential .execute() calls?
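
To make this concrete, here is a simplified sketch of the approach
(HttpClient 4.0 API; all names are illustrative, and the classes come from
the org.apache.http packages):

    // Created once at startup and shared by all downloader threads:
    HttpParams params = new BasicHttpParams();
    SchemeRegistry schemeRegistry = new SchemeRegistry();
    schemeRegistry.register(new Scheme("http", PlainSocketFactory.getSocketFactory(), 80));
    ClientConnectionManager connMgr = new ThreadSafeClientConnManager(params, schemeRegistry);

    // Then, per webpage request (this is the part I am unsure about):
    HttpClient httpClient = new DefaultHttpClient(connMgr, params);
    HttpResponse response = httpClient.execute(new HttpGet("http://www.a.com/page1.html"));
    String body = EntityUtils.toString(response.getEntity()); // consuming the entity releases the connection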

To be honest, what I also have not really understood yet is the cookie
management. Do I, as the programmer, have to instantiate the CookieStore
manually:
1. httpClient.setCookieStore(new BasicCookieStore());
and then, after calling the .execute() method, "get" the cookie store:
2. savedCookies = httpClient.getCookieStore();
and then reinject this cookie store for the next call to the same webpage (to
maintain state)?
3. httpClient.setCookieStore(savedCookies);
Or is there some implicit magic that A) creates the cookie store
implicitly and B) somehow shares this CookieStore among the HttpClients
and/or HttpGets?

Thank you very much!!
Jens

Re: Best-Practices for Multithreaded use of HttpClient (with Cookies)?

Posted by Oleg Kalnichevski <ol...@apache.org>.
On Sat, 2010-01-30 at 23:05 +0100, Jens Mueller wrote:
> Hello Oleg, hello Ken, hello Sam,
> 
> thank your very much for your help!!!
> 
> Please allow me to ask one further question. If DefaultHttpClient
> is used on a "website basis" (that is, I create a new instance of
> DefaultHttpClient for downloading a specific website (www.a.com) and then
> create a new DefaultHttpClient for a second website (www.b.com)), and
> DefaultHttpClient is used with the ThreadSafeClientConnManager, do I have to
> somehow explicitly shut down the DefaultHttpClient? (The JavaDoc states
> that when DefaultHttpClient is used with NO explicitly set connection
> manager, then getConnectionManager().shutdown() should be called, as it
> implicitly creates a SingleClientConnManager.) But is my assumption correct
> that when I use the TSCCM (with DefaultHttpClient), I do not have to do
> anything at all to avoid leaking resources (when I no longer require the
> DefaultHttpClient instance)? It seems that HttpClient is a very heavyweight
> object, and maybe there are other resources I have to manually
> "free/shutdown"?

You ought to be using a single instance of DefaultHttpClient /
ThreadSafeClientConnManager per distinct HTTP service (say, one for web
service communication, one for web crawling, and so on). There is no
reason to create one for each and every target host.
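
In outline (connMgr and params being the ThreadSafeClientConnManager and
HttpParams you already create at startup):

    // Created once, at application startup:
    HttpClient crawlerClient = new DefaultHttpClient(connMgr, params);

    // Shared by every downloader thread, for every website, for the
    // whole lifetime of the application. On shutdown, release the
    // pooled connections exactly once:
    crawlerClient.getConnectionManager().shutdown();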


> 
> (I very much appreciate your help, and I have started to refactor my
> application. However, I then realized that I have the requirement to use a
> dedicated UserAgent for every website I crawl. Using a "shared
> DefaultHttpClient" (one instance for the whole application) with dedicated
> HttpContexts per website/thread doesn't work, as I sadly can't set the
> UserAgent at the HttpContext level.

This is correct, but you can always add a custom protocol interceptor
that overrides the default User-Agent header based on an attribute of
the actual HTTP context, such as the name of the target host or some
other custom value.

http://hc.apache.org/httpcomponents-client/tutorial/html/fundamentals.html#protocol_interceptors
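
A sketch of such an interceptor ("crawler.user-agent" is just an example
attribute name, not an HttpClient constant):

    import java.io.IOException;
    import org.apache.http.HttpException;
    import org.apache.http.HttpRequest;
    import org.apache.http.HttpRequestInterceptor;
    import org.apache.http.protocol.HttpContext;

    public class PerContextUserAgent implements HttpRequestInterceptor {

        public void process(HttpRequest request, HttpContext context)
                throws HttpException, IOException {
            Object userAgent = context.getAttribute("crawler.user-agent");
            if (userAgent != null) {
                // Interceptors registered via addRequestInterceptor() run after
                // the built-in ones, so this replaces the default User-Agent.
                request.setHeader("User-Agent", userAgent.toString());
            }
        }
    }

Register it once with httpClient.addRequestInterceptor(new PerContextUserAgent())
and set the attribute on each thread's local context before executing requests.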

>  The UserAgent only seems to be settable
> at the HttpClient or HttpMethod level. I don't know, would it be a
> reasonable feature request/suggestion to also allow HttpParams to be set at
> the HttpContext level, which would then take precedence over all other
> (already specified) parameters?

An additional lookup for each and every parameter would have a negative
impact on performance. You should use a custom protocol interceptor to
override parameters that are relevant for your application.

Hope this helps

Oleg




Re: Best-Practices for Multithreaded use of HttpClient (with Cookies)?

Posted by "Jens Mueller supidupi007@googlemail.com" <su...@googlemail.com>.
Hello Oleg, hello Ken, hello Sam,

thank your very much for your help!!!

Please allow me to ask one further question. If DefaultHttpClient
is used on a "website basis" (that is, I create a new instance of
DefaultHttpClient for downloading a specific website (www.a.com) and then
create a new DefaultHttpClient for a second website (www.b.com)), and
DefaultHttpClient is used with the ThreadSafeClientConnManager, do I have to
somehow explicitly shut down the DefaultHttpClient? (The JavaDoc states
that when DefaultHttpClient is used with NO explicitly set connection
manager, then getConnectionManager().shutdown() should be called, as it
implicitly creates a SingleClientConnManager.) But is my assumption correct
that when I use the TSCCM (with DefaultHttpClient), I do not have to do
anything at all to avoid leaking resources (when I no longer require the
DefaultHttpClient instance)? It seems that HttpClient is a very heavyweight
object, and maybe there are other resources I have to manually
"free/shutdown"?

(I very much appreciate your help, and I have started to refactor my
application. However, I then realized that I have the requirement to use a
dedicated UserAgent for every website I crawl. Using a "shared
DefaultHttpClient" (one instance for the whole application) with dedicated
HttpContexts per website/thread doesn't work, as I sadly can't set the
UserAgent at the HttpContext level. The UserAgent only seems to be settable
at the HttpClient or HttpMethod level. I don't know, would it be a
reasonable feature request/suggestion to also allow HttpParams to be set at
the HttpContext level, which would then take precedence over all other
(already specified) parameters?

Thank you very much!
Jens







2010/1/28 Oleg Kalnichevski <ol...@apache.org>

> On Wed, 2010-01-27 at 20:42 +0100, Jens Mueller wrote:
> > [...]
>
> Jens,
>
> Re-use the HttpClient instance for all execution threads, but create a
> separate HttpContext and CookieStore per thread of execution /
> individual user, as described by Ken.
>
> Oleg

Re: Best-Practices for Multithreaded use of HttpClient (with Cookies)?

Posted by Oleg Kalnichevski <ol...@apache.org>.
On Wed, 2010-01-27 at 20:42 +0100, Jens Mueller wrote:
> [...]
> Or should I rather instantiate only one(!!) DefaultHttpClient and then share
> it across the sequential .execute() calls?
> [...]

Jens,

Re-use the HttpClient instance for all execution threads, but create a
separate HttpContext and CookieStore per thread of execution /
individual user, as described by Ken.
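
In outline (sharedHttpClient and urlsForThisSite are placeholders, not
API names):

    // Per worker thread: one context with its own cookie store, reused
    // for the whole sequence of requests to one website, so cookies
    // persist across requests but stay isolated from other threads.
    HttpContext localContext = new BasicHttpContext();
    localContext.setAttribute(ClientContext.COOKIE_STORE, new BasicCookieStore());

    for (String url : urlsForThisSite) {
        HttpResponse response = sharedHttpClient.execute(new HttpGet(url), localContext);
        String body = EntityUtils.toString(response.getEntity()); // also releases the connection
        // ... process body ...
    }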

Oleg




Re: Best-Practices for Multithreaded use of HttpClient (with Cookies)?

Posted by Sam Crawford <sa...@gmail.com>.
Ah yes, that makes sense.

In my scenario I'm using HttpClient as the client side of a reverse
proxy, and therefore can't use a single context per server (we have
multiple users accessing backend servers simultaneously, so their
cookies would get all mixed up).
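
If the per-context approach does hold up, I'd expect the proxy side to
look roughly like this (sessionId is just an illustrative key for an end
user; requests for the same user are assumed to be serialized, since
BasicHttpContext is not synchronized):

    import java.util.concurrent.ConcurrentHashMap;
    import org.apache.http.client.protocol.ClientContext;
    import org.apache.http.impl.client.BasicCookieStore;
    import org.apache.http.protocol.BasicHttpContext;
    import org.apache.http.protocol.HttpContext;

    // One HttpContext (and cookie store) per proxy user, all sharing one HttpClient.
    public class ContextRegistry {
        private final ConcurrentHashMap<String, HttpContext> contexts =
                new ConcurrentHashMap<String, HttpContext>();

        public HttpContext contextFor(String sessionId) {
            HttpContext ctx = contexts.get(sessionId);
            if (ctx == null) {
                HttpContext fresh = new BasicHttpContext();
                fresh.setAttribute(ClientContext.COOKIE_STORE, new BasicCookieStore());
                HttpContext prev = contexts.putIfAbsent(sessionId, fresh);
                ctx = (prev != null) ? prev : fresh;
            }
            return ctx;
        }
    }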

Thanks,

Sam


2010/1/27 Ken Krugler <kk...@transpac.com>:
> You can create a local context and use that for all requests to the same
> server. This then lets you re-use the same HttpClient, which is how you want
> to handle this (versus creating new instances for each domain).
>
> For example, in Bixo's SimpleHttpFetcher there's this code:
>
>            getter = new HttpGet(new URI(url));
>
>            // Create a local instance of cookie store, and bind to local context.
>            // Without this we get killed w/ lots of threads, due to sync() on a
>            // single cookie store.
>            HttpContext localContext = new BasicHttpContext();
>            CookieStore cookieStore = new BasicCookieStore();
>            localContext.setAttribute(ClientContext.COOKIE_STORE, cookieStore);
>            response = _httpClient.execute(getter, localContext);
>
> The call to execute the GET request uses the localContext, which is what I
> think Jens wants.
>
> -- Ken
>
>
> On Jan 27, 2010, at 3:22pm, Sam Crawford wrote:
>
>> [...]



Re: Best-Practices for Multithreaded use of HttpClient (with Cookies)?

Posted by Ken Krugler <kk...@transpac.com>.
You can create a local context and use that for all requests to the same
server. This then lets you re-use the same HttpClient, which is how you
want to handle this (versus creating new instances for each domain).

For example, in Bixo's SimpleHttpFetcher there's this code:

             getter = new HttpGet(new URI(url));

             // Create a local instance of cookie store, and bind to local context.
             // Without this we get killed w/ lots of threads, due to sync() on a
             // single cookie store.
             HttpContext localContext = new BasicHttpContext();
             CookieStore cookieStore = new BasicCookieStore();
             localContext.setAttribute(ClientContext.COOKIE_STORE, cookieStore);
             response = _httpClient.execute(getter, localContext);

The call to execute the GET request uses the localContext, which is what I
think Jens wants.

-- Ken


On Jan 27, 2010, at 3:22pm, Sam Crawford wrote:

> I could well be mistaken, but my experience suggests that with version
> 4.0 you need a new HttpClient each time you deal with a different set
> of cookies. Creating multiple HttpContexts used across a single
> DefaultHttpClient instance did not seem to be sufficient.
>
> That said, I only tried this briefly and didn't spend a huge amount of
> time investigating it. I keep meaning to do so and to submit a bug if
> I find a genuinely reproducible issue.
>
> Thanks,
>
> Sam
>
>
> 2010/1/27 Jens Mueller <supidupi007@googlemail.com>:
>> [...]
>

--------------------------------------------
Ken Krugler
+1 530-210-6378
http://bixolabs.com
e l a s t i c   w e b   m i n i n g





Re: Best-Practices for Multithreaded use of HttpClient (with Cookies)?

Posted by Sam Crawford <sa...@gmail.com>.
I could well be mistaken, but my experience suggests that with version
4.0 you need a new HttpClient each time you deal with a different set
of cookies. Creating multiple HttpContexts used across a single
DefaultHttpClient instance did not seem to be sufficient.

That said, I only tried this briefly and didn't spend a huge amount of
time investigating it. I keep meaning to do so and to submit a bug if
I find a genuinely reproducible issue.

Thanks,

Sam


2010/1/27 Jens Mueller <su...@googlemail.com>:
> [...]

---------------------------------------------------------------------
To unsubscribe, e-mail: httpclient-users-unsubscribe@hc.apache.org
For additional commands, e-mail: httpclient-users-help@hc.apache.org