Posted to httpclient-users@hc.apache.org by Mugoma Joseph Okomba <mu...@yengas.com> on 2012/05/04 13:28:26 UTC

Migrating from Commons HttpClient (3.x) to HttpComponents Client (4.x)

Hello,

I would like to migrate from HttpClient 3.x to HttpClient 4.x but am having
difficulty handling redirects. The code works properly under Commons
HttpClient but breaks when migrated to HttpComponents Client. Some of the
links get undesirable redirects, but when I set
"http.protocol.handle-redirects" to 'false' I get no result at all for
some of the links.


Commons HttpClient 3.x code:

	private static HttpClient httpClient = null;
	private static MultiThreadedHttpConnectionManager connectionManager = null;
	private static final long MAX_CONNECTION_IDLE_TIME = 60000; // milliseconds

	static {
		//HttpURLConnection.setFollowRedirects(true);
		CookieManager manager = new CookieManager();
		manager.setCookiePolicy(CookiePolicy.ACCEPT_ALL);
		CookieHandler.setDefault(manager);

		connectionManager = new MultiThreadedHttpConnectionManager();
		connectionManager.getParams().setDefaultMaxConnectionsPerHost(1000); // will need to set from properties file
		connectionManager.getParams().setMaxTotalConnections(1000);
		httpClient = new HttpClient(connectionManager);
	}




	/*
	* Retrieve HTML
	*/
	public String fetchURL(String url) throws IOException{

		if ( StringUtils.isEmpty(url) )
			return null;

		GetMethod getMethod = new GetMethod(url);
		//HttpClient httpClient = new HttpClient();
		//configureMethod(getMethod);
		//ObjectInputStream oin = null;
		InputStream in = null;
		int code = -1;
		String html = "";
		String lastModified = null;
		try {
		  code = httpClient.executeMethod(getMethod);

		  in = getMethod.getResponseBodyAsStream();
			//oin = new ObjectInputStream(in);
			//html = getMethod.getResponseBodyAsString();
			html = CharStreams.toString(new InputStreamReader(in));

		}


		catch (Exception except) {
		}
		finally {

		  try {
		  	//oin.close();
		  	in.close();
		  }
		  catch (Exception except) {}

		  getMethod.releaseConnection();
		  connectionManager.closeIdleConnections(MAX_CONNECTION_IDLE_TIME);
		}

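		// Status codes of 400 and above are client or server errors.
		if (code < 400){
			return html.replaceAll("\\s+", " ");
		} else {
			// IOException matches the method's throws clause.
			throw new IOException("URL: " + url + " returned response code " + code);
		}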

	}




HttpComponents Client 4.x code:


	private static HttpClient httpClient = null;
	private static HttpParams params = null;
	//private static MultiThreadedHttpConnectionManager connectionManager = null;
	private static ThreadSafeClientConnManager connectionManager = null;
	private static final int MAX_CONNECTION_IDLE_TIME = 60000; // milliseconds


	static {
		//HttpURLConnection.setFollowRedirects(true);
		CookieManager manager = new CookieManager();
		manager.setCookiePolicy(CookiePolicy.ACCEPT_ALL);
		CookieHandler.setDefault(manager);


    connectionManager = new ThreadSafeClientConnManager();
    connectionManager.setDefaultMaxPerRoute(1000); // will need to set from properties file
    connectionManager.setMaxTotal(1000);
    httpClient = new DefaultHttpClient(connectionManager);



		// HTTP parameters hold per-request protocol settings such as redirect handling.
		params = new BasicHttpParams();
		params.setParameter("http.protocol.handle-redirects",false);

	}




	/*
	* Retrieve HTML
	*/
	public String fetchURL(String url) throws IOException{

		if ( StringUtils.isEmpty(url) )
			return null;

		InputStream in = null;
		//int code = -1;
		String html = "";

	 // Prepare a request object
	 HttpGet httpget = new HttpGet(url);
	httpget.setParams(params);

	 // Execute the request
	 HttpResponse response = httpClient.execute(httpget);

	 // The response status
	 //System.out.println(response.getStatusLine());
	int code = response.getStatusLine().getStatusCode();

	 // Get hold of the response entity
	 HttpEntity entity = response.getEntity();

	 // If the response does not enclose an entity, there is no need
	 // to worry about connection release
	 if (entity != null) {

			try {
				//code = httpClient.executeMethod(getMethod);

				//in = getMethod.getResponseBodyAsStream();
				in = entity.getContent();
				html = CharStreams.toString(new InputStreamReader(in));

			}


			catch (Exception except) {
				// Wrap in IOException so the throw matches the method's throws clause.
				throw new IOException("URL: " + url + " returned response code " + code, except);
			}
			finally {

				try {
					//oin.close();
					in.close();
				}
				catch (Exception except) {}

				//getMethod.releaseConnection();
				connectionManager.closeIdleConnections(MAX_CONNECTION_IDLE_TIME, TimeUnit.MILLISECONDS);
				connectionManager.closeExpiredConnections();
			}

		}

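		// Status codes of 400 and above are client or server errors.
		if (code < 400){
			return html;
		} else {
			// IOException matches the method's throws clause.
			throw new IOException("URL: " + url + " returned response code " + code);
		}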


	}




I won't want redirects but under HttpClient 4.x if I enable redirects then
I get some that are undesirable, e.g.  http://www.walmart.com/ =>
http://mobile.walmart.com/. Under HttpClient 3.x no such redirects
happen.

What do I need to do to migrate HttpClient 3.x to HttpClient 4.x without
breaking the code?


Thanks in advance.

Mugoma.
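
One possible explanation for the empty results with redirects disabled (an
assumption, not something confirmed in this thread): when the client does
not follow redirects, a 3xx response is handed back as-is, and such
responses usually carry a Location header and little or no body. The check
at the end of fetchURL() would then return an empty string for them. A
minimal sketch of classifying the status code before using the body follows;
the class name StatusClass and its methods are hypothetical and are not part
of either HttpClient version.

```java
// Hypothetical helper: classifies an HTTP status code so the caller can
// tell a redirect (body usually empty) apart from a success or an error.
final class StatusClass {
    private StatusClass() {}

    /** True for the 3xx codes that normally carry a Location header. */
    static boolean isRedirect(int code) {
        return code == 301 || code == 302 || code == 303
                || code == 307 || code == 308;
    }

    /** True for 4xx (client error) and 5xx (server error) responses. */
    static boolean isError(int code) {
        return code >= 400 && code <= 599;
    }

    /** Returns a short label for the class of the given status code. */
    static String describe(int code) {
        if (isRedirect(code)) return "redirect";
        if (isError(code)) return "error";
        if (code >= 200 && code < 300) return "success";
        return "other";
    }
}
```

With a helper like this, fetchURL() could log or inspect the Location header
when describe(code) returns "redirect" instead of silently returning an
empty page.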



---------------------------------------------------------------------
To unsubscribe, e-mail: httpclient-users-unsubscribe@hc.apache.org
For additional commands, e-mail: httpclient-users-help@hc.apache.org


Re: Migrating from Commons HttpClient (3.x) to HttpComponents Client (4.x)

Posted by Oleg Kalnichevski <ol...@apache.org>.
On Fri, 2012-05-04 at 20:21 +0300, Mugoma Joseph Okomba wrote:
> >
> > (1) Redirect handling in HC 3.x is utterly and irreparably broken. HC
> > 4.x does a much more reasonable job at handling redirects.
> >
> 
> Redirect handling under HC 3.x might be broken, but it works. At least I
> haven't run into the breakage.
> 
> > (2) There is no such thing as undesirable redirects from the HTTP
> > protocol standpoint. Redirects are either legal, illegal or legal but
> > requiring user intervention. If you want to handle redirects selectively
> > (allowing some redirects but disallowing others) you can always
> > implement a custom RedirectStrategy and configure HttpClient to use it
> > instead of the default one.
> >
> 
> The problem with HC 4.x is that it's being detected as a mobile browser.
> So when it hits a website, the server tries to redirect to a mobile version
> if it has one. This appears to be the anomaly for me.
> 
> Can such behavior be avoided / rectified?
> 

This problem is very likely to have nothing to do with redirects. Some
sites react differently to requests with different 'User-Agent' headers.
Some might be aware of HttpClient being an HTTP library rather than a
browser. Some might not be.

If you configure HC 4.x to masquerade as HC 3.x, the problematic
sites are likely to behave more consistently.

Oleg
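
To illustrate the User-Agent suggestion above, here is a minimal sketch that
sets the header explicitly on a request. It deliberately uses the JDK's
java.net.HttpURLConnection rather than HttpClient so that it has no external
dependencies; the class name UserAgentSketch and the method prepare() are
hypothetical. The string shown is the default User-Agent of the Commons
HttpClient 3.1 release. In HC 4.x the corresponding setting is the
"http.useragent" parameter (CoreProtocolPNames.USER_AGENT).

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.net.HttpURLConnection;
import java.net.URI;

// Hypothetical sketch: prepare a GET request that identifies itself with
// the same User-Agent string that Commons HttpClient 3.1 sends by default.
final class UserAgentSketch {
    static final String HC3_USER_AGENT = "Jakarta Commons-HttpClient/3.1";

    private UserAgentSketch() {}

    /** Prepares (but does not send) a GET request with the given User-Agent. */
    static HttpURLConnection prepare(String url, String userAgent) {
        try {
            // openConnection() only creates the object; no network I/O yet.
            HttpURLConnection conn =
                    (HttpURLConnection) URI.create(url).toURL().openConnection();
            conn.setRequestMethod("GET");
            conn.setRequestProperty("User-Agent", userAgent);
            return conn;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

Whether a given site stops redirecting to its mobile version with this
header is something to verify per site; the thread does not confirm it.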





Re: Migrating from Commons HttpClient (3.x) to HttpComponents Client (4.x)

Posted by Mugoma Joseph Okomba <mu...@yengas.com>.
>
> (1) Redirect handling in HC 3.x is utterly and irreparably broken. HC
> 4.x does a much more reasonable job at handling redirects.
>

Redirect handling under HC 3.x might be broken, but it works. At least I
haven't run into the breakage.

> (2) There is no such thing as undesirable redirects from the HTTP
> protocol standpoint. Redirects are either legal, illegal or legal but
> requiring user intervention. If you want to handle redirects selectively
> (allowing some redirects but disallowing others) you can always
> implement a custom RedirectStrategy and configure HttpClient to use it
> instead of the default one.
>

The problem with HC 4.x is that it's being detected as a mobile browser.
So when it hits a website, the server tries to redirect to a mobile version
if it has one. This appears to be the anomaly for me.

Can such behavior be avoided / rectified?

Thanks.

Mugoma.





Re: HC 4: Excluding images and other types of content

Posted by William Speirs <ws...@apache.org>.
By default if you point HC4 at a web page it will only download the
HTML. You'd have to parse that HTML and extract all the links to get
the images, JavaScript, etc.

Give it a try...

Bill-
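
As a rough illustration of the point above: fetching a page yields only its
HTML, and the images, scripts and stylesheets appear in it merely as URLs
that would each need a separate request. The sketch below lists those URLs
with a regular expression. The class name ResourceLinks and the method
extract() are hypothetical, and a regex is only an approximation; a real
HTML parser would be more robust on messy markup.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch: list the sub-resource URLs referenced by a page.
// Nothing here downloads them; a client that never requests these URLs
// has already "excluded" images, JavaScript and CSS.
final class ResourceLinks {
    // Matches a quoted src=... or href=... attribute inside <img>,
    // <script> and <link> tags and captures the attribute value.
    private static final Pattern RESOURCE = Pattern.compile(
            "<(?:img|script|link)\\b[^>]*?\\b(?:src|href)\\s*=\\s*[\"']([^\"']+)[\"']",
            Pattern.CASE_INSENSITIVE);

    private ResourceLinks() {}

    /** Returns the referenced resource URLs in document order. */
    static List<String> extract(String html) {
        List<String> urls = new ArrayList<>();
        Matcher m = RESOURCE.matcher(html);
        while (m.find()) {
            urls.add(m.group(1));
        }
        return urls;
    }
}
```

Ordinary page links in <a> tags are intentionally not matched, since they
point to other pages rather than to embedded content.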

On Fri, May 11, 2012 at 1:41 PM, Mugoma Joseph Okomba <mu...@yengas.com> wrote:
> Hello,
>
> I am using HC 4 to download a web page. Since I am only interested in the
> text of the web page, I would like to exclude images and other content such
> as JavaScript, CSS, etc.
>
> Is there a way to do this in HttpClient?
>
> Thanks.
>
> Mugoma Joseph.
>



HC 4: Excluding images and other types of content

Posted by Mugoma Joseph Okomba <mu...@yengas.com>.
Hello,

I am using HC 4 to download a web page. Since I am only interested in the
text of the web page, I would like to exclude images and other content such
as JavaScript, CSS, etc.

Is there a way to do this in HttpClient?

Thanks.

Mugoma Joseph.




Re: Migrating from Commons HttpClient (3.x) to HttpComponents Client (4.x)

Posted by Oleg Kalnichevski <ol...@apache.org>.
On Fri, 2012-05-04 at 14:28 +0300, Mugoma Joseph Okomba wrote:
> Hello,
> 
> I would like to migrate from HttpClient 3.x to HttpClient 4.x but am having
> difficulty handling redirects. The code works properly under Commons
> HttpClient but breaks when migrated to HttpComponents Client. Some of the
> links get undesirable redirects, but when I set
> "http.protocol.handle-redirects" to 'false' I get no result at all for
> some of the links.
> 
...
> 
> I won't want redirects but under HttpClient 4.x if I enable redirects then
> I get some that are undesirable, e.g.  http://www.walmart.com/ =>
> http://mobile.walmart.com/. Under HttpClient 3.x no such redirects
> happen.
> 
> What do I need to do to migrate HttpClient 3.x to HttpClient 4.x without
> breaking the code?
> 
> 
> Thanks in advance.
> 
> Mugoma.
> 

(1) Redirect handling in HC 3.x is utterly and irreparably broken. HC
4.x does a much more reasonable job at handling redirects.

(2) There is no such thing as undesirable redirects from the HTTP
protocol standpoint. Redirects are either legal, illegal or legal but
requiring user intervention. If you want to handle redirects selectively
(allowing some redirects but disallowing others) you can always
implement a custom RedirectStrategy and configure HttpClient to use it
instead of the default one.

Oleg  
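
The selective handling described in (2) could look roughly like the
following. This is a sketch of the decision logic only, written in plain
Java with no HttpClient dependency. The class name RedirectFilter, its
methods, and the rule of rejecting hosts that start with "mobile." or "m."
are all hypothetical choices for illustration. In HC 4.1 or later, a check
like this would typically be called from a subclass of
DefaultRedirectStrategy that overrides isRedirected(...) and is installed
with DefaultHttpClient.setRedirectStrategy(...).

```java
import java.net.URI;
import java.util.Locale;

// Hypothetical decision logic for a selective redirect policy.
final class RedirectFilter {
    private RedirectFilter() {}

    /** Resolves a Location header value, which may be relative, against the request URI. */
    static URI resolveTarget(URI requestUri, String location) {
        return requestUri.resolve(location.trim());
    }

    /** Returns false for redirects whose target host looks like a mobile site. */
    static boolean isAllowed(URI requestUri, String location) {
        URI target = resolveTarget(requestUri, location);
        String host = target.getHost();
        if (host == null) {
            return false; // no usable host: refuse rather than guess
        }
        host = host.toLowerCase(Locale.ROOT);
        return !(host.startsWith("mobile.") || host.startsWith("m."));
    }
}
```

For example, isAllowed(URI.create("http://www.walmart.com/"),
"http://mobile.walmart.com/") returns false, while a relative Location on
the same host is allowed. When a strategy declines a redirect, the caller
receives the 3xx response itself and has to decide what to do with it.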


