Posted to httpclient-users@hc.apache.org by Mugoma Joseph Okomba <mu...@yengas.com> on 2012/06/03 05:33:31 UTC

Getting resolved URL for a redirected link

Hello,

If a link is redirected during a download, how does one get the
final URL from the response?

Thanks.

Mugoma Joseph.


---------------------------------------------------------------------
To unsubscribe, e-mail: httpclient-users-unsubscribe@hc.apache.org
For additional commands, e-mail: httpclient-users-help@hc.apache.org


Re: java.net.URISyntaxException: Illegal character in query

Posted by Ken Krugler <kk...@transpac.com>.
On Jun 4, 2012, at 5:41pm, Mugoma Joseph Okomba wrote:

> Hello,
> 
> While trying to use HttpClient 4.2 to download a page I am getting:
> 
> java.net.URISyntaxException: Illegal character in query at index 85:
> http://www.target.com/webapp/wcs/stores/servlet/s?searchTerm=Jared+Diamond&category=0|All|matchallany|all+categories
> 
> 
> On HttpClient 3.x I get a similar error:
> 
> java.lang.IllegalArgumentException: Invalid uri
> 'http://www.target.com/webapp/wcs/stores/servlet/s?searchTerm=Jared+Diamond&category=0|All|matchallany|all+categories':
> Invalid query
> 
> 
> However, using the native Java download causes no error:
> 
> URL getURL = new URL(url);
> HttpURLConnection huc = (HttpURLConnection) getURL.openConnection();
> huc.setRequestMethod("GET");
> InputStream inps = null;
> try {
>     huc.connect();
>     inps = huc.getInputStream();
> } catch (IOException e) {
>     // handle connection or read failure
>     e.printStackTrace();
> }
> 
> 
> The URL is valid and accessible. How can one make HttpClient resolve such
> a URL?

This issue has come up on occasion in the past: the java.net.URI class is more restrictive than the java.net.URL class, most browsers, and most DNS software.

In your case it's failing because '|' (vertical bar) is not considered a valid character by Java's URI class (which is used internally by HttpClient), but it is OK for a URL. Which always struck me as odd, since most people talk about URLs being a subset of URIs :)
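
The difference is easy to reproduce with the JDK alone. Below is a minimal sketch that feeds the address from the report to both classes; the class name UrlVersusUri and its two helper methods are made up for illustration and are not part of HttpClient.

```java
import java.net.MalformedURLException;
import java.net.URI;
import java.net.URISyntaxException;
import java.net.URL;

public class UrlVersusUri {
    // The address from the original report, split only to keep lines short.
    public static final String ADDRESS =
        "http://www.target.com/webapp/wcs/stores/servlet/s"
        + "?searchTerm=Jared+Diamond&category=0|All|matchallany|all+categories";

    /** Returns true if java.net.URL accepts the string. */
    public static boolean parsesAsUrl(String s) {
        try {
            new URL(s);
            return true;
        } catch (MalformedURLException e) {
            return false;
        }
    }

    /** Returns -1 if java.net.URI accepts the string, else the index it rejected. */
    public static int uriErrorIndex(String s) {
        try {
            new URI(s);
            return -1;
        } catch (URISyntaxException e) {
            return e.getIndex();
        }
    }

    public static void main(String[] args) {
        System.out.println("URL accepts it: " + parsesAsUrl(ADDRESS));
        System.out.println("URI rejects at index: " + uriErrorIndex(ADDRESS));
    }
}
```

Running it should show that URL accepts the address while URI rejects it at index 85, the position of the first '|', which matches the exception message in the report.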

Going back in time, RFC 1630 (T. Berners-Lee, CERN, 1994) classifies the vertical bar (called "vline" in the spec) as a "national" character:

  national               { | } | vline | [ | ] | \ | ^ | ~

And then says:

  The "national" and "punctuation" characters do not appear in any productions and therefore may not appear in URIs.

So technically speaking the URI class is doing the right thing.

You'll run into a similar issue with subdomains that begin with '-', e.g. -angelcries.blogspot.com works as the host of a URL, but java.net.URI does not accept it as a valid hostname.
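
A small sketch of how this case shows up in code, using only the JDK. One detail worth hedging: as far as the java.net.URI documentation describes it, an authority that does not parse as a server-based hostname is kept as a registry-based authority, so the constructor itself does not throw here and getHost() returns null instead. The class name HyphenHost and its helpers are hypothetical.

```java
import java.net.MalformedURLException;
import java.net.URI;
import java.net.URISyntaxException;
import java.net.URL;

public class HyphenHost {
    /** Host as seen by the lenient java.net.URL. */
    public static String hostViaUrl(String s) {
        try {
            return new URL(s).getHost();
        } catch (MalformedURLException e) {
            throw new IllegalArgumentException(e);
        }
    }

    /** Host as seen by java.net.URI; null when the hostname is not accepted. */
    public static String hostViaUri(String s) {
        try {
            return new URI(s).getHost();
        } catch (URISyntaxException e) {
            throw new IllegalArgumentException(e);
        }
    }

    public static void main(String[] args) {
        String address = "http://-angelcries.blogspot.com/";
        System.out.println("URL host: " + hostViaUrl(address));
        System.out.println("URI host: " + hostViaUri(address));
    }
}
```

A client that relies on URI.getHost() has no host to connect to in this situation, which is one way such pages end up unfetchable.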

Because DNS software & browsers are permissive, you'll find a number of these cases where web pages can't be fetched using HttpClient.
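
For the illegal-character case, one possible workaround (a sketch under stated assumptions, not an HttpClient feature) is to let the lenient java.net.URL split the address into components and then rebuild it with the multi-argument java.net.URI constructor, which percent-encodes characters that are illegal in each component. The class name LenientUri and the method toUri are hypothetical.

```java
import java.net.MalformedURLException;
import java.net.URI;
import java.net.URISyntaxException;
import java.net.URL;

public class LenientUri {
    /**
     * Parses the string with java.net.URL, then rebuilds it through the
     * five-argument URI constructor so that illegal characters such as '|'
     * are percent-encoded.
     */
    public static URI toUri(String s) {
        try {
            URL url = new URL(s);
            return new URI(url.getProtocol(), url.getAuthority(), url.getPath(),
                           url.getQuery(), url.getRef());
        } catch (MalformedURLException | URISyntaxException e) {
            throw new IllegalArgumentException("Cannot convert to URI: " + s, e);
        }
    }

    public static void main(String[] args) {
        URI uri = toUri("http://www.target.com/webapp/wcs/stores/servlet/s"
            + "?searchTerm=Jared+Diamond&category=0|All|matchallany|all+categories");
        // Each '|' in the query should now appear as %7C.
        System.out.println(uri);
    }
}
```

Two caveats: these constructors always quote '%', so an address that already contains percent-escapes would be escaped a second time, and this does nothing for the hostname case, where the rebuilt URI still has no host.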

-- Ken

--------------------------------------------
http://about.me/kkrugler
+1 530-210-6378




--------------------------
Ken Krugler
http://www.scaleunlimited.com
custom big data solutions & training
Hadoop, Cascading, Mahout & Solr





java.net.URISyntaxException: Illegal character in query

Posted by Mugoma Joseph Okomba <mu...@yengas.com>.
Hello,

While trying to use HttpClient 4.2 to download a page I am getting:

java.net.URISyntaxException: Illegal character in query at index 85:
http://www.target.com/webapp/wcs/stores/servlet/s?searchTerm=Jared+Diamond&category=0|All|matchallany|all+categories


On HttpClient 3.x I get a similar error:

java.lang.IllegalArgumentException: Invalid uri
'http://www.target.com/webapp/wcs/stores/servlet/s?searchTerm=Jared+Diamond&category=0|All|matchallany|all+categories':
Invalid query


However, using the native Java download causes no error:

URL getURL = new URL(url);
HttpURLConnection huc = (HttpURLConnection) getURL.openConnection();
huc.setRequestMethod("GET");
InputStream inps = null;
try {
    huc.connect();
    inps = huc.getInputStream();
} catch (IOException e) {
    // handle connection or read failure
    e.printStackTrace();
}


The URL is valid and accessible. How can one make HttpClient resolve such
a URL?

Thanks.

Mugoma Joseph.




