Posted to httpclient-users@hc.apache.org by Mugoma Joseph Okomba <mu...@yengas.com> on 2012/05/29 02:26:06 UTC

Getting a compressed version of web page

Hello,

I have set up a URL download:
HttpGet request = new HttpGet(url);
request.addHeader("Accept-Encoding", "gzip,deflate");

response.getFirstHeader("Content-Encoding") shows "Content-Encoding: gzip"

However, entity.getContentEncoding() is null.

If I put:
entity = new GzipDecompressingEntity(entity);

I get:
java.io.IOException: Not in GZIP format

It looks like the resulting page is plain text and not compressed, even
though the "Content-Encoding" header says it's gzip.

I have tried this on several URLs (from different websites) but get the
same results.

How can I get a compressed version of a web page? I am using HC 4.1.

Thanks in advance.

Mugoma Joseph.
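
A guarded way to do the decompression (a minimal sketch modeled on the
"custom protocol interceptors" example from the HttpClient docs that is
referenced later in this thread, not code posted by anyone here) is to
wrap the entity only when it actually reports gzip:

import org.apache.http.Header;
import org.apache.http.HeaderElement;
import org.apache.http.HttpEntity;
import org.apache.http.client.entity.GzipDecompressingEntity;

HttpEntity entity = response.getEntity();
Header ceHeader = entity.getContentEncoding();
if (ceHeader != null) {
    for (HeaderElement codec : ceHeader.getElements()) {
        // Wrap only when the entity itself reports gzip; wrapping a
        // plain-text body is what raises "Not in GZIP format".
        if (codec.getName().equalsIgnoreCase("gzip")) {
            entity = new GzipDecompressingEntity(entity);
            break;
        }
    }
}

With this guard, a server that advertises gzip but actually delivers
plain text simply yields the unwrapped entity instead of an IOException.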






Re: java.net.URISyntaxException: Illegal character in query

Posted by Ken Krugler <kk...@transpac.com>.
On Jun 4, 2012, at 5:41pm, Mugoma Joseph Okomba wrote:

> Hello,
> 
> While trying to use HttpClient 4.2 to download a page I am getting:
> 
> java.net.URISyntaxException: Illegal character in query at index 85:
> http://www.target.com/webapp/wcs/stores/servlet/s?searchTerm=Jared+Diamond&category=0|All|matchallany|all+categories
> 
> 
> On HttpClient 3.x I get a similar error:
> 
> java.lang.IllegalArgumentException: Invalid uri
> 'http://www.target.com/webapp/wcs/stores/servlet/s?searchTerm=Jared+Diamond&category=0|All|matchallany|all+categories':
> Invalid query
> 
> 
> However, downloading with plain java.net classes causes no error:
> 
> URL getURL = new URL(url);
> HttpURLConnection huc = (HttpURLConnection) getURL.openConnection();
> huc.setRequestMethod("GET");
> InputStream inps = null;
> try {
>     huc.connect();
>     inps = huc.getInputStream();
> } catch (IOException e) {
>     e.printStackTrace();
> }
> 
> 
> The URL is valid and accessible. How can one make HttpClient resolve such
> a URL?

This issue has come up on occasion in the past: the java.net.URI class is more restrictive than the URL class, most browsers, and most DNS software.

In your case it's failing because '|' (vertical bar) is not considered a valid character by Java's URI class (which is used internally by HttpClient), but it is OK for a URL. Which always struck me as odd, since most people talk about URLs being a subset of URIs :)

Going back in time, RFC 1630 (T. Berners-Lee, CERN 1994) classifies the vertical bar (called "vline" in the spec) as a "national" character:

  national               { | } | vline | [ | ] | \ | ^ | ~

And then says:

  The "national" and "punctuation" characters do not appear in any
  productions and therefore may not appear in URIs.

So technically speaking, the URI class is doing the right thing.

You'll run into a similar issue with subdomains that start with '-' (a hyphen is legal inside a hostname label, but not at the start), e.g. -angelcries.blogspot.com can be used to construct a URL, but not a URI.

Because DNS software & browsers are permissive, you'll find a number of these cases where web pages can't be fetched using HttpClient.
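
One workaround (a sketch, assuming you can pre-process the URL string
before handing it to HttpClient) is to percent-encode the offending
characters; '|' becomes %7C:

import java.net.URI;

String raw = "http://www.target.com/webapp/wcs/stores/servlet/s"
        + "?searchTerm=Jared+Diamond&category=0|All|matchallany|all+categories";

// new URI(raw) throws URISyntaxException at the first '|'.
// Percent-encoding the bars yields a legal URI with the same meaning,
// because the server decodes %7C back to '|'.
URI uri = new URI(raw.replace("|", "%7C"));

Note this only helps for characters in the query string; it won't rescue
the leading-hyphen hostname case.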

-- Ken

--------------------------------------------
http://about.me/kkrugler
+1 530-210-6378




--------------------------
Ken Krugler
http://www.scaleunlimited.com
custom big data solutions & training
Hadoop, Cascading, Mahout & Solr





java.net.URISyntaxException: Illegal character in query

Posted by Mugoma Joseph Okomba <mu...@yengas.com>.
Hello,

While trying to use HttpClient 4.2 to download a page I am getting:

java.net.URISyntaxException: Illegal character in query at index 85:
http://www.target.com/webapp/wcs/stores/servlet/s?searchTerm=Jared+Diamond&category=0|All|matchallany|all+categories


On HttpClient 3.x I get a similar error:

java.lang.IllegalArgumentException: Invalid uri
'http://www.target.com/webapp/wcs/stores/servlet/s?searchTerm=Jared+Diamond&category=0|All|matchallany|all+categories':
Invalid query


However, downloading with plain java.net classes causes no error:

URL getURL = new URL(url);
HttpURLConnection huc = (HttpURLConnection) getURL.openConnection();
huc.setRequestMethod("GET");
InputStream inps = null;
try {
    huc.connect();
    inps = huc.getInputStream();
} catch (IOException e) {
    e.printStackTrace();
}


The URL is valid and accessible. How can one make HttpClient resolve such
a URL?

Thanks.

Mugoma Joseph.







Getting resolved URL for a redirected link

Posted by Mugoma Joseph Okomba <mu...@yengas.com>.
Hello,

If a link is redirected in the process of downloading, how does one get
the final URL from the response?

Thanks.

Mugoma Joseph.
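
A common approach with HttpClient 4.x (a sketch built on the standard
ExecutionContext attributes, not a reply from this thread; `url` stands
for the original link) is to execute the request with an HttpContext and
read the final location back out of it:

import org.apache.http.HttpHost;
import org.apache.http.HttpResponse;
import org.apache.http.client.methods.HttpGet;
import org.apache.http.client.methods.HttpUriRequest;
import org.apache.http.impl.client.DefaultHttpClient;
import org.apache.http.protocol.BasicHttpContext;
import org.apache.http.protocol.ExecutionContext;
import org.apache.http.protocol.HttpContext;

DefaultHttpClient client = new DefaultHttpClient();
HttpContext context = new BasicHttpContext();
HttpResponse response = client.execute(new HttpGet(url), context);

// After any redirects, the context holds the last request actually sent
// and the host it was sent to.
HttpUriRequest finalReq =
        (HttpUriRequest) context.getAttribute(ExecutionContext.HTTP_REQUEST);
HttpHost finalHost =
        (HttpHost) context.getAttribute(ExecutionContext.HTTP_TARGET_HOST);
String finalUrl = finalReq.getURI().isAbsolute()
        ? finalReq.getURI().toString()
        : finalHost.toURI() + finalReq.getURI();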




Re: Getting a compressed version of web page

Posted by Oleg Kalnichevski <ol...@apache.org>.
On Thu, 2012-05-31 at 02:07 +0300, Mugoma Joseph Okomba wrote:
> Thanks for the suggestions. I moved to HC 4.2 and tested several links and
> compression appears to work. The only time compression fails is when the
> URL redirects.
> 
> e.g.
> 
> a) curl -v -H "Accept-Encoding: gzip" \
>    http://timesofindia.indiatimes.com/nri/us-canada-news/Indian-diplomats-daughter-files-lawsuit-seeks-1-5-million-in-damages/articleshow/13047591.cms \
>    > test-it.gzip
> 
> vs
> 
> b) curl -v -H "Accept-Encoding: gzip" \
>    http://articles.timesofindia.indiatimes.com/2012-05-08/us-canada-news/31626225_1_obscene-emails-lawsuit-krittika-biswas \
>    > test-it-b.gzip
> 
> 
> Note that the link in a) redirects to the link in b). Only the final URL
> (in b) benefits from compression.
> 
> Is it possible to make compression apply even for redirected URLs?
> 
> Thanks.
> 
> Joseph.
> 

Hi Joseph 

This is a bug in HttpClient 4.2

https://issues.apache.org/jira/browse/HTTPCLIENT-1199

While we are working on the fix, you can make DefaultHttpClient
capable of handling compressed responses transparently by explicitly
adding two additional protocol interceptors:

DefaultHttpClient client = new DefaultHttpClient();
client.addRequestInterceptor(new RequestAcceptEncoding());
client.addResponseInterceptor(new ResponseContentEncoding());

This is all it actually takes.
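
For completeness, a minimal end-to-end sketch of that setup (the URL is
a placeholder):

import org.apache.http.HttpResponse;
import org.apache.http.client.methods.HttpGet;
import org.apache.http.client.protocol.RequestAcceptEncoding;
import org.apache.http.client.protocol.ResponseContentEncoding;
import org.apache.http.impl.client.DefaultHttpClient;
import org.apache.http.util.EntityUtils;

DefaultHttpClient client = new DefaultHttpClient();
client.addRequestInterceptor(new RequestAcceptEncoding());
client.addResponseInterceptor(new ResponseContentEncoding());

// RequestAcceptEncoding adds the Accept-Encoding header on the way out;
// ResponseContentEncoding swaps in a decompressing entity on the way in,
// so the body read here is already decoded.
HttpResponse response = client.execute(new HttpGet("http://example.com/"));
String body = EntityUtils.toString(response.getEntity());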

Oleg






Re: Getting a compressed version of web page

Posted by Mugoma Joseph Okomba <mu...@yengas.com>.
Thanks for the suggestions. I moved to HC 4.2 and tested several links and
compression appears to work. The only time compression fails is when the
URL redirects.

e.g.

a) curl -v -H "Accept-Encoding: gzip" \
   http://timesofindia.indiatimes.com/nri/us-canada-news/Indian-diplomats-daughter-files-lawsuit-seeks-1-5-million-in-damages/articleshow/13047591.cms \
   > test-it.gzip

vs

b) curl -v -H "Accept-Encoding: gzip" \
   http://articles.timesofindia.indiatimes.com/2012-05-08/us-canada-news/31626225_1_obscene-emails-lawsuit-krittika-biswas \
   > test-it-b.gzip


Note that the link in a) redirects to the link in b). Only the final URL
(in b) benefits from compression.

Is it possible to make compression apply even for redirected URLs?

Thanks.

Joseph.

On Tue, May 29, 2012 4:39 pm, Sam Crawford wrote:
> A couple of suggestions:
>
> 1. Confirm with cURL that the website is definitely providing gzip'd
> data (you should output the content to a file, and then open it with a
> text editor):
> curl -v -H "Accept-Encoding: gzip" http://foo.bar.com > output.txt
> 2. Consider using the example provided at
> http://hc.apache.org/httpcomponents-client-ga/examples.html under
> "Custom protocol interceptors". This will handle gzip for you
> transparently.
>
> Thanks,
>
> Sam




Re: Getting a compressed version of web page

Posted by Sam Crawford <sa...@gmail.com>.
A couple of suggestions:

1. Confirm with cURL that the website is definitely providing gzip'd
data (you should output the content to a file, and then open it with a
text editor):
curl -v -H "Accept-Encoding: gzip" http://foo.bar.com > output.txt
2. Consider using the example provided at
http://hc.apache.org/httpcomponents-client-ga/examples.html under
"Custom protocol interceptors". This will handle gzip for you
transparently.

Thanks,

Sam


On 29 May 2012 12:18, Mugoma Joseph Okomba <mu...@yengas.com> wrote:
>
> I have no control over the websites, so I can't tell if they're set up to
> send compressed pages. But I was assuming that:
> 1. From a sample of, say, 20 sites, at least 1 will give a compressed page
> 2. If a site doesn't have a compressed version, then the returned encoding
> shouldn't be gzip
>
>
> On Tue, May 29, 2012 12:41 pm, William Speirs wrote:
>> Is the site set up to send a compressed version?
>>
>> Bill-
>
>
>
>

---------------------------------------------------------------------
To unsubscribe, e-mail: httpclient-users-unsubscribe@hc.apache.org
For additional commands, e-mail: httpclient-users-help@hc.apache.org


Re: Getting a compressed version of web page

Posted by Mugoma Joseph Okomba <mu...@yengas.com>.
I have no control over the websites, so I can't tell if they're set up to
send compressed pages. But I was assuming that:
1. From a sample of, say, 20 sites, at least 1 will give a compressed page
2. If a site doesn't have a compressed version, then the returned encoding
shouldn't be gzip


On Tue, May 29, 2012 12:41 pm, William Speirs wrote:
> Is the site set up to send a compressed version?
>
> Bill-





Re: Getting a compressed version of web page

Posted by William Speirs <ws...@apache.org>.
Is the site set up to send a compressed version?

Bill-
On May 28, 2012 8:26 PM, "Mugoma Joseph Okomba" <mu...@yengas.com> wrote:

> Hello,
>
> I have set up a URL download:
> HttpGet request = new HttpGet(url);
> request.addHeader("Accept-Encoding", "gzip,deflate");
>
> response.getFirstHeader("Content-Encoding") shows "Content-Encoding: gzip"
>
> However, entity.getContentEncoding() is null.
>
> If I put:
> entity = new GzipDecompressingEntity(entity);
>
> I get:
> java.io.IOException: Not in GZIP format
>
> It looks like the resulting page is plain text and not compressed, even
> though the "Content-Encoding" header says it's gzip.
>
> I have tried this on several URLs (from different websites) but get the
> same results.
>
> How can I get a compressed version of a web page? I am using HC 4.1.
>
> Thanks in advance.
>
> Mugoma Joseph.
>