Posted to httpclient-users@hc.apache.org by Jim <ji...@gmail.com> on 2010/09/08 20:42:12 UTC

Why doesn't httpclient follow redirects on this URL?

I found a URL for which HttpClient doesn't seem to follow redirects:

http://news.google.com/news/url?sa=t&fd=R&usg=AFQjCNGrJk-F7Dmshmtze2yhifxRsv8sRg&url=http://www.mtv.com/news/articles/1647243/20100907/story.jhtml

It should 302 to:
http://www.mtv.com/news/articles/1647243/20100907/story.jhtml

When I look at the headers in the browser, everything looks good:

HTTP/1.1 302 Moved Temporarily
Content-Type: text/html; charset=UTF-8
Location: http://www.mtv.com/news/articles/1647243/20100907/story.jhtml
Content-Length: 258
Date: Wed, 08 Sep 2010 18:40:21 GMT
Expires: Wed, 08 Sep 2010 18:40:21 GMT
Cache-Control: private, max-age=0
X-Content-Type-Options: nosniff
X-Frame-Options: SAMEORIGIN
X-Xss-Protection: 1; mode=block
Server: GSE
Set-Cookie: PREF=ID=024209255b405b06:TM=1283971221:LM=1283971221:S=AG-13_7Cjg_EqlRY; expires=Fri, 07-Sep-2012 18:40:21 GMT; path=/; domain=.google.com
Connection: close

However, HttpClient doesn't seem to give me the final URL. Here is the code I
was using:



HttpHead httpget = null;
HttpHost target = null;
HttpUriRequest req = null;

String startURL = "http://news.google.com/news/url?sa=t&fd=R&usg=AFQjCNGrJk-F7Dmshmtze2yhifxRsv8sRg&url=http://www.mtv.com/news/articles/1647243/20100907/story.jhtml";
HttpContext localContext = new BasicHttpContext();
localContext.setAttribute(ClientContext.COOKIE_STORE, HttpClientFetcher.emptyCookieStore);
httpget = new HttpHead(startURL);

HttpResponse response = httpClient.execute(httpget, localContext);

Header[] test = response.getAllHeaders();
for (Header h : test) {
    logger.info(h.getName() + ": " + h.getValue());
}

target = (HttpHost) localContext.getAttribute(ExecutionContext.HTTP_TARGET_HOST);
req = (HttpUriRequest) localContext.getAttribute(ExecutionContext.HTTP_REQUEST);

// STILL PRINTS OUT THE GOOGLE NEWS LINK
finalURL = target + "" + req.getURI();



Am I doing something wrong? Thanks.

Re: Why doesn't httpclient follow redirects on this URL?

Posted by Jim <ji...@gmail.com>.
Thanks Ken. I just tried that and still get the same result: the final URL is
still not the MTV URL. Any other ideas on what could be wrong?

thanks



On Wed, Sep 8, 2010 at 12:00 PM, Ken Krugler <kk...@transpac.com> wrote:

> I think you need to explicitly get the URI from the host, and then combine
> with the final request - or at least this code below is how I'm doing it
> (and it seems to work), Oleg can probably improve on it...
>
>        HttpHost host = (HttpHost) localContext.getAttribute(ExecutionContext.HTTP_TARGET_HOST);
>        HttpUriRequest finalRequest = (HttpUriRequest) localContext.getAttribute(ExecutionContext.HTTP_REQUEST);
>
>        try {
>            URL hostUrl = new URI(host.toURI()).toURL();
>            return new URL(hostUrl, finalRequest.getURI().toString()).toExternalForm();

Re: Why doesn't httpclient follow redirects on this URL?

Posted by Ken Krugler <kk...@transpac.com>.
On Sep 8, 2010, at 11:42am, Jim wrote:

> Am I doing something wrong? thanks


I think you need to explicitly get the URI from the host, and then
combine it with the final request. At least, the code below is how
I'm doing it (and it seems to work); Oleg can probably improve on it...

         HttpHost host = (HttpHost) localContext.getAttribute(ExecutionContext.HTTP_TARGET_HOST);
         HttpUriRequest finalRequest = (HttpUriRequest) localContext.getAttribute(ExecutionContext.HTTP_REQUEST);

         try {
             // Resolve the (possibly relative) request URI against the target host.
             URL hostUrl = new URI(host.toURI()).toURL();
             return new URL(hostUrl, finalRequest.getURI().toString()).toExternalForm();
         } catch (Exception e) {
             // The original message stopped before the catch block; this
             // handler is a placeholder, not the poster's actual error handling.
             throw new RuntimeException(e);
         }

-- Ken
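
The combining step above amounts to resolving the request URI against the
target host. A minimal stdlib-only sketch of just that step follows. It uses
java.net.URI.resolve rather than the java.net.URL constructor from the
snippet, and the class name, method name, and sample values are assumptions
made for illustration (the host and path come from the Location header shown
earlier in the thread).

```java
import java.net.URI;

class FinalUrl {

    /**
     * Combines a target host such as "http://www.mtv.com" with the request
     * URI taken from the execution context. Both names are hypothetical.
     */
    static String combine(String hostUri, String requestUri) {
        // resolve() accepts a path-absolute request URI ("/path?query")
        // as well as a fully qualified one, which it returns unchanged.
        return URI.create(hostUri).resolve(requestUri).toString();
    }

    public static void main(String[] args) {
        System.out.println(
            combine("http://www.mtv.com", "/news/articles/1647243/20100907/story.jhtml"));
    }
}
```

Whether the request URI in the context is relative or absolute may depend on
the HttpClient version and proxy setup, so resolving covers both cases.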


Re: Why doesn't httpclient follow redirects on this URL?

Posted by "Stephen J. Butler" <st...@gmail.com>.
On Wed, Sep 8, 2010 at 2:38 PM, Jim <ji...@gmail.com> wrote:
> Stephen is there a way to issue a GET without bringing back the actual HTML
> content? My main goal is just to get the final URL of what the link is and
> bringing back all the content will slow things down quite a bit.

Depends on what your application is. If you're always hitting Google
first, you could set ClientPNames.HANDLE_REDIRECTS to false, send
Google a GET, follow the redirect yourself by reading the Location
header, and then send a HEAD to the redirect URL.

The risk is that whatever information you hope to get from HEAD isn't
correct. We've already seen one case where it fails. I have to say
that in all the webapps I've written, I don't think I've once thought
about HEAD requests.
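
The manual approach described above can be sketched with nothing but the JDK.
This is an illustration under assumptions, not HttpClient code: it uses
java.net.HttpURLConnection with setInstanceFollowRedirects(false) in place of
HttpClient's ClientPNames.HANDLE_REDIRECTS setting, and the class and method
names (ManualRedirect, redirectTarget, probe) are made up for the example. It
sends a GET, looks only at the status code and the Location header, and
disconnects without reading the body.

```java
import java.io.IOException;
import java.net.HttpURLConnection;
import java.net.URI;

class ManualRedirect {

    /** Returns the redirect target for a 3xx response, or null if there is none. */
    static URI redirectTarget(URI requestUri, int status, String location) {
        boolean isRedirect = status == 301 || status == 302 || status == 303
                || status == 307 || status == 308;
        if (!isRedirect || location == null) {
            return null;
        }
        // Location may be relative, so resolve it against the requested URI.
        return requestUri.resolve(location.trim());
    }

    /** Sends one GET without following redirects and reports where it points. */
    static URI probe(URI requestUri) throws IOException {
        HttpURLConnection conn = (HttpURLConnection) requestUri.toURL().openConnection();
        try {
            conn.setInstanceFollowRedirects(false);
            conn.setRequestMethod("GET");
            int status = conn.getResponseCode();
            return redirectTarget(requestUri, status, conn.getHeaderField("Location"));
        } finally {
            // Drop the connection; the response body is never read.
            conn.disconnect();
        }
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 1) {
            System.err.println("usage: java ManualRedirect.java <url>");
            return;
        }
        URI start = URI.create(args[0]);
        URI target = probe(start);
        // Fall back to the starting URL when the response was not a redirect.
        System.out.println(target != null ? target : start);
    }
}
```

Run it with the starting URL as the only argument. A Location value that
contains characters illegal in a URI (spaces, for example) would make
URI.resolve throw, so real code would need to handle that case, and a chain of
several redirects would need a loop with a hop limit.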

---------------------------------------------------------------------
To unsubscribe, e-mail: httpclient-users-unsubscribe@hc.apache.org
For additional commands, e-mail: httpclient-users-help@hc.apache.org


Re: Why doesn't httpclient follow redirects on this URL?

Posted by Jim <ji...@gmail.com>.
Stephen, is there a way to issue a GET without bringing back the actual HTML
content? My main goal is just to get the final URL that the link points to, and
bringing back all the content will slow things down quite a bit.




On Wed, Sep 8, 2010 at 12:18 PM, Stephen J. Butler <stephen.butler@gmail.com> wrote:

> There's your problem. Google doesn't respond to HEAD the same way as GET:

Re: Why doesn't httpclient follow redirects on this URL?

Posted by Jim <ji...@gmail.com>.
Stephen, thank you. You're correct: I see the same thing now that I do a HEAD
request manually. I switched to HttpGet and that correctly follows the 302.




On Wed, Sep 8, 2010 at 12:18 PM, Stephen J. Butler <stephen.butler@gmail.com> wrote:

> There's your problem. Google doesn't respond to HEAD the same way as GET:

Re: Why doesn't httpclient follow redirects on this URL?

Posted by "Stephen J. Butler" <st...@gmail.com>.
On Wed, Sep 8, 2010 at 1:42 PM, Jim <ji...@gmail.com> wrote:
> HttpHead httpget = null;
> HttpHost target = null;
> HttpUriRequest req = null;
>
> String startURL = "http://news.google.com/news/url?sa=t&fd=R&usg=AFQjCNGrJk-F7Dmshmtze2yhifxRsv8sRg&url=http://www.mtv.com/news/articles/1647243/20100907/story.jhtml";
> HttpContext localContext = new BasicHttpContext();
> localContext.setAttribute(ClientContext.COOKIE_STORE,HttpClientFetcher.emptyCookieStore);
> httpget = new HttpHead(startURL);

There's your problem. Google doesn't respond to HEAD the same way as GET:

$ nc news.google.com 80
HEAD /news/url?sa=t&fd=R&usg=AFQjCNGrJk-F7Dmshmtze2yhifxRsv8sRg&url=http://www.mtv.com/news/articles/1647243/20100907/story.jhtml HTTP/1.1
Host: news.google.com

HTTP/1.1 200 OK
Content-Type: text/html; charset=ISO-8859-1
Set-Cookie: PREF=ID=c0dc77b54e3366b4:TM=1283973424:LM=1283973424:S=5gVyGhbFXF9WJ_WY; expires=Fri, 07-Sep-2012 19:17:04 GMT; path=/; domain=.google.com
X-Content-Type-Options: nosniff
Date: Wed, 08 Sep 2010 19:17:04 GMT
Server: NFE/1.0
Content-Length: 0
X-XSS-Protection: 1; mode=block
Expires: Wed, 08 Sep 2010 19:17:04 GMT
Cache-Control: private
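
The difference is easy to reproduce without touching Google. The sketch below
is a simulation, not a capture of Google's behaviour: it starts a local
stand-in server (the JDK's com.sun.net.httpserver.HttpServer) that is
hard-coded to answer GET with a 302 and any other method with a 200, mirroring
the two responses shown in this thread, and then sends both methods with
redirect following switched off. The class and method names are hypothetical.

```java
import com.sun.net.httpserver.HttpServer;
import java.io.IOException;
import java.net.HttpURLConnection;
import java.net.InetSocketAddress;
import java.net.URI;
import java.net.URL;

class HeadVersusGet {

    /** Sends one request without following redirects and returns the status code. */
    static int statusFor(String method, URL url) throws IOException {
        HttpURLConnection conn = (HttpURLConnection) url.openConnection();
        try {
            conn.setInstanceFollowRedirects(false);
            conn.setRequestMethod(method);
            return conn.getResponseCode();
        } finally {
            conn.disconnect();
        }
    }

    /** Starts a local stand-in server: 302 for GET, 200 for every other method. */
    static HttpServer startStandIn() throws IOException {
        // Port 0 asks the OS for any free port on the loopback interface.
        HttpServer server = HttpServer.create(new InetSocketAddress("127.0.0.1", 0), 0);
        server.createContext("/news/url", exchange -> {
            if ("GET".equals(exchange.getRequestMethod())) {
                exchange.getResponseHeaders().set("Location",
                        "http://www.mtv.com/news/articles/1647243/20100907/story.jhtml");
                exchange.sendResponseHeaders(302, -1); // -1 means no response body
            } else {
                exchange.sendResponseHeaders(200, -1);
            }
            exchange.close();
        });
        server.start();
        return server;
    }

    public static void main(String[] args) throws IOException {
        HttpServer server = startStandIn();
        try {
            int port = server.getAddress().getPort();
            URL url = URI.create("http://127.0.0.1:" + port + "/news/url").toURL();
            System.out.println("GET  -> " + statusFor("GET", url));
            System.out.println("HEAD -> " + statusFor("HEAD", url));
        } finally {
            server.stop(0);
        }
    }
}
```

A client that follows redirects only acts on a 3xx status, so when the HEAD
response is a 200 there is nothing for it to follow, and the request left in
the execution context still points at the original URL.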
