Posted to httpclient-users@hc.apache.org by Uncle <un...@gmail.com> on 2012/03/24 13:50:48 UTC

Trying to follow 301 redirects results in 404 error

Apologies if this has been addressed; I searched the archives and was unable to find anything directly relating to this, though it seems straightforward.

I am trying to use HttpClient to obtain the redirect URL for a URL such as http://bit.ly/GGviSv, but I am getting a 404 error.  This is a "permanent" redirect (code 301).  This code:

        String url = "http://bit.ly/GGviSv";
        HttpGet httpget = new HttpGet(url);
        HttpContext context = new BasicHttpContext();
        HttpClient httpclient = new DefaultHttpClient();

        HttpResponse response = httpclient.execute(httpget, context);

        RedirectStrategy redirectStrategy = new DefaultRedirectStrategy();

        log.info("isRedirected = " + redirectStrategy.isRedirected(httpget, response, context));
        for(Header header : response.getAllHeaders())
            log.info("header: " + header);

        log.info("status = " + response.getStatusLine());

outputs:

isRedirected = false
header: Server: nginx
header: Date: Sat, 24 Mar 2012 12:38:43 GMT
header: Content-Type: text/html; charset=UTF-8
header: Transfer-Encoding: chunked
header: Connection: keep-alive
header: Vary: Cookie
header: X-CF-Powered-By: WP 1.2.0
header: X-Pingback: http://lavamagazine.com/xmlrpc.php
header: Expires: Wed, 11 Jan 1984 05:00:00 GMT
header: Last-Modified: Sat, 24 Mar 2012 12:38:43 GMT
header: Cache-Control: no-cache, must-revalidate, max-age=0
header: Pragma: no-cache
status = HTTP/1.1 404 Not Found

I expected 1) isRedirected to be true, 2) the response code to be 301, and/or 3) the destination URL to be in the headers where I could get it.  However, if I ignore the 404 and continue getting the URL:

        HttpUriRequest currentReq = (HttpUriRequest) context.getAttribute( ExecutionContext.HTTP_REQUEST );
        HttpHost currentHost = (HttpHost)  context.getAttribute(ExecutionContext.HTTP_TARGET_HOST);
        String currentUrl = (currentReq.getURI().isAbsolute()) ? currentReq.getURI().toString() : (currentHost.toURI() + currentReq.getURI());
        httpclient.getConnectionManager().shutdown();
        log.info("Redirected URL = " + currentUrl);

This does the right thing and provides me with the correct URL.  So, why the 404 error?  I am processing a large number of URLs and need to accurately determine which ones are errors, redirects, etc.

Thanks for any assistance.

Randy
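
The post above wants each URL sorted into errors, redirects, and so on, and wants to see the 301 and its Location header. One way to observe every hop is to follow redirects by hand. The following is only a sketch using the JDK's HttpURLConnection rather than HttpClient; the class name RedirectProbe, the Kind buckets, and the hop limit are illustrative assumptions, not anything from the thread.

```java
import java.io.IOException;
import java.net.HttpURLConnection;
import java.net.URL;

final class RedirectProbe {

    /** Coarse outcome buckets for one HTTP response (hypothetical names). */
    enum Kind { SUCCESS, REDIRECT, CLIENT_ERROR, SERVER_ERROR, OTHER }

    /** Buckets a status code by its class (2xx, 3xx, 4xx, 5xx). */
    static Kind classify(int status) {
        switch (status / 100) {
            case 2: return Kind.SUCCESS;
            case 3: return Kind.REDIRECT;
            case 4: return Kind.CLIENT_ERROR;
            case 5: return Kind.SERVER_ERROR;
            default: return Kind.OTHER;
        }
    }

    /** Follows redirects manually so every hop's status can be printed. */
    static void trace(String startUrl, int maxHops) throws IOException {
        URL current = new URL(startUrl);
        for (int hop = 0; hop < maxHops; hop++) {
            HttpURLConnection conn = (HttpURLConnection) current.openConnection();
            conn.setInstanceFollowRedirects(false); // report the 301 instead of chasing it
            conn.setConnectTimeout(15000);
            conn.setReadTimeout(30000);
            int status = conn.getResponseCode();
            String location = conn.getHeaderField("Location");
            conn.disconnect();
            System.out.println(current + " -> " + status + " (" + classify(status) + ")");
            if (classify(status) != Kind.REDIRECT || location == null) {
                return; // final answer: success, error, or a 3xx without a target
            }
            current = new URL(current, location); // also resolves relative Location values
        }
        System.out.println("Gave up after " + maxHops + " hops");
    }

    public static void main(String[] args) throws IOException {
        if (args.length == 1) {
            trace(args[0], 10);
        } else {
            System.out.println("usage: java RedirectProbe <url>");
        }
    }
}
```

Run against a short link, this would print one line per hop, so a 301 followed by a 404 shows up as two separate results instead of only the final status.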


---------------------------------------------------------------------
To unsubscribe, e-mail: httpclient-users-unsubscribe@hc.apache.org
For additional commands, e-mail: httpclient-users-help@hc.apache.org


Re: Trying to follow 301 redirects results in 404 error

Posted by Oleg Kalnichevski <ol...@apache.org>.
On Sun, 2012-03-25 at 14:19 -0400, Uncle wrote:
> > It is not HttpClient reporting a wrong response status. It is the server
> > behaving incorrectly. I get the same 404 when accessing the location
> > directly.
> 
> What do you mean "directly"?
> 

Without redirect.

> > The problem is that the server does not correctly handle URI
> > fragment (the #axzz1pdAzTzT2 bit). The HTTP spec does not explicitly
> > state how fragments in redirect locations should be handled. So, in my
> > opinion it is a server side issue. 
> 
> In my opinion, if 5 clients (HttpURLConnection, HttpClient, Chrome, Safari, Firefox) try to hit the URL, and 4 of them do so successfully and one does not, the issue is with the one client, not with the server.  Many URLs are poorly formed or ambiguous, yet most clients take extra steps to access them, which makes them more useful.

HttpClient is not a browser, but you are certainly entitled to have a
different opinion.

>  I think that HttpClient should either do that or provide facilities for doing so.
> 

It does. One can handle redirects differently by implementing a custom
RedirectStrategy and rewriting malformed redirect URIs in a way that is
acceptable in the context of a specific application.
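
As a rough illustration of the rewriting step only, here is a minimal sketch that drops the fragment from a redirect location using nothing but java.net.URI. The class and method names are made up for this example. How it gets wired into a custom RedirectStrategy is left open; overriding DefaultRedirectStrategy.createLocationURI (the method visible in the stack trace in this thread) is one possible hook, but that is an assumption to check against the HttpClient version in use.

```java
import java.net.URI;

final class FragmentStripper {

    /** Returns the location without its fragment, or the same URI if it has none. */
    static URI stripFragment(URI location) {
        if (location.getRawFragment() == null) {
            return location;
        }
        String text = location.toString();
        // In a successfully parsed URI the first '#' is always the fragment delimiter.
        return URI.create(text.substring(0, text.indexOf('#')));
    }

    public static void main(String[] args) {
        URI redirect = URI.create(
            "http://lavamagazine.com/features/video-biking-the-ironman-melbourne-run-course/#axzz1pdAzTzT2");
        // prints http://lavamagazine.com/features/video-biking-the-ironman-melbourne-run-course/
        System.out.println(stripFragment(redirect));
    }
}
```

Working on the string form avoids the multi-argument URI constructors, which would re-quote any '%' already present in the location.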

> > The URL has illegal character(s), which is the reason why the redirect
> > fails. 
> 
> The Java toolkit and browsers URLEncode the URL, which avoids this problem. This seems like a good general approach when redirecting.
> 

See above.

Oleg






Re: Trying to follow 301 redirects results in 404 error

Posted by Uncle <un...@gmail.com>.
> It is not HttpClient reporting a wrong response status. It is the server
> behaving incorrectly. I get the same 404 when accessing the location
> directly.

What do you mean "directly"?

> The problem is that the server does not correctly handle URI
> fragment (the #axzz1pdAzTzT2 bit). The HTTP spec does not explicitly
> state how fragments in redirect locations should be handled. So, in my
> opinion it is a server side issue. 

In my opinion, if 5 clients (HttpURLConnection, HttpClient, Chrome, Safari, Firefox) try to hit the URL, and 4 of them do so successfully and one does not, the issue is with the one client, not with the server.  Many URLs are poorly formed or ambiguous, yet most clients take extra steps to access them, which makes them more useful.  I think that HttpClient should either do that or provide facilities for doing so.

> The URL has illegal character(s), which is the reason why the redirect
> fails. 

The Java toolkit and browsers URLEncode the URL, which avoids this problem. This seems like a good general approach when redirecting.
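
A sketch of what such encoding could look like for a redirect location, using only the JDK. It percent-encodes every byte that is not a legal ASCII URI character and leaves existing '%' escapes alone. The class name, the method, and the charset parameter are assumptions for illustration. The charset says how to turn the string back into the bytes the server sent: ISO-8859-1 if each header byte was widened to one char (which is what the "â??" in the stack trace quoted below suggests), UTF-8 if the string already holds real Unicode text.

```java
import java.net.URI;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

final class LenientUriEscaper {

    // ASCII characters that RFC 3986 allows somewhere in a URI, plus '%' so that
    // existing escapes are kept. Placement is not validated here.
    private static final String LEGAL =
        "abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789"
        + "-._~:/?#[]@!$&'()*+,;=%";

    /** Percent-encodes every byte of raw that is not a legal ASCII URI character. */
    static String escapeIllegal(String raw, Charset bytesOf) {
        StringBuilder out = new StringBuilder(raw.length() + 16);
        for (byte b : raw.getBytes(bytesOf)) {
            int v = b & 0xFF;
            if (v < 0x80 && LEGAL.indexOf((char) v) >= 0) {
                out.append((char) v);
            } else {
                out.append('%').append(String.format("%02X", v));
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // The three UTF-8 bytes of U+2019, each widened to one char.
        String fromHeader = "http://blogs.wsj.com/speakeasy/2012/03/22/"
            + "coroner-rules-whitney-houston\u00E2\u0080\u0099s-death-an-accident/?mod=e2tw";
        String escaped = escapeIllegal(fromHeader, StandardCharsets.ISO_8859_1);
        System.out.println(escaped);
        System.out.println(URI.create(escaped).getHost()); // parses without an exception
    }
}
```

Whether blanket escaping is appropriate depends on the application; it cannot tell a literal '%' from the start of an escape, which is one reason a library might decline to do it by default.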

Randy

On Mar 24, 2012, at 7:59 PM, Oleg Kalnichevski wrote:

> On Sat, 2012-03-24 at 16:46 -0400, Uncle wrote:
>> On Mar 24, 2012, at 2:48 PM, Oleg Kalnichevski wrote:
>> 
>>> On Sat, 2012-03-24 at 08:50 -0400, Uncle wrote:
>>>> Apologies if this has been addressed, I searched the archives and was unable to find anything directly relating to this, though it seems straightforward.
>>>> 
>>>> I am trying to use httpclient to obtain the redirect URL for a url such as http://bit.ly/GGviSv, but I am getting a 404 error.  This is a "permanent" redirect (code 301).  This code:
>>>> 
>>>>       String url = "http://bit.ly/GGviSv";
>>>>       HttpGet httpget = new HttpGet(url);
>>>>       HttpContext context = new BasicHttpContext();
>>>>       HttpClient httpclient = new DefaultHttpClient();
>>>> 
>>>>       HttpResponse response = httpclient.execute(httpget, context);
>>>> 
>>>>       RedirectStrategy redirectStrategy = new DefaultRedirectStrategy();
>>>> 
>>>>       log.info("isRedirected = " + redirectStrategy.isRedirected(httpget, response, context));
>>>>       for(Header header : response.getAllHeaders())
>>>>           log.info("header: " + header);
>>>> 
>>>>       log.info("status = " + response.getStatusLine());
>>>> 
>>>> outputs:
>>>> 
>>>> isRedirected = false
>>>> header: Server: nginx
>>>> header: Date: Sat, 24 Mar 2012 12:38:43 GMT
>>>> header: Content-Type: text/html; charset=UTF-8
>>>> header: Transfer-Encoding: chunked
>>>> header: Connection: keep-alive
>>>> header: Vary: Cookie
>>>> header: X-CF-Powered-By: WP 1.2.0
>>>> header: X-Pingback: http://lavamagazine.com/xmlrpc.php
>>>> header: Expires: Wed, 11 Jan 1984 05:00:00 GMT
>>>> header: Last-Modified: Sat, 24 Mar 2012 12:38:43 GMT
>>>> header: Cache-Control: no-cache, must-revalidate, max-age=0
>>>> header: Pragma: no-cache
>>>> status = HTTP/1.1 404 Not Found
>>>> 
>>>> I expected 1) isRedirected to be true, 2) the response code to be 301, and/or 3) the destination URL to be in the headers where I could get it.  However, if I ignore the 404 and continue getting the URL:
>>>> 
>>>>       HttpUriRequest currentReq = (HttpUriRequest) context.getAttribute( ExecutionContext.HTTP_REQUEST );
>>>>       HttpHost currentHost = (HttpHost)  context.getAttribute(ExecutionContext.HTTP_TARGET_HOST);
>>>>       String currentUrl = (currentReq.getURI().isAbsolute()) ? currentReq.getURI().toString() : (currentHost.toURI() + currentReq.getURI());
>>>>       httpclient.getConnectionManager().shutdown();
>>>>       log.info("Redirected URL = " + currentUrl);
>>>> 
>>>> This does the right thing and provides me with the correct URL.  So, why the 404 error?  I am processing a large number of URLs and need to accurately determine which ones are errors, redirects, etc.
>>>> 
>>>> Thanks for any assistance.
>>>> 
>>>> Randy
>>>> 
>>> 
>>> As far as I can tell HttpClient correctly redirects to the new location,
>>> but the resource is simply no longer there.
>>> 
>>> [DEBUG] headers - >> GET /GGviSv HTTP/1.1
>>> [DEBUG] headers - >> Host: bit.ly
>>> [DEBUG] headers - >> Connection: Keep-Alive
>>> [DEBUG] headers - >> User-Agent: Apache-HttpClient/4.2-beta2-SNAPSHOT (java 1.5)
>>> [DEBUG] headers - << HTTP/1.1 301 Moved
>>> [DEBUG] headers - << Server: nginx
>>> [DEBUG] headers - << Date: Sat, 24 Mar 2012 18:46:44 GMT
>>> [DEBUG] headers - << Content-Type: text/html; charset=utf-8
>>> [DEBUG] headers - << Connection: keep-alive
>>> [DEBUG] headers - << Set-Cookie: _bit=4f6e1694-00156-016bf-3d1cf10a;domain=.bit.ly;expires=Thu Sep 20 18:46:44 2012;path=/; HttpOnly
>>> [DEBUG] headers - << Cache-control: private; max-age=90
>>> [DEBUG] headers - << Location: http://lavamagazine.com/features/video-biking-the-ironman-melbourne-run-course/#axzz1pdAzTzT2
>>> [DEBUG] headers - << MIME-Version: 1.0
>>> [DEBUG] headers - << Content-Length: 185
>>> [DEBUG] headers - >> GET /features/video-biking-the-ironman-melbourne-run-course/#axzz1pdAzTzT2 HTTP/1.1
>>> [DEBUG] headers - >> Host: lavamagazine.com
>>> [DEBUG] headers - >> Connection: Keep-Alive
>>> [DEBUG] headers - >> User-Agent: Apache-HttpClient/4.2-beta2-SNAPSHOT (java 1.5)
>>> [DEBUG] headers - << HTTP/1.1 404 Not Found
>>> [DEBUG] headers - << Server: nginx
>>> [DEBUG] headers - << Date: Sat, 24 Mar 2012 18:46:45 GMT
>>> [DEBUG] headers - << Content-Type: text/html; charset=UTF-8
>>> [DEBUG] headers - << Transfer-Encoding: chunked
>>> [DEBUG] headers - << Connection: keep-alive
>>> [DEBUG] headers - << Vary: Cookie
>>> [DEBUG] headers - << X-CF-Powered-By: WP 1.2.0
>>> [DEBUG] headers - << X-Pingback: http://lavamagazine.com/xmlrpc.php
>>> [DEBUG] headers - << Expires: Wed, 11 Jan 1984 05:00:00 GMT
>>> [DEBUG] headers - << Last-Modified: Sat, 24 Mar 2012 18:46:45 GMT
>>> [DEBUG] headers - << Cache-Control: no-cache, must-revalidate, max-age=0
>>> [DEBUG] headers - << Pragma: no-cache
>>> 
>>> Oleg
>>> 
>> 
>> Yet, if you hit the URL: 
>> 
>> http://lavamagazine.com/features/video-biking-the-ironman-melbourne-run-course/#axzz1pdAzTzT2
>> 
>> with your browser, the content comes up fine.  
>> 
>> Hitting the redirect URL with the standard Java HttpURLConnection class does not produce the 404:
>> 
>>       String url = "http://bit.ly/GGviSv";
>>        URL urlObj = new URL(url);
>>        HttpURLConnection urlConnection = (HttpURLConnection)urlObj.openConnection();
>>        urlConnection.setRequestMethod("GET");
>>        urlConnection.setConnectTimeout(15000);
>>        urlConnection.setReadTimeout(30000);
>>        urlConnection.connect();
>>        log.info("Response code = " + urlConnection.getResponseCode());
>>        InputStream inputStream = urlConnection.getInputStream();
>>        log.info("Redirected URL = " + urlConnection.getURL().toString());
>> 
>> This outputs:
>> 
>> Response code = 200
>> Redirected URL = http://lavamagazine.com/features/video-biking-the-ironman-melbourne-run-course/#axzz1pdAzTzT2
>> 
>> So HttpClient reports a 404, but HttpURLConnection reports a 200, and my browsers (Safari, Chrome, and Firefox) all hit the link fine.
>> 
> 
> It is not HttpClient reporting a wrong response status. It is the server
> behaving incorrectly. I get the same 404 when accessing the location
> directly. The problem is that the server does not correctly handle URI
> fragment (the #axzz1pdAzTzT2 bit). The HTTP spec does not explicitly
> state how fragments in redirect locations should be handled. So, in my
> opinion it is a server side issue. 
> 
> You can work around the problem by using a custom redirect strategy that
> rewrites the redirect location and strips away the fragment if present.
> 
> [DEBUG] headers - >> GET /features/video-biking-the-ironman-melbourne-run-course/#axzz1pdAzTzT2 HTTP/1.1
> [DEBUG] headers - >> Host: lavamagazine.com
> [DEBUG] headers - >> Connection: Keep-Alive
> [DEBUG] headers - >> User-Agent: Apache-HttpClient/4.2-beta2-SNAPSHOT (java 1.5)
> [DEBUG] headers - << HTTP/1.1 404 Not Found
> [DEBUG] headers - << Server: nginx
> [DEBUG] headers - << Date: Sat, 24 Mar 2012 23:31:10 GMT
> [DEBUG] headers - << Content-Type: text/html; charset=UTF-8
> [DEBUG] headers - << Transfer-Encoding: chunked
> [DEBUG] headers - << Connection: keep-alive
> [DEBUG] headers - << Vary: Cookie
> [DEBUG] headers - << X-CF-Powered-By: WP 1.2.0
> [DEBUG] headers - << X-Pingback: http://lavamagazine.com/xmlrpc.php
> [DEBUG] headers - << Expires: Wed, 11 Jan 1984 05:00:00 GMT
> [DEBUG] headers - << Last-Modified: Sat, 24 Mar 2012 23:31:10 GMT
> [DEBUG] headers - << Cache-Control: no-cache, must-revalidate, max-age=0
> [DEBUG] headers - << Pragma: no-cache
> 
> 
>> Here is another URL that is problematic:
>> 
>> http://on.wsj.com/GHGlfS
>> 
>> this produces:
>> 
>> org.apache.http.client.ClientProtocolException
>> 	at org.apache.http.impl.client.AbstractHttpClient.execute(AbstractHttpClient.java:822)
>> 	at org.apache.http.impl.client.AbstractHttpClient.execute(AbstractHttpClient.java:754)
>> ... snip ...
>> Caused by: org.apache.http.ProtocolException: Invalid redirect URI: http://blogs.wsj.com/speakeasy/2012/03/22/coroner-rules-whitney-houstonâ??s-death-an-accident/?mod=e2tw
>> 	at org.apache.http.impl.client.DefaultRedirectStrategy.createLocationURI(DefaultRedirectStrategy.java:185)
>> 	at org.apache.http.impl.client.DefaultRedirectStrategy.getLocationURI(DefaultRedirectStrategy.java:116)
>> 	at org.apache.http.impl.client.DefaultRedirectStrategy.getRedirect(DefaultRedirectStrategy.java:193)
>> 	at org.apache.http.impl.client.DefaultRequestDirector.handleResponse(DefaultRequestDirector.java:1035)
>> 	at org.apache.http.impl.client.DefaultRequestDirector.execute(DefaultRequestDirector.java:492)
>> 	at org.apache.http.impl.client.AbstractHttpClient.execute(AbstractHttpClient.java:820)
>> 	... 28 more
>> Caused by: java.net.URISyntaxException: Illegal character in path at index 72: http://blogs.wsj.com/speakeasy/2012/03/22/coroner-rules-whitney-houstonâ??s-death-an-accident/?mod=e2tw
>> 	at java.net.URI$Parser.fail(URI.java:2809)
>> 	at java.net.URI$Parser.checkChars(URI.java:2982)
>> 	at java.net.URI$Parser.parseHierarchical(URI.java:3066)
>> 	at java.net.URI$Parser.parse(URI.java:3014)
>> 	at java.net.URI.<init>(URI.java:578)
>> 	at org.apache.http.impl.client.DefaultRedirectStrategy.createLocationURI(DefaultRedirectStrategy.java:183)
>> 	... 33 more
>> 
>> The redirected URL has a special character in it (single quote), and the client doesn't handle that.  The Java code that I pasted above produces
>> 
> 
> The URL has illegal character(s), which is the reason why the redirect
> fails. 
> 
> Oleg
> 
>> Response code = 200
>> Redirected URL = http://blogs.wsj.com/speakeasy/2012/03/22/coroner-rules-whitney-houston%e2%80%99s-death-an-accident/?%3fs-death-an-accident/%3fmod=e2tw
>> 
>> Randy


Re: Trying to follow 301 redirects results in 404 error

Posted by Oleg Kalnichevski <ol...@apache.org>.
On Sat, 2012-03-24 at 16:46 -0400, Uncle wrote:
> On Mar 24, 2012, at 2:48 PM, Oleg Kalnichevski wrote:
> 
> > On Sat, 2012-03-24 at 08:50 -0400, Uncle wrote:
> >> Apologies if this has been addressed, I searched the archives and was unable to find anything directly relating to this, though it seems straightforward.
> >> 
> >> I am trying to use httpclient to obtain the redirect URL for a url such as http://bit.ly/GGviSv, but I am getting a 404 error.  This is a "permanent" redirect (code 301).  This code:
> >> 
> >>        String url = "http://bit.ly/GGviSv";
> >>        HttpGet httpget = new HttpGet(url);
> >>        HttpContext context = new BasicHttpContext();
> >>        HttpClient httpclient = new DefaultHttpClient();
> >> 
> >>        HttpResponse response = httpclient.execute(httpget, context);
> >> 
> >>        RedirectStrategy redirectStrategy = new DefaultRedirectStrategy();
> >> 
> >>        log.info("isRedirected = " + redirectStrategy.isRedirected(httpget, response, context));
> >>        for(Header header : response.getAllHeaders())
> >>            log.info("header: " + header);
> >> 
> >>        log.info("status = " + response.getStatusLine());
> >> 
> >> outputs:
> >> 
> >> isRedirected = false
> >> header: Server: nginx
> >> header: Date: Sat, 24 Mar 2012 12:38:43 GMT
> >> header: Content-Type: text/html; charset=UTF-8
> >> header: Transfer-Encoding: chunked
> >> header: Connection: keep-alive
> >> header: Vary: Cookie
> >> header: X-CF-Powered-By: WP 1.2.0
> >> header: X-Pingback: http://lavamagazine.com/xmlrpc.php
> >> header: Expires: Wed, 11 Jan 1984 05:00:00 GMT
> >> header: Last-Modified: Sat, 24 Mar 2012 12:38:43 GMT
> >> header: Cache-Control: no-cache, must-revalidate, max-age=0
> >> header: Pragma: no-cache
> >> status = HTTP/1.1 404 Not Found
> >> 
> >> I expected 1) isRedirected to be true, 2) the response code to be 301, and/or 3) the destination URL to be in the headers where I could get it.  However, if I ignore the 404 and continue getting the URL:
> >> 
> >>        HttpUriRequest currentReq = (HttpUriRequest) context.getAttribute( ExecutionContext.HTTP_REQUEST );
> >>        HttpHost currentHost = (HttpHost)  context.getAttribute(ExecutionContext.HTTP_TARGET_HOST);
> >>        String currentUrl = (currentReq.getURI().isAbsolute()) ? currentReq.getURI().toString() : (currentHost.toURI() + currentReq.getURI());
> >>        httpclient.getConnectionManager().shutdown();
> >>        log.info("Redirected URL = " + currentUrl);
> >> 
> >> This does the right thing and provides me with the correct URL.  So, why the 404 error?  I am processing a large number of URLs and need to accurately determine which ones are errors, redirects, etc.
> >> 
> >> Thanks for any assistance.
> >> 
> >> Randy
> >> 
> > 
> > As far as I can tell HttpClient correctly redirects to the new location,
> > but the resource is simply no longer there.
> > 
> > [DEBUG] headers - >> GET /GGviSv HTTP/1.1
> > [DEBUG] headers - >> Host: bit.ly
> > [DEBUG] headers - >> Connection: Keep-Alive
> > [DEBUG] headers - >> User-Agent: Apache-HttpClient/4.2-beta2-SNAPSHOT
> > (java 1.5)
> > [DEBUG] headers - << HTTP/1.1 301 Moved
> > [DEBUG] headers - << Server: nginx
> > [DEBUG] headers - << Date: Sat, 24 Mar 2012 18:46:44 GMT
> > [DEBUG] headers - << Content-Type: text/html; charset=utf-8
> > [DEBUG] headers - << Connection: keep-alive
> > [DEBUG] headers - << Set-Cookie:
> > _bit=4f6e1694-00156-016bf-3d1cf10a;domain=.bit.ly;expires=Thu Sep 20
> > 18:46:44 2012;path=/; HttpOnly
> > [DEBUG] headers - << Cache-control: private; max-age=90
> > [DEBUG] headers - << Location:
> > http://lavamagazine.com/features/video-biking-the-ironman-melbourne-run-course/#axzz1pdAzTzT2
> > [DEBUG] headers - << MIME-Version: 1.0
> > [DEBUG] headers - << Content-Length: 185
> > [DEBUG] headers - >> GET /features/video-biking-the-ironman-melbourne-run-course/#axzz1pdAzTzT2 HTTP/1.1
> > [DEBUG] headers - >> Host: lavamagazine.com
> > [DEBUG] headers - >> Connection: Keep-Alive
> > [DEBUG] headers - >> User-Agent: Apache-HttpClient/4.2-beta2-SNAPSHOT (java 1.5)
> > [DEBUG] headers - << HTTP/1.1 404 Not Found
> > [DEBUG] headers - << Server: nginx
> > [DEBUG] headers - << Date: Sat, 24 Mar 2012 18:46:45 GMT
> > [DEBUG] headers - << Content-Type: text/html; charset=UTF-8
> > [DEBUG] headers - << Transfer-Encoding: chunked
> > [DEBUG] headers - << Connection: keep-alive
> > [DEBUG] headers - << Vary: Cookie
> > [DEBUG] headers - << X-CF-Powered-By: WP 1.2.0
> > [DEBUG] headers - << X-Pingback: http://lavamagazine.com/xmlrpc.php
> > [DEBUG] headers - << Expires: Wed, 11 Jan 1984 05:00:00 GMT
> > [DEBUG] headers - << Last-Modified: Sat, 24 Mar 2012 18:46:45 GMT
> > [DEBUG] headers - << Cache-Control: no-cache, must-revalidate, max-age=0
> > [DEBUG] headers - << Pragma: no-cache
> > 
> > Oleg
> > 
> 
> Yet, if you hit the URL: 
> 
> http://lavamagazine.com/features/video-biking-the-ironman-melbourne-run-course/#axzz1pdAzTzT2
> 
> with your browser, the content comes up fine.  
> 
> Hitting the redirect URL with the standard Java HttpURLConnection class does not produce the 404:
> 
>         String url = "http://bit.ly/GGviSv";
>         URL urlObj = new URL(url);
>         HttpURLConnection urlConnection = (HttpURLConnection)urlObj.openConnection();
>         urlConnection.setRequestMethod("GET");
>         urlConnection.setConnectTimeout(15000);
>         urlConnection.setReadTimeout(30000);
>         urlConnection.connect();
>         log.info("Response code = " + urlConnection.getResponseCode());
>         InputStream inputStream = urlConnection.getInputStream();
>         log.info("Redirected URL = " + urlConnection.getURL().toString());
> 
> This outputs:
> 
> Response code = 200
> Redirected URL = http://lavamagazine.com/features/video-biking-the-ironman-melbourne-run-course/#axzz1pdAzTzT2
> 
> So HttpClient reports a 404, but HttpURLConnection reports a 200 and my browsers (Safari, Chrome, and Firefox) all hit the link fine.
> 

It is not HttpClient reporting a wrong response status. It is the server
behaving incorrectly. I get the same 404 when accessing the location
directly. The problem is that the server does not correctly handle the
URI fragment (the #axzz1pdAzTzT2 bit). The HTTP spec does not explicitly
state how fragments in redirect locations should be handled. So, in my
opinion, it is a server-side issue.

You can work around the problem by using a custom redirect strategy that
rewrites the redirect location and strips away the fragment if present.
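
For illustration only, the fragment-stripping step of such a workaround
might look like the sketch below. The class and method names are made up
for this example and are not part of HttpClient; a custom redirect
strategy would apply something like this to the Location value before
building the follow-up request.

```java
final class FragmentStripper {

    /**
     * Returns the location with everything from the first '#' onward removed.
     * Hypothetical helper for illustration, not an HttpClient API.
     */
    static String stripFragment(String location) {
        int hash = location.indexOf('#');
        return hash >= 0 ? location.substring(0, hash) : location;
    }

    public static void main(String[] args) {
        String location = "http://lavamagazine.com/features/"
                + "video-biking-the-ironman-melbourne-run-course/#axzz1pdAzTzT2";
        // Prints the location without the "#axzz1pdAzTzT2" part.
        System.out.println(stripFragment(location));
    }
}
```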

[DEBUG] headers - >> GET /features/video-biking-the-ironman-melbourne-run-course/#axzz1pdAzTzT2 HTTP/1.1
[DEBUG] headers - >> Host: lavamagazine.com
[DEBUG] headers - >> Connection: Keep-Alive
[DEBUG] headers - >> User-Agent: Apache-HttpClient/4.2-beta2-SNAPSHOT (java 1.5)
[DEBUG] headers - << HTTP/1.1 404 Not Found
[DEBUG] headers - << Server: nginx
[DEBUG] headers - << Date: Sat, 24 Mar 2012 23:31:10 GMT
[DEBUG] headers - << Content-Type: text/html; charset=UTF-8
[DEBUG] headers - << Transfer-Encoding: chunked
[DEBUG] headers - << Connection: keep-alive
[DEBUG] headers - << Vary: Cookie
[DEBUG] headers - << X-CF-Powered-By: WP 1.2.0
[DEBUG] headers - << X-Pingback: http://lavamagazine.com/xmlrpc.php
[DEBUG] headers - << Expires: Wed, 11 Jan 1984 05:00:00 GMT
[DEBUG] headers - << Last-Modified: Sat, 24 Mar 2012 23:31:10 GMT
[DEBUG] headers - << Cache-Control: no-cache, must-revalidate, max-age=0
[DEBUG] headers - << Pragma: no-cache


> Here is another URL that is problematic:
> 
> http://on.wsj.com/GHGlfS
> 
> this produces:
> 
> org.apache.http.client.ClientProtocolException
> 	at org.apache.http.impl.client.AbstractHttpClient.execute(AbstractHttpClient.java:822)
> 	at org.apache.http.impl.client.AbstractHttpClient.execute(AbstractHttpClient.java:754)
> ... snip ...
> Caused by: org.apache.http.ProtocolException: Invalid redirect URI: http://blogs.wsj.com/speakeasy/2012/03/22/coroner-rules-whitney-houstonâ??s-death-an-accident/?mod=e2tw
> 	at org.apache.http.impl.client.DefaultRedirectStrategy.createLocationURI(DefaultRedirectStrategy.java:185)
> 	at org.apache.http.impl.client.DefaultRedirectStrategy.getLocationURI(DefaultRedirectStrategy.java:116)
> 	at org.apache.http.impl.client.DefaultRedirectStrategy.getRedirect(DefaultRedirectStrategy.java:193)
> 	at org.apache.http.impl.client.DefaultRequestDirector.handleResponse(DefaultRequestDirector.java:1035)
> 	at org.apache.http.impl.client.DefaultRequestDirector.execute(DefaultRequestDirector.java:492)
> 	at org.apache.http.impl.client.AbstractHttpClient.execute(AbstractHttpClient.java:820)
> 	... 28 more
> Caused by: java.net.URISyntaxException: Illegal character in path at index 72: http://blogs.wsj.com/speakeasy/2012/03/22/coroner-rules-whitney-houstonâ??s-death-an-accident/?mod=e2tw
> 	at java.net.URI$Parser.fail(URI.java:2809)
> 	at java.net.URI$Parser.checkChars(URI.java:2982)
> 	at java.net.URI$Parser.parseHierarchical(URI.java:3066)
> 	at java.net.URI$Parser.parse(URI.java:3014)
> 	at java.net.URI.<init>(URI.java:578)
> 	at org.apache.http.impl.client.DefaultRedirectStrategy.createLocationURI(DefaultRedirectStrategy.java:183)
> 	... 33 more
> 
> The redirected URL has a special character in it (single quote), and the client doesn't handle that.  The Java code that I pasted above produces
> 

The URL has illegal character(s), which is the reason why the redirect
fails. 
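
One possible way to cope with such locations is sketched below. It
assumes the raw header value was decoded as ISO-8859-1, so that each
char corresponds to one byte on the wire (this would be consistent with
index 72 in the trace and with the %e2%80%99 in the HttpURLConnection
output quoted in this thread, but it is an assumption). Every byte
outside printable ASCII is percent-encoded before the string is handed
to java.net.URI. The class and method names are invented for the
example.

```java
import java.net.URI;
import java.nio.charset.StandardCharsets;

final class LocationEscaper {

    /**
     * Percent-encodes control characters, space, DEL and all bytes >= 0x80.
     * Assumes every char is <= U+00FF (i.e. the value was decoded as
     * ISO-8859-1); anything above that would be turned into '?' by getBytes.
     * Existing '%' escapes and other ASCII characters are left untouched.
     */
    static String escapeLatin1(String location) {
        StringBuilder out = new StringBuilder();
        for (byte b : location.getBytes(StandardCharsets.ISO_8859_1)) {
            int v = b & 0xFF;
            if (v <= 0x20 || v >= 0x7F) {
                out.append('%').append(String.format("%02X", v));
            } else {
                out.append((char) v);
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // The three chars after "houston" stand for the bytes E2 80 99
        // as they would appear after ISO-8859-1 decoding.
        String raw = "http://blogs.wsj.com/speakeasy/2012/03/22/"
                + "coroner-rules-whitney-houston\u00e2\u0080\u0099s-death-an-accident/?mod=e2tw";
        URI uri = URI.create(escapeLatin1(raw));
        System.out.println(uri);
    }
}
```

This only deals with bytes outside printable ASCII; other characters
that java.net.URI rejects (for example '"' or '{') would need separate
handling.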

Oleg

> Response code = 200
> Redirected URL = http://blogs.wsj.com/speakeasy/2012/03/22/coroner-rules-whitney-houston%e2%80%99s-death-an-accident/?%3fs-death-an-accident/%3fmod=e2tw
> 
> Randy
> 



---------------------------------------------------------------------
To unsubscribe, e-mail: httpclient-users-unsubscribe@hc.apache.org
For additional commands, e-mail: httpclient-users-help@hc.apache.org


Re: Trying to follow 301 redirects results in 404 error

Posted by Uncle <un...@gmail.com>.
On Mar 24, 2012, at 2:48 PM, Oleg Kalnichevski wrote:

> On Sat, 2012-03-24 at 08:50 -0400, Uncle wrote:
>> Apologies if this has been addressed, I searched the archives and was unable to find anything directly relating to this, though it seems straightforward.
>> 
>> I am trying to use httpclient to obtain the redirect URL for a url such as http://bit.ly/GGviSv, but I am getting a 404 error.  This is a "permanent" redirect (code 301).  This code:
>> 
>>        String url = "http://bit.ly/GGviSv";
>>        HttpGet httpget = new HttpGet(url);
>>        HttpContext context = new BasicHttpContext();
>>        HttpClient httpclient = new DefaultHttpClient();
>> 
>>        HttpResponse response = httpclient.execute(httpget, context);
>> 
>>        RedirectStrategy redirectStrategy = new DefaultRedirectStrategy();
>> 
>>        log.info("isRedirected = " + redirectStrategy.isRedirected(httpget, response, context));
>>        for(Header header : response.getAllHeaders())
>>            log.info("header: " + header);
>> 
>>        log.info("status = " + response.getStatusLine());
>> 
>> outputs:
>> 
>> isRedirected = false
>> header: Server: nginx
>> header: Date: Sat, 24 Mar 2012 12:38:43 GMT
>> header: Content-Type: text/html; charset=UTF-8
>> header: Transfer-Encoding: chunked
>> header: Connection: keep-alive
>> header: Vary: Cookie
>> header: X-CF-Powered-By: WP 1.2.0
>> header: X-Pingback: http://lavamagazine.com/xmlrpc.php
>> header: Expires: Wed, 11 Jan 1984 05:00:00 GMT
>> header: Last-Modified: Sat, 24 Mar 2012 12:38:43 GMT
>> header: Cache-Control: no-cache, must-revalidate, max-age=0
>> header: Pragma: no-cache
>> status = HTTP/1.1 404 Not Found
>> 
>> I expected 1) isRedirected to be true, 2) the response code to be 301, and/or 3) the destination URL to be in the headers where I could get it.  However, if I ignore the 404 and continue getting the URL:
>> 
>>        HttpUriRequest currentReq = (HttpUriRequest) context.getAttribute( ExecutionContext.HTTP_REQUEST );
>>        HttpHost currentHost = (HttpHost)  context.getAttribute(ExecutionContext.HTTP_TARGET_HOST);
>>        String currentUrl = (currentReq.getURI().isAbsolute()) ? currentReq.getURI().toString() : (currentHost.toURI() + currentReq.getURI());
>>        httpclient.getConnectionManager().shutdown();
>>        log.info("Redirected URL = " + currentUrl);
>> 
>> This does the right thing and provides me with the correct URL.  So, why the 404 error?  I am processing a large quantity of URLs and need to accurately determine which ones are errors, redirects, etc.
>> 
>> Thanks for any assistance.
>> 
>> Randy
>> 
> 
> As far as I can tell HttpClient correctly redirects to the new location,
> but the resource is simply no longer there.
> 
> [DEBUG] headers - >> GET /GGviSv HTTP/1.1
> [DEBUG] headers - >> Host: bit.ly
> [DEBUG] headers - >> Connection: Keep-Alive
> [DEBUG] headers - >> User-Agent: Apache-HttpClient/4.2-beta2-SNAPSHOT (java 1.5)
> [DEBUG] headers - << HTTP/1.1 301 Moved
> [DEBUG] headers - << Server: nginx
> [DEBUG] headers - << Date: Sat, 24 Mar 2012 18:46:44 GMT
> [DEBUG] headers - << Content-Type: text/html; charset=utf-8
> [DEBUG] headers - << Connection: keep-alive
> [DEBUG] headers - << Set-Cookie: _bit=4f6e1694-00156-016bf-3d1cf10a;domain=.bit.ly;expires=Thu Sep 20 18:46:44 2012;path=/; HttpOnly
> [DEBUG] headers - << Cache-control: private; max-age=90
> [DEBUG] headers - << Location: http://lavamagazine.com/features/video-biking-the-ironman-melbourne-run-course/#axzz1pdAzTzT2
> [DEBUG] headers - << MIME-Version: 1.0
> [DEBUG] headers - << Content-Length: 185
> [DEBUG] headers - >> GET /features/video-biking-the-ironman-melbourne-run-course/#axzz1pdAzTzT2 HTTP/1.1
> [DEBUG] headers - >> Host: lavamagazine.com
> [DEBUG] headers - >> Connection: Keep-Alive
> [DEBUG] headers - >> User-Agent: Apache-HttpClient/4.2-beta2-SNAPSHOT (java 1.5)
> [DEBUG] headers - << HTTP/1.1 404 Not Found
> [DEBUG] headers - << Server: nginx
> [DEBUG] headers - << Date: Sat, 24 Mar 2012 18:46:45 GMT
> [DEBUG] headers - << Content-Type: text/html; charset=UTF-8
> [DEBUG] headers - << Transfer-Encoding: chunked
> [DEBUG] headers - << Connection: keep-alive
> [DEBUG] headers - << Vary: Cookie
> [DEBUG] headers - << X-CF-Powered-By: WP 1.2.0
> [DEBUG] headers - << X-Pingback: http://lavamagazine.com/xmlrpc.php
> [DEBUG] headers - << Expires: Wed, 11 Jan 1984 05:00:00 GMT
> [DEBUG] headers - << Last-Modified: Sat, 24 Mar 2012 18:46:45 GMT
> [DEBUG] headers - << Cache-Control: no-cache, must-revalidate, max-age=0
> [DEBUG] headers - << Pragma: no-cache
> 
> Oleg
> 

Yet, if you hit the URL: 

http://lavamagazine.com/features/video-biking-the-ironman-melbourne-run-course/#axzz1pdAzTzT2

with your browser, the content comes up fine.  

Hitting the redirect URL with the standard Java HttpURLConnection class does not produce the 404:

        String url = "http://bit.ly/GGviSv";
        URL urlObj = new URL(url);
        HttpURLConnection urlConnection = (HttpURLConnection)urlObj.openConnection();
        urlConnection.setRequestMethod("GET");
        urlConnection.setConnectTimeout(15000);
        urlConnection.setReadTimeout(30000);
        urlConnection.connect();
        log.info("Response code = " + urlConnection.getResponseCode());
        InputStream inputStream = urlConnection.getInputStream();
        log.info("Redirected URL = " + urlConnection.getURL().toString());

This outputs:

Response code = 200
Redirected URL = http://lavamagazine.com/features/video-biking-the-ironman-melbourne-run-course/#axzz1pdAzTzT2

So HttpClient reports a 404, but HttpURLConnection reports a 200 and my browsers (Safari, Chrome, and Firefox) all hit the link fine.
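
A possible explanation for the difference, offered as an assumption
rather than something checked against the HttpURLConnection source:
java.net.URL keeps the fragment separate from the path and query, so a
client that builds its request line from the path and query alone never
sends the "#axzz1pdAzTzT2" part. The snippet below only demonstrates how
java.net.URL splits the redirect target; the class and method names are
invented for the example.

```java
import java.net.MalformedURLException;
import java.net.URL;

final class UrlParts {

    /** Path plus query of the URL, without the fragment. */
    static String fileOf(String url) throws MalformedURLException {
        return new URL(url).getFile();
    }

    /** Fragment of the URL without the leading '#', or null if absent. */
    static String refOf(String url) throws MalformedURLException {
        return new URL(url).getRef();
    }

    public static void main(String[] args) throws MalformedURLException {
        String target = "http://lavamagazine.com/features/"
                + "video-biking-the-ironman-melbourne-run-course/#axzz1pdAzTzT2";
        System.out.println("file = " + fileOf(target));
        System.out.println("ref  = " + refOf(target));
    }
}
```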

Here is another URL that is problematic:

http://on.wsj.com/GHGlfS

this produces:

org.apache.http.client.ClientProtocolException
	at org.apache.http.impl.client.AbstractHttpClient.execute(AbstractHttpClient.java:822)
	at org.apache.http.impl.client.AbstractHttpClient.execute(AbstractHttpClient.java:754)
... snip ...
Caused by: org.apache.http.ProtocolException: Invalid redirect URI: http://blogs.wsj.com/speakeasy/2012/03/22/coroner-rules-whitney-houstonâ??s-death-an-accident/?mod=e2tw
	at org.apache.http.impl.client.DefaultRedirectStrategy.createLocationURI(DefaultRedirectStrategy.java:185)
	at org.apache.http.impl.client.DefaultRedirectStrategy.getLocationURI(DefaultRedirectStrategy.java:116)
	at org.apache.http.impl.client.DefaultRedirectStrategy.getRedirect(DefaultRedirectStrategy.java:193)
	at org.apache.http.impl.client.DefaultRequestDirector.handleResponse(DefaultRequestDirector.java:1035)
	at org.apache.http.impl.client.DefaultRequestDirector.execute(DefaultRequestDirector.java:492)
	at org.apache.http.impl.client.AbstractHttpClient.execute(AbstractHttpClient.java:820)
	... 28 more
Caused by: java.net.URISyntaxException: Illegal character in path at index 72: http://blogs.wsj.com/speakeasy/2012/03/22/coroner-rules-whitney-houstonâ??s-death-an-accident/?mod=e2tw
	at java.net.URI$Parser.fail(URI.java:2809)
	at java.net.URI$Parser.checkChars(URI.java:2982)
	at java.net.URI$Parser.parseHierarchical(URI.java:3066)
	at java.net.URI$Parser.parse(URI.java:3014)
	at java.net.URI.<init>(URI.java:578)
	at org.apache.http.impl.client.DefaultRedirectStrategy.createLocationURI(DefaultRedirectStrategy.java:183)
	... 33 more

The redirected URL has a special character in it (single quote), and the client doesn't handle that.  The Java code that I pasted above produces

Response code = 200
Redirected URL = http://blogs.wsj.com/speakeasy/2012/03/22/coroner-rules-whitney-houston%e2%80%99s-death-an-accident/?%3fs-death-an-accident/%3fmod=e2tw

Randy




Re: Trying to follow 301 redirects results in 404 error

Posted by Oleg Kalnichevski <ol...@apache.org>.
On Sat, 2012-03-24 at 08:50 -0400, Uncle wrote:
> Apologies if this has been addressed, I searched the archives and was unable to find anything directly relating to this, though it seems straightforward.
> 
> I am trying to use httpclient to obtain the redirect URL for a url such as http://bit.ly/GGviSv, but I am getting a 404 error.  This is a "permanent" redirect (code 301).  This code:
> 
>         String url = "http://bit.ly/GGviSv";
>         HttpGet httpget = new HttpGet(url);
>         HttpContext context = new BasicHttpContext();
>         HttpClient httpclient = new DefaultHttpClient();
> 
>         HttpResponse response = httpclient.execute(httpget, context);
> 
>         RedirectStrategy redirectStrategy = new DefaultRedirectStrategy();
> 
>         log.info("isRedirected = " + redirectStrategy.isRedirected(httpget, response, context));
>         for(Header header : response.getAllHeaders())
>             log.info("header: " + header);
> 
>         log.info("status = " + response.getStatusLine());
> 
> outputs:
> 
> isRedirected = false
> header: Server: nginx
> header: Date: Sat, 24 Mar 2012 12:38:43 GMT
> header: Content-Type: text/html; charset=UTF-8
> header: Transfer-Encoding: chunked
> header: Connection: keep-alive
> header: Vary: Cookie
> header: X-CF-Powered-By: WP 1.2.0
> header: X-Pingback: http://lavamagazine.com/xmlrpc.php
> header: Expires: Wed, 11 Jan 1984 05:00:00 GMT
> header: Last-Modified: Sat, 24 Mar 2012 12:38:43 GMT
> header: Cache-Control: no-cache, must-revalidate, max-age=0
> header: Pragma: no-cache
> status = HTTP/1.1 404 Not Found
> 
> I expected 1) isRedirected to be true, 2) the response code to be 301, and/or 3) the destination URL to be in the headers where I could get it.  However, if I ignore the 404 and continue getting the URL:
> 
>         HttpUriRequest currentReq = (HttpUriRequest) context.getAttribute( ExecutionContext.HTTP_REQUEST );
>         HttpHost currentHost = (HttpHost)  context.getAttribute(ExecutionContext.HTTP_TARGET_HOST);
>         String currentUrl = (currentReq.getURI().isAbsolute()) ? currentReq.getURI().toString() : (currentHost.toURI() + currentReq.getURI());
>         httpclient.getConnectionManager().shutdown();
>         log.info("Redirected URL = " + currentUrl);
> 
> This does the right thing and provides me with the correct URL.  So, why the 404 error?  I am processing a large quantity of URLs and need to accurately determine which ones are errors, redirects, etc.
> 
> Thanks for any assistance.
> 
> Randy
> 

As far as I can tell HttpClient correctly redirects to the new location,
but the resource is simply no longer there.

[DEBUG] headers - >> GET /GGviSv HTTP/1.1
[DEBUG] headers - >> Host: bit.ly
[DEBUG] headers - >> Connection: Keep-Alive
[DEBUG] headers - >> User-Agent: Apache-HttpClient/4.2-beta2-SNAPSHOT (java 1.5)
[DEBUG] headers - << HTTP/1.1 301 Moved
[DEBUG] headers - << Server: nginx
[DEBUG] headers - << Date: Sat, 24 Mar 2012 18:46:44 GMT
[DEBUG] headers - << Content-Type: text/html; charset=utf-8
[DEBUG] headers - << Connection: keep-alive
[DEBUG] headers - << Set-Cookie: _bit=4f6e1694-00156-016bf-3d1cf10a;domain=.bit.ly;expires=Thu Sep 20 18:46:44 2012;path=/; HttpOnly
[DEBUG] headers - << Cache-control: private; max-age=90
[DEBUG] headers - << Location: http://lavamagazine.com/features/video-biking-the-ironman-melbourne-run-course/#axzz1pdAzTzT2
[DEBUG] headers - << MIME-Version: 1.0
[DEBUG] headers - << Content-Length: 185
[DEBUG] headers - >> GET /features/video-biking-the-ironman-melbourne-run-course/#axzz1pdAzTzT2 HTTP/1.1
[DEBUG] headers - >> Host: lavamagazine.com
[DEBUG] headers - >> Connection: Keep-Alive
[DEBUG] headers - >> User-Agent: Apache-HttpClient/4.2-beta2-SNAPSHOT (java 1.5)
[DEBUG] headers - << HTTP/1.1 404 Not Found
[DEBUG] headers - << Server: nginx
[DEBUG] headers - << Date: Sat, 24 Mar 2012 18:46:45 GMT
[DEBUG] headers - << Content-Type: text/html; charset=UTF-8
[DEBUG] headers - << Transfer-Encoding: chunked
[DEBUG] headers - << Connection: keep-alive
[DEBUG] headers - << Vary: Cookie
[DEBUG] headers - << X-CF-Powered-By: WP 1.2.0
[DEBUG] headers - << X-Pingback: http://lavamagazine.com/xmlrpc.php
[DEBUG] headers - << Expires: Wed, 11 Jan 1984 05:00:00 GMT
[DEBUG] headers - << Last-Modified: Sat, 24 Mar 2012 18:46:45 GMT
[DEBUG] headers - << Cache-Control: no-cache, must-revalidate, max-age=0
[DEBUG] headers - << Pragma: no-cache
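
To look at each hop separately rather than only the final response, one
option is to turn off automatic redirect following and inspect the
status code and Location header directly. A minimal sketch using the
JDK's HttpURLConnection follows; the class, enum, and method names are
invented for the example, the status grouping is deliberately coarse,
and the timeouts are arbitrary.

```java
import java.io.IOException;
import java.net.HttpURLConnection;
import java.net.URL;

final class RedirectProbe {

    enum Kind { OK, REDIRECT, ERROR, OTHER }

    /** Coarse classification of an HTTP status code. */
    static Kind classify(int status) {
        if (status >= 200 && status < 300) return Kind.OK;
        if (status == 301 || status == 302 || status == 303
                || status == 307 || status == 308) return Kind.REDIRECT;
        if (status >= 400 && status < 600) return Kind.ERROR;
        return Kind.OTHER;
    }

    /** Issues one GET without following redirects and prints what came back. */
    static void probe(String url) throws IOException {
        HttpURLConnection conn = (HttpURLConnection) new URL(url).openConnection();
        conn.setInstanceFollowRedirects(false);
        conn.setRequestMethod("GET");
        conn.setConnectTimeout(15000);
        conn.setReadTimeout(30000);
        try {
            int status = conn.getResponseCode();
            System.out.println(status + " " + classify(status)
                    + " Location=" + conn.getHeaderField("Location"));
        } finally {
            conn.disconnect();
        }
    }

    public static void main(String[] args) throws IOException {
        probe(args.length > 0 ? args[0] : "http://bit.ly/GGviSv");
    }
}
```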

Oleg


