Posted to dev@nutch.apache.org by "Sebastian Nagel (JIRA)" <ji...@apache.org> on 2018/06/11 13:15:00 UTC

[jira] [Commented] (NUTCH-2557) protocol-http fails to follow redirections when an HTTP response body is invalid

    [ https://issues.apache.org/jira/browse/NUTCH-2557?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16508019#comment-16508019 ] 

Sebastian Nagel commented on NUTCH-2557:
----------------------------------------

Hi [~omkar20895], hi [~gbouchar], [PR #347|https://github.com/apache/nutch/pull/347] contains Gerard's solution for this issue, see [commit d163512|https://github.com/apache/nutch/pull/347/commits/d163512d5d2e345dfe6c816a29dc93a108dfd254]. It does not skip reading the payload content for redirects and other non-200 responses. But if reading the payload throws an exception, the exception is caught and ignored. Since this only affects responses which would otherwise fail, I've decided not to introduce a new property. Let me know whether this is ok. Thanks!
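
For illustration, the behaviour described above could be sketched roughly as follows. This is a simplified stand-in and not the code from the PR: the class and method names (RedirectBodySketch, decodeGzip, readContent) are hypothetical, a gzip decoder stands in for whatever body parsing may fail, and the sketch assumes that the exception is only ignored for non-200 responses.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.zip.GZIPInputStream;

// Hypothetical sketch, not the actual protocol-http code: all names are illustrative.
class RedirectBodySketch {

    /** Decodes a gzip-encoded body; throws an IOException if the data is not valid gzip. */
    static byte[] decodeGzip(byte[] raw) throws IOException {
        try (InputStream in = new GZIPInputStream(new ByteArrayInputStream(raw))) {
            return in.readAllBytes(); // requires Java 9 or later
        }
    }

    /**
     * Always tries to read the body. Assumption of this sketch: for any status
     * other than 200 a failure is swallowed and null is returned, so the status
     * code and headers (e.g. Location) stay usable. For 200 the failure propagates.
     */
    static byte[] readContent(int statusCode, byte[] rawBody) throws IOException {
        try {
            return decodeGzip(rawBody);
        } catch (IOException e) {
            if (statusCode == 200) {
                throw e; // a broken body on 200 OK is still an error
            }
            return null; // non-200: empty content is acceptable
        }
    }

    public static void main(String[] args) throws IOException {
        byte[] invalidGzip = "not gzip".getBytes(StandardCharsets.UTF_8);
        byte[] content = readContent(301, invalidGzip);
        System.out.println("content is null: " + (content == null));
    }
}
```

With this shape, a redirect with a broken body yields null content instead of a failed fetch, while a 200 response with a broken body still fails as before.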

> protocol-http fails to follow redirections when an HTTP response body is invalid
> --------------------------------------------------------------------------------
>
>                 Key: NUTCH-2557
>                 URL: https://issues.apache.org/jira/browse/NUTCH-2557
>             Project: Nutch
>          Issue Type: Sub-task
>    Affects Versions: 1.14
>            Reporter: Gerard Bouchar
>            Priority: Major
>             Fix For: 1.15
>
>
> If a server sends a redirection (3XX status code, with a Location header), protocol-http tries to parse the HTTP response body anyway. Thus, if an error occurs while decoding the body, the redirection is not followed and the information is lost. Browsers follow the redirection and close the socket as soon as they can.
>  * Example: this page is a redirection to its https version, with an HTTP body containing invalid gzip-encoded content. Browsers follow the redirection, but Nutch throws an error:
>  ** [http://www.webarcelona.net/es/blog?page=2]
>  
> The HttpResponse::getContent method can already return null. I think it should at least return null when parsing the HTTP response body fails.
> Ideally, we would adopt the same behavior as browsers and not even try to parse the body when the headers indicate a redirection.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)