Posted to dev@nutch.apache.org by "Markus Jelsma (Created) (JIRA)" <ji...@apache.org> on 2012/04/20 11:40:40 UTC
[jira] [Created] (NUTCH-1342) Read time out protocol-http
Read time out protocol-http
---------------------------
Key: NUTCH-1342
URL: https://issues.apache.org/jira/browse/NUTCH-1342
Project: Nutch
Issue Type: Bug
Components: fetcher
Affects Versions: 1.4, 1.5
Reporter: Markus Jelsma
Priority: Critical
Fix For: 1.6
For some reason some URLs always time out with protocol-http but not with protocol-httpclient. The stack trace is always the same:
{code}
2012-04-20 11:25:44,275 ERROR http.Http - Failed to get protocol output
java.net.SocketTimeoutException: Read timed out
at java.net.SocketInputStream.socketRead0(Native Method)
at java.net.SocketInputStream.read(SocketInputStream.java:129)
at java.io.BufferedInputStream.read1(BufferedInputStream.java:256)
at java.io.BufferedInputStream.read(BufferedInputStream.java:317)
at java.io.FilterInputStream.read(FilterInputStream.java:116)
at java.io.PushbackInputStream.read(PushbackInputStream.java:169)
at java.io.FilterInputStream.read(FilterInputStream.java:90)
at org.apache.nutch.protocol.http.HttpResponse.readPlainContent(HttpResponse.java:228)
at org.apache.nutch.protocol.http.HttpResponse.<init>(HttpResponse.java:157)
at org.apache.nutch.protocol.http.Http.getResponse(Http.java:64)
at org.apache.nutch.protocol.http.api.HttpBase.getProtocolOutput(HttpBase.java:138)
{code}
Some example URLs:
* 404 http://www.fcgroningen.nl/tribunenamen/stemmen/
* 301 http://shop.fcgroningen.nl/aanbieding
--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Assigned] (NUTCH-1342) Read time out protocol-http
Posted by "Markus Jelsma (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/NUTCH-1342?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Markus Jelsma reassigned NUTCH-1342:
------------------------------------
Assignee: Markus Jelsma
[jira] [Commented] (NUTCH-1342) Read time out protocol-http
Posted by "Markus Jelsma (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/NUTCH-1342?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13398354#comment-13398354 ]
Markus Jelsma commented on NUTCH-1342:
--------------------------------------
Hi Ferdy,
No, I have no clue as to why httpclient is doing the correct thing. I'll check the patch again and catch a SocketTimeoutException instead of the IOException it catches now. The only problem is that right now the example URLs do not throw a SocketTimeoutException, so I cannot test it!
I'll see if I can find another slow website :)
[jira] [Commented] (NUTCH-1342) Read time out protocol-http
Posted by "Ferdy Galema (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/NUTCH-1342?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13294429#comment-13294429 ]
Ferdy Galema commented on NUTCH-1342:
-------------------------------------
Do you have any clue as to why protocol-httpclient has a different behaviour?
Also, two suggestions for your patch:
Perhaps you could fine-tune the mechanism by allowing a configurable number of timeouts before definitively failing. Something like:
if (++timeoutRetries>this.allowedNumberOfTimeoutRetries) throw e; //rethrow
Secondly, could you specifically catch SocketTimeoutException? (I'm not sure whether there are other IOExceptions that should not be caught in any case.)
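To make the two suggestions concrete, here is a minimal sketch of one way they could be combined. It is not taken from the attached patch: the class TimeoutRetry, the interface ReadAttempt and the method readWithRetries are hypothetical names, and only allowedNumberOfTimeoutRetries and timeoutRetries come from the snippet above. It assumes the whole read attempt can simply be repeated after a timeout.

```java
import java.io.IOException;
import java.net.SocketTimeoutException;

/** Hypothetical illustration; not code from Nutch or from the attached patch. */
final class TimeoutRetry {

  /** One read attempt; a stand-in for whatever code performs the actual read. */
  interface ReadAttempt<T> {
    T run() throws IOException;
  }

  /**
   * Runs the attempt and repeats it after a SocketTimeoutException until the
   * allowed number of timeout retries is exceeded, then rethrows the timeout.
   */
  static <T> T readWithRetries(ReadAttempt<T> attempt, int allowedNumberOfTimeoutRetries)
      throws IOException {
    int timeoutRetries = 0;
    while (true) {
      try {
        return attempt.run();
      } catch (SocketTimeoutException e) {
        // Only timeouts are caught here; any other IOException propagates unchanged.
        if (++timeoutRetries > allowedNumberOfTimeoutRetries) {
          throw e; // rethrow
        }
      }
    }
  }
}
```

With allowedNumberOfTimeoutRetries set to 0 this behaves like the current code path: the first timeout is rethrown immediately.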
[jira] [Commented] (NUTCH-1342) Read time out protocol-http
Posted by "Markus Jelsma (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/NUTCH-1342?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13291222#comment-13291222 ]
Markus Jelsma commented on NUTCH-1342:
--------------------------------------
Unless there are objections or improvements, I'll commit this one in the next few days.
[jira] [Updated] (NUTCH-1342) Read time out protocol-http
Posted by "Markus Jelsma (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/NUTCH-1342?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Markus Jelsma updated NUTCH-1342:
---------------------------------
Attachment: NUTCH-1342-1.6-1.patch
Patch for 1.6. This patch changes the behavior when a read timeout occurs. Currently the SocketTimeoutException is propagated to higher-level code without checking for edge cases. This patch assumes that if bytes were received and no Content-Length header was specified, the data read so far is acceptable.
This change definitely fixes read timeout problems caused by badly configured servers, but it still relies on the connection timing out.
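As an illustration of the rule described above, the following is a minimal sketch and not the contents of NUTCH-1342-1.6-1.patch. The class LenientBodyReader and the method readBody are hypothetical names, and a contentLength of -1 is assumed to mean that the server sent no Content-Length header.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.net.SocketTimeoutException;

/** Hypothetical illustration; not the code from the attached patch. */
final class LenientBodyReader {

  /**
   * Reads a response body from the stream.
   *
   * @param contentLength value of the Content-Length header, or -1 if absent (assumption)
   */
  static byte[] readBody(InputStream in, int contentLength) throws IOException {
    ByteArrayOutputStream out = new ByteArrayOutputStream();
    byte[] buffer = new byte[4096];
    try {
      int n;
      while ((n = in.read(buffer)) != -1) {
        out.write(buffer, 0, n);
        // Stop once the announced length has been reached.
        if (contentLength >= 0 && out.size() >= contentLength) {
          break;
        }
      }
    } catch (SocketTimeoutException e) {
      // Propagate the timeout if a length was announced or nothing was received.
      if (contentLength >= 0 || out.size() == 0) {
        throw e;
      }
      // Otherwise: bytes arrived and no Content-Length was given, so the
      // data read so far is treated as the body.
    }
    return out.toByteArray();
  }
}
```

As noted above, the caller still waits for the full socket timeout before the body is returned in the lenient case.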
Please comment!
[jira] [Updated] (NUTCH-1342) Read time out protocol-http
Posted by "Markus Jelsma (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/NUTCH-1342?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Markus Jelsma updated NUTCH-1342:
---------------------------------
Patch Info: Patch Available