Posted to dev@nutch.apache.org by "Markus Jelsma (Created) (JIRA)" <ji...@apache.org> on 2012/04/20 11:40:40 UTC

[jira] [Created] (NUTCH-1342) Read time out protocol-http

Read time out protocol-http
---------------------------

                 Key: NUTCH-1342
                 URL: https://issues.apache.org/jira/browse/NUTCH-1342
             Project: Nutch
          Issue Type: Bug
          Components: fetcher
    Affects Versions: 1.4, 1.5
            Reporter: Markus Jelsma
            Priority: Critical
             Fix For: 1.6


For some reason some URLs always time out with protocol-http but not with protocol-httpclient. The stack trace is always the same:

{code}
2012-04-20 11:25:44,275 ERROR http.Http - Failed to get protocol output
java.net.SocketTimeoutException: Read timed out
        at java.net.SocketInputStream.socketRead0(Native Method)
        at java.net.SocketInputStream.read(SocketInputStream.java:129)
        at java.io.BufferedInputStream.read1(BufferedInputStream.java:256)
        at java.io.BufferedInputStream.read(BufferedInputStream.java:317)
        at java.io.FilterInputStream.read(FilterInputStream.java:116)
        at java.io.PushbackInputStream.read(PushbackInputStream.java:169)
        at java.io.FilterInputStream.read(FilterInputStream.java:90)
        at org.apache.nutch.protocol.http.HttpResponse.readPlainContent(HttpResponse.java:228)
        at org.apache.nutch.protocol.http.HttpResponse.<init>(HttpResponse.java:157)
        at org.apache.nutch.protocol.http.Http.getResponse(Http.java:64)
        at org.apache.nutch.protocol.http.api.HttpBase.getProtocolOutput(HttpBase.java:138)
{code}

Some example URLs:
* 404 http://www.fcgroningen.nl/tribunenamen/stemmen/
* 301 http://shop.fcgroningen.nl/aanbieding


--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

[jira] [Assigned] (NUTCH-1342) Read time out protocol-http

Posted by "Markus Jelsma (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/NUTCH-1342?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Markus Jelsma reassigned NUTCH-1342:
------------------------------------

    Assignee: Markus Jelsma
    

        

[jira] [Commented] (NUTCH-1342) Read time out protocol-http

Posted by "Markus Jelsma (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/NUTCH-1342?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13398354#comment-13398354 ] 

Markus Jelsma commented on NUTCH-1342:
--------------------------------------

Hi Ferdy,

No, I have no clue as to why httpclient is doing the correct thing. I'll check the patch again and catch a SocketTimeoutException instead of the IOException it catches now. The only problem is that right now the example URLs do not throw a SocketTimeoutException, so I cannot test it!


I'll see if I can find another slow website :)
                

        

[jira] [Commented] (NUTCH-1342) Read time out protocol-http

Posted by "Ferdy Galema (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/NUTCH-1342?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13294429#comment-13294429 ] 

Ferdy Galema commented on NUTCH-1342:
-------------------------------------

Do you have any clue as to why protocol-httpclient has a different behaviour?

Also, two suggestions for your patch:

Perhaps you could fine-tune the mechanism by allowing a configurable number of timeouts before definitively failing. Something like:
if (++timeoutRetries > this.allowedNumberOfTimeoutRetries) throw e; // rethrow

Secondly, could you specifically catch SocketTimeoutException? (I'm not sure whether there are other IOExceptions that should not be caught in any case.)
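
To make the two suggestions concrete, here is a minimal sketch of a read loop that catches only SocketTimeoutException and rethrows it once a configurable number of timeouts has been exceeded. It is not taken from the attached patch: the class name TimeoutRetryReader and the readAll method are hypothetical, and allowedNumberOfTimeoutRetries is only the name used in the one-liner above, not an existing Nutch property.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.net.SocketTimeoutException;

/** Hypothetical helper, for illustration only; not part of Nutch. */
public class TimeoutRetryReader {

  // Assumed setting: how many read timeouts are tolerated before giving up.
  private final int allowedNumberOfTimeoutRetries;

  public TimeoutRetryReader(int allowedNumberOfTimeoutRetries) {
    this.allowedNumberOfTimeoutRetries = allowedNumberOfTimeoutRetries;
  }

  /** Reads until end of stream, retrying reads that time out. */
  public byte[] readAll(InputStream in) throws IOException {
    ByteArrayOutputStream out = new ByteArrayOutputStream();
    byte[] buffer = new byte[4096];
    int timeoutRetries = 0; // cumulative over the whole response, never reset

    while (true) {
      int n;
      try {
        n = in.read(buffer);
      } catch (SocketTimeoutException e) {
        // Only timeouts are retried; every other IOException propagates.
        if (++timeoutRetries > allowedNumberOfTimeoutRetries) {
          throw e; // rethrow
        }
        continue;
      }
      if (n == -1) {
        break; // end of stream
      }
      out.write(buffer, 0, n);
    }
    return out.toByteArray();
  }
}
```

Whether the counter should be reset after a successful read is a design choice this sketch leaves open; here it simply counts every timeout seen for one response.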
                

        

[jira] [Commented] (NUTCH-1342) Read time out protocol-http

Posted by "Markus Jelsma (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/NUTCH-1342?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13291222#comment-13291222 ] 

Markus Jelsma commented on NUTCH-1342:
--------------------------------------

Unless there are objections or improvements, I'll commit this one in the next few days.
                

        

[jira] [Updated] (NUTCH-1342) Read time out protocol-http

Posted by "Markus Jelsma (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/NUTCH-1342?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Markus Jelsma updated NUTCH-1342:
---------------------------------

    Attachment: NUTCH-1342-1.6-1.patch

Patch for 1.6. This patch changes the behavior when a read timeout occurs. Currently the SocketTimeoutException is propagated to higher-level code without checking for edge cases. This patch assumes that if bytes were received and no Content-Length header was specified, the data read so far is all right.

This change definitely fixes read timeout problems caused by badly configured servers, but it still relies on the connection timing out.
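
The rule described above can be sketched roughly as follows. This is an illustration of the idea only, not the contents of NUTCH-1342-1.6-1.patch: the class LenientContentReader, the method readBody and the convention of passing -1 for a missing Content-Length header are all assumptions made for the example.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.net.SocketTimeoutException;

/** Hypothetical illustration of the edge-case check; not the actual patch. */
public class LenientContentReader {

  /**
   * Reads a response body until end of stream.
   *
   * @param in            stream positioned at the start of the body
   * @param contentLength value of the Content-Length header, or -1 if the
   *                      server did not send one (assumed convention)
   */
  public static byte[] readBody(InputStream in, int contentLength)
      throws IOException {
    ByteArrayOutputStream out = new ByteArrayOutputStream();
    byte[] buffer = new byte[4096];
    try {
      int n;
      while ((n = in.read(buffer)) != -1) {
        out.write(buffer, 0, n);
      }
    } catch (SocketTimeoutException e) {
      boolean bytesReceived = out.size() > 0;
      boolean noContentLength = contentLength < 0;
      // Accept what was read only if some bytes arrived and the server never
      // announced a length; in every other case keep the old behavior.
      if (!(bytesReceived && noContentLength)) {
        throw e;
      }
    }
    return out.toByteArray();
  }
}
```

As noted above, the caller still has to wait for the full socket timeout before the data is accepted.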

Please comment!
                

        

[jira] [Updated] (NUTCH-1342) Read time out protocol-http

Posted by "Markus Jelsma (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/NUTCH-1342?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Markus Jelsma updated NUTCH-1342:
---------------------------------

    Patch Info: Patch Available
    