Posted to user@nutch.apache.org by Nguyen Ngoc Giang <gi...@gmail.com> on 2005/12/21 09:00:54 UTC

Read Time out problem

  Hi folks,

  When I crawl, I get many Read Timeout errors. It seems that this error
is not handled as properly as http.max.delays. I would like to handle it
in the same manner as http.max.delays, that is, to retry the page that hit
the error. Can anyone suggest a way? And can anyone explain to me why this
error happens?

  PS: I'm running Nutch on Windows XP SP2.

java.net.SocketTimeoutException: Read timed out
        at java.net.SocketInputStream.socketRead0(Native Method)
        at java.net.SocketInputStream.read(SocketInputStream.java:129)
        at java.io.BufferedInputStream.fill(BufferedInputStream.java:218)
        at java.io.BufferedInputStream.read(BufferedInputStream.java:235)
        at org.apache.commons.httpclient.HttpParser.readRawLine(HttpParser.java:77)
        at org.apache.commons.httpclient.HttpParser.readLine(HttpParser.java:105)
        at org.apache.commons.httpclient.HttpConnection.readLine(HttpConnection.java:1110)
        at org.apache.commons.httpclient.MultiThreadedHttpConnectionManager$HttpConnectionAdapter.readLine(MultiThreadedHttpConnectionManager.java:1391)
        at org.apache.commons.httpclient.HttpMethodBase.readStatusLine(HttpMethodBase.java:1824)
        at org.apache.commons.httpclient.HttpMethodBase.readResponse(HttpMethodBase.java:1584)
        at org.apache.commons.httpclient.HttpMethodBase.execute(HttpMethodBase.java:995)
        at org.apache.commons.httpclient.HttpMethodDirector.executeWithRetry(HttpMethodDirector.java:393)
        at org.apache.commons.httpclient.HttpMethodDirector.executeMethod(HttpMethodDirector.java:168)
        at org.apache.commons.httpclient.HttpClient.executeMethod(HttpClient.java:393)
        at org.apache.commons.httpclient.HttpClient.executeMethod(HttpClient.java:324)
        at org.apache.nutch.protocol.httpclient.HttpResponse.<init>(HttpResponse.java:102)
        at org.apache.nutch.protocol.httpclient.Http.getProtocolOutput(Http.java:204)
        at org.apache.nutch.fetcher.Fetcher$FetcherThread.run(Fetcher.java:151)

Re: Read Time out problem

Posted by Nguyen Ngoc Giang <gi...@gmail.com>.
  Thanks, Stefan, for replying. My network works fine through the
proxy/firewall (I'm still crawling daily). The connection may be a bit slow,
and I think that may be causing the problem.

  Anyway, I believe that the way this error is handled is still somewhat
problematic. At line 204 of org.apache.nutch.protocol.httpclient.Http.java,
we create an org.apache.nutch.protocol.httpclient.HttpResponse object.
However, inside the HttpResponse constructor, it seems that we catch only
org.apache.commons.httpclient.ProtocolException, not
SocketTimeoutException, so the program cannot continue.
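
  To illustrate the kind of handling I have in mind, here is a minimal
sketch. It is not the Nutch code: FetchOutcomeSketch, Status and attempt()
are hypothetical names, and the Callable stands in for whatever actually
performs the HTTP request. The idea is only that a read timeout is caught
separately and reported as "retry later" instead of propagating.

```java
import java.io.IOException;
import java.net.SocketTimeoutException;
import java.util.concurrent.Callable;

// Hypothetical sketch, not the Nutch API: maps exceptions from one fetch
// attempt to a status so the caller can decide whether to retry.
class FetchOutcomeSketch {

    enum Status { SUCCESS, RETRY, FAILED }

    /** Runs one fetch attempt and turns exceptions into a status. */
    static Status attempt(Callable<byte[]> fetch) {
        try {
            fetch.call();
            return Status.SUCCESS;
        } catch (SocketTimeoutException e) {
            // Assumed to be transient (slow server or network): retry later.
            return Status.RETRY;
        } catch (IOException e) {
            // Other I/O problems are treated as hard failures in this sketch.
            return Status.FAILED;
        } catch (Exception e) {
            return Status.FAILED;
        }
    }

    public static void main(String[] args) {
        // Simulate a fetch that hits the same exception as in the trace.
        Status s = attempt(() -> {
            throw new SocketTimeoutException("Read timed out");
        });
        System.out.println(s);
    }
}
```

  SocketTimeoutException is a subclass of IOException, so its catch clause
has to come before the more general one.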

  Thanks in advance for any suggestions.

  Regards,
   Giang


On 12/21/05, Stefan Groschupf <sg...@media-style.com> wrote:
>
> Sounds like a network problem (too slow?). Any proxy / firewall in use?
> Can you manually check whether you can reach these URLs from this box?
> Also try to increase http.timeout in nutch-default.xml / nutch-site.xml.
>
> I'm not sure, but I think these kinds of failed URLs are also retried
> another time (db.fetch.retry.max).
>
> HTH
> Stefan

Re: Read Time out problem

Posted by Stefan Groschupf <sg...@media-style.com>.
Sounds like a network problem (too slow?). Any proxy / firewall in use?
Can you manually check whether you can reach these URLs from this box?
Also try to increase http.timeout in nutch-default.xml / nutch-site.xml.
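
As background on why the error happens: java.net throws
SocketTimeoutException when a blocking read waits longer than the socket's
read timeout without receiving any data. The sketch below reproduces that
with plain sockets. ReadTimeoutDemo and readTimesOut() are made-up names,
and I am assuming (not claiming) that http.timeout ends up as this kind of
socket read timeout inside the HTTP client.

```java
import java.io.IOException;
import java.io.InputStream;
import java.net.InetAddress;
import java.net.ServerSocket;
import java.net.Socket;
import java.net.SocketTimeoutException;

// Hypothetical demo: a client reads from a server that never sends anything.
class ReadTimeoutDemo {

    /** Returns true if the read gave up with a timeout after timeoutMillis. */
    static boolean readTimesOut(int timeoutMillis) throws IOException {
        InetAddress local = InetAddress.getLoopbackAddress();
        // Port 0 lets the OS pick a free port. The server never calls accept()
        // and never writes; the OS still completes the TCP connection.
        try (ServerSocket silentServer = new ServerSocket(0, 1, local);
             Socket client = new Socket(local, silentServer.getLocalPort())) {
            // Maximum time a single read() may block waiting for data.
            client.setSoTimeout(timeoutMillis);
            InputStream in = client.getInputStream();
            try {
                in.read(); // blocks because no data ever arrives
                return false;
            } catch (SocketTimeoutException e) {
                return true;
            }
        }
    }

    public static void main(String[] args) throws IOException {
        System.out.println("timed out: " + readTimesOut(200));
    }
}
```

A larger timeout only helps if the server does eventually answer; a host
that never responds will still time out, just later.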

I'm not sure, but I think these kinds of failed URLs are also retried
another time (db.fetch.retry.max).
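
The general idea of a bounded retry looks roughly like the sketch below.
This is not how Nutch implements it: BoundedRetrySketch and
fetchWithRetry() are invented for illustration, maxRetries only plays the
role of a limit such as db.fetch.retry.max, and I would expect Nutch to
retry on a later fetch round rather than in a tight loop like this.

```java
import java.net.SocketTimeoutException;
import java.util.concurrent.Callable;

// Hypothetical sketch: retry a fetch on read timeouts, up to a fixed limit.
class BoundedRetrySketch {

    /**
     * Calls fetch until it succeeds or has timed out maxRetries times.
     * Rethrows the last timeout if every attempt timed out.
     */
    static <T> T fetchWithRetry(Callable<T> fetch, int maxRetries) throws Exception {
        if (maxRetries < 1) {
            throw new IllegalArgumentException("maxRetries must be at least 1");
        }
        SocketTimeoutException last = null;
        for (int attempt = 1; attempt <= maxRetries; attempt++) {
            try {
                return fetch.call();
            } catch (SocketTimeoutException e) {
                last = e; // remember the timeout and try again
            }
        }
        throw last;
    }

    public static void main(String[] args) throws Exception {
        int[] calls = {0};
        // Simulated fetch: times out twice, then succeeds on the third call.
        String page = fetchWithRetry(() -> {
            if (++calls[0] < 3) {
                throw new SocketTimeoutException("Read timed out");
            }
            return "page body";
        }, 3);
        System.out.println(page + " after " + calls[0] + " attempts");
    }
}
```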

HTH
Stefan
