Posted to user@nutch.apache.org by Nguyen Ngoc Giang <gi...@gmail.com> on 2005/12/21 09:00:54 UTC
Read Time out problem
Hi folks,
When I try crawling, I get many Read Timeout errors. This error does not seem
to be caught and handled the way http.max.delays failures are. I would like to
handle it in the same manner as http.max.delays, that is, retry the page that
produced the error. Can anyone suggest a way? And can anyone explain to me why
this error happens?
PS: I'm running Nutch on Windows XP SP2.
java.net.SocketTimeoutException: Read timed out
        at java.net.SocketInputStream.socketRead0(Native Method)
        at java.net.SocketInputStream.read(SocketInputStream.java:129)
        at java.io.BufferedInputStream.fill(BufferedInputStream.java:218)
        at java.io.BufferedInputStream.read(BufferedInputStream.java:235)
        at org.apache.commons.httpclient.HttpParser.readRawLine(HttpParser.java:77)
        at org.apache.commons.httpclient.HttpParser.readLine(HttpParser.java:105)
        at org.apache.commons.httpclient.HttpConnection.readLine(HttpConnection.java:1110)
        at org.apache.commons.httpclient.MultiThreadedHttpConnectionManager$HttpConnectionAdapter.readLine(MultiThreadedHttpConnectionManager.java:1391)
        at org.apache.commons.httpclient.HttpMethodBase.readStatusLine(HttpMethodBase.java:1824)
        at org.apache.commons.httpclient.HttpMethodBase.readResponse(HttpMethodBase.java:1584)
        at org.apache.commons.httpclient.HttpMethodBase.execute(HttpMethodBase.java:995)
        at org.apache.commons.httpclient.HttpMethodDirector.executeWithRetry(HttpMethodDirector.java:393)
        at org.apache.commons.httpclient.HttpMethodDirector.executeMethod(HttpMethodDirector.java:168)
        at org.apache.commons.httpclient.HttpClient.executeMethod(HttpClient.java:393)
        at org.apache.commons.httpclient.HttpClient.executeMethod(HttpClient.java:324)
        at org.apache.nutch.protocol.httpclient.HttpResponse.<init>(HttpResponse.java:102)
        at org.apache.nutch.protocol.httpclient.Http.getProtocolOutput(Http.java:204)
        at org.apache.nutch.fetcher.Fetcher$FetcherThread.run(Fetcher.java:151)
Re: Read Time out problem
Posted by Nguyen Ngoc Giang <gi...@gmail.com>.
Thanks, Stefan, for your reply. My network is fine with regard to proxy/firewall
(I'm still crawling daily). The connection may be a bit slow, and I think that
may be causing the problem.
Anyway, I believe the way this error is handled is still somewhat problematic.
At line 204 of org.apache.nutch.protocol.httpclient.Http.java, we create an
org.apache.nutch.protocol.httpclient.HttpResponse object. However, inside the
constructor of HttpResponse, it seems that we catch only
org.apache.commons.httpclient.ProtocolException, not SocketTimeoutException,
so the program cannot continue.
Thanks in advance for any suggestions.
Regards,
Giang
On 12/21/05, Stefan Groschupf <sg...@media-style.com> wrote:
>
> Sounds like a network problem (too slow?). Any proxy / firewall in use?
> Can you manually check whether you can reach these URLs from that box?
> Also try increasing http.timeout in nutch-default.xml / nutch-site.xml.
>
> I'm not sure, but I think these kinds of failed URLs are also retried
> another time (db.fetch.retry.max).
>
> HTH
> Stefan
>
> On 21.12.2005 at 09:00, Nguyen Ngoc Giang wrote:
>
> > Hi folks,
> >
> > When I try crawling, I get many Read Timeout errors. This error does
> > not seem to be caught and handled the way http.max.delays failures are.
> > I would like to handle it in the same manner as http.max.delays, that
> > is, retry the page that produced the error. Can anyone suggest a way?
> > And can anyone explain to me why this error happens?
> >
> > PS: I'm running Nutch on Windows XP SP2.
> >
> > [stack trace snipped; identical to the trace in the original message]
>
>
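The change Giang is asking for could be sketched as follows. This is NOT the
actual Nutch or HttpResponse code; `Fetcher` and `fetchWithRetry` are invented
names for a minimal stand-in that shows the idea of catching
SocketTimeoutException and retrying instead of letting it abort the fetch:

```java
import java.net.SocketTimeoutException;

// Hypothetical sketch: retry a fetch when the socket read times out,
// rather than letting SocketTimeoutException propagate uncaught.
public class RetryFetch {
    interface Fetcher {
        // Stand-in for the HttpClient call made in HttpResponse's constructor.
        String fetch(String url) throws SocketTimeoutException;
    }

    static String fetchWithRetry(Fetcher f, String url, int maxRetries)
            throws SocketTimeoutException {
        SocketTimeoutException last = null;
        for (int attempt = 0; attempt <= maxRetries; attempt++) {
            try {
                return f.fetch(url);
            } catch (SocketTimeoutException e) {
                last = e;  // remember the failure and try again
            }
        }
        throw last;  // give up after maxRetries + 1 attempts
    }

    public static void main(String[] args) throws Exception {
        // Simulate a server that times out twice, then answers.
        final int[] calls = {0};
        Fetcher flaky = url -> {
            if (++calls[0] < 3) throw new SocketTimeoutException("Read timed out");
            return "200 OK";
        };
        String result = fetchWithRetry(flaky, "http://example.com/", 3);
        System.out.println(result + " after " + calls[0] + " attempts");
        // prints "200 OK after 3 attempts"
    }
}
```

In the real code the equivalent change would be a broader catch (e.g. of
java.io.IOException) around the request so the fetch is recorded as a failure
and left to the normal refetch logic, instead of escaping the thread.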
Re: Read Time out problem
Posted by Stefan Groschupf <sg...@media-style.com>.
Sounds like a network problem (too slow?). Any proxy / firewall in use?
Can you manually check whether you can reach these URLs from that box?
Also try increasing http.timeout in nutch-default.xml / nutch-site.xml.
I'm not sure, but I think these kinds of failed URLs are also retried
another time (db.fetch.retry.max).
HTH
Stefan
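The two settings mentioned above can be overridden in nutch-site.xml rather
than edited in nutch-default.xml. A sketch of such a fragment; the values are
illustrative examples, not recommended defaults:

```xml
<!-- conf/nutch-site.xml: local overrides for nutch-default.xml -->
<configuration>
  <property>
    <name>http.timeout</name>
    <!-- socket timeout in milliseconds; raise this if reads time out -->
    <value>30000</value>
  </property>
  <property>
    <name>db.fetch.retry.max</name>
    <!-- how many times a failed page is retried in later fetch rounds -->
    <value>3</value>
  </property>
</configuration>
```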
On 21.12.2005 at 09:00, Nguyen Ngoc Giang wrote:
> Hi folks,
>
> When I try crawling, I get many Read Timeout errors. This error does
> not seem to be caught and handled the way http.max.delays failures are.
> I would like to handle it in the same manner as http.max.delays, that
> is, retry the page that produced the error. Can anyone suggest a way?
> And can anyone explain to me why this error happens?
>
> PS: I'm running Nutch on Windows XP SP2.
>
> [stack trace snipped; identical to the trace in the original message]