Posted to user@nutch.apache.org by Christian Weiske <ch...@netresearch.de> on 2011/08/01 08:41:07 UTC
"network timeout" on 404 pages
Hello,
I'm using the official nutch 1.3 distribution to crawl our internal
mediawiki instance. Whenever a 404 is encountered, I get a
> fetch of http://wiki.example.org/INTERN_WIKI:Impressum failed
> with: java.net.SocketTimeoutException: Read timed out
The page really does not exist:
> $ curl -I http://wiki.example.org/INTERN_WIKI:Impressum
> HTTP/1.1 404 Not Found
So I think the error message is misleading. Is that a bug?
--
Best regards
Christian Weiske
Re: "network timeout" on 404 pages
Posted by Christian Weiske <ch...@netresearch.de>.
Hello Markus,
> > > I cannot confirm this when parsing a local 404 page. What do you
> > > get when fetching that page with:
> > > bin/nutch org.apache.nutch.parse.ParserChecker
> > I get an error:
> >
> > $ time bin/nutch org.apache.nutch.parse.ParserChecker http://wiki.example.org/INTERN_WIKI:Impressum
> > Exception in thread "main" java.lang.NullPointerException
> >         at org.apache.nutch.parse.ParserChecker.main(ParserChecker.java:84)
> Strange! Can you try the parse checker on other 404 pages on the
> internet?
>
> bin/nutch org.apache.nutch.parse.ParserChecker
> http://nutch.apache.org/404
This does work for me:
------------------
$ bin/nutch org.apache.nutch.parse.ParserChecker http://nutch.apache.org/404
---------
Url
---------------
http://nutch.apache.org/404
---------
ParseData
---------
Version: 5
Status: success(1,0)
Title: 404 Not Found
Outlinks: 0
Content Metadata: Date=Mon, 01 Aug 2011 11:29:46 GMT Content-Length=309 Content-Type=text/html; charset=iso-8859-1 Connection=close Server=Apache/2.3.8 (Unix) mod_ssl/2.3.8 OpenSSL/1.0.0c
Parse Metadata: CharEncodingForConversion=windows-1252 OriginalCharEncoding=windows-1252
------------------
> Perhaps your wiki returns some funny data that the protocol plugin
> doesn't understand. Which do you use: protocol-http or
> protocol-httpclient?
I use the standard settings except for three custom ones in
conf/nutch-site.xml:
> http.agent.name, fetcher.server.delay and fetcher.threads.per.host
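In conf/nutch-site.xml those overrides look roughly like this (the values shown are illustrative placeholders, not necessarily the ones actually in use):

```xml
<!-- Illustrative nutch-site.xml overrides; the values are placeholders. -->
<property>
  <name>http.agent.name</name>
  <value>internal-wiki-crawler</value>
</property>
<property>
  <name>fetcher.server.delay</name>
  <value>1.0</value>
</property>
<property>
  <name>fetcher.threads.per.host</name>
  <value>2</value>
</property>
```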
If I understand it correctly, conf/nutch-default.xml contains
> <name>plugin.includes</name>
> <value>protocol-http|urlfilter-regex|parse-(html|tika)
> |index-(basic|anchor)|scoring-opic
> |urlnormalizer-(pass|regex|basic)</value>
so it's "protocol-http".
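To rule out the protocol plugin, one diagnostic step would be to override the plugin list in conf/nutch-site.xml and swap in protocol-httpclient; a sketch, changing only the protocol plugin relative to the default above:

```xml
<!-- Sketch: switch from protocol-http to protocol-httpclient for diagnosis. -->
<property>
  <name>plugin.includes</name>
  <value>protocol-httpclient|urlfilter-regex|parse-(html|tika)|index-(basic|anchor)|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
</property>
```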
--
Best regards
Christian Weiske
Re: "network timeout" on 404 pages
Posted by Markus Jelsma <ma...@openindex.io>.
Strange! Can you try the parse checker on other 404 pages on the
internet?
bin/nutch org.apache.nutch.parse.ParserChecker http://nutch.apache.org/404
Perhaps your wiki returns some funny data that the protocol plugin doesn't
understand. Which do you use: protocol-http or protocol-httpclient?
On Monday 01 August 2011 13:17:06 Christian Weiske wrote:
> Hello Markus,
>
> > > I'm using the official nutch 1.3 distribution to crawl our internal
> > > mediawiki instance. Whenever a 404 is encountered, I get a
> > >
> > > > fetch of http://wiki.example.org/INTERN_WIKI:Impressum failed
> > > > with: java.net.SocketTimeoutException: Read timed out
> >
> > I cannot confirm this when parsing a local 404 page. What do you get
> > when fetching that page with:
> >
> > bin/nutch org.apache.nutch.parse.ParserChecker
> > http://wiki.example.org/INTERN_WIKI:Impressum
> >
> > you should get a nice 404
>
> I get an error:
>
> $ time bin/nutch org.apache.nutch.parse.ParserChecker http://wiki.example.org/INTERN_WIKI:Impressum
> Exception in thread "main" java.lang.NullPointerException
>         at org.apache.nutch.parse.ParserChecker.main(ParserChecker.java:84)
>
> real 0m13.007s
> user 0m1.530s
> sys 0m0.150s
>
>
> Curl does it nicely:
>
> $ time curl -I http://wiki.example.org/INTERN_WIKI:Impressum
> HTTP/1.1 404 Not Found
> Date: Mon, 01 Aug 2011 11:14:57 GMT
> Server: Apache/2.2.16 (Debian)
> X-Powered-By: PHP/5.3.3-7+squeeze3
> Content-language: de
> Vary: Accept-Encoding,Cookie
> Expires: Thu, 01 Jan 1970 00:00:00 GMT
> Cache-Control: private, must-revalidate, max-age=0
> Content-Type: text/html; charset=UTF-8
>
>
> real 0m0.434s
> user 0m0.010s
> sys 0m0.000s
--
Markus Jelsma - CTO - Openindex
http://www.linkedin.com/in/markus17
050-8536620 / 06-50258350
Re: "network timeout" on 404 pages
Posted by Christian Weiske <ch...@netresearch.de>.
Hello Markus,
> > I'm using the official nutch 1.3 distribution to crawl our internal
> > mediawiki instance. Whenever a 404 is encountered, I get a
> >
> > > fetch of http://wiki.example.org/INTERN_WIKI:Impressum failed
> > > with: java.net.SocketTimeoutException: Read timed out
> I cannot confirm this when parsing a local 404 page. What do you get
> when fetching that page with:
>
> bin/nutch org.apache.nutch.parse.ParserChecker
> http://wiki.example.org/INTERN_WIKI:Impressum
>
> you should get a nice 404
I get an error:
$ time bin/nutch org.apache.nutch.parse.ParserChecker http://wiki.example.org/INTERN_WIKI:Impressum
Exception in thread "main" java.lang.NullPointerException
        at org.apache.nutch.parse.ParserChecker.main(ParserChecker.java:84)
real 0m13.007s
user 0m1.530s
sys 0m0.150s
Curl does it nicely:
$ time curl -I http://wiki.example.org/INTERN_WIKI:Impressum
HTTP/1.1 404 Not Found
Date: Mon, 01 Aug 2011 11:14:57 GMT
Server: Apache/2.2.16 (Debian)
X-Powered-By: PHP/5.3.3-7+squeeze3
Content-language: de
Vary: Accept-Encoding,Cookie
Expires: Thu, 01 Jan 1970 00:00:00 GMT
Cache-Control: private, must-revalidate, max-age=0
Content-Type: text/html; charset=UTF-8
real 0m0.434s
user 0m0.010s
sys 0m0.000s
--
Best regards
Christian Weiske
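The NPE at ParserChecker.java:84 suggests the checker dereferences its parse result without first testing whether the fetch actually produced parseable content. A minimal standalone sketch of the defensive pattern, assuming that is the cause (FetchResult and SafeParseCheck are illustrative names, not the Nutch API):

```java
import java.util.Optional;

// Hypothetical stand-in for a protocol fetch result; not the Nutch API.
record FetchResult(int statusCode, String content) {}

public class SafeParseCheck {
    // Returns a parse summary, or a status message when the fetch did not
    // yield parseable content (e.g. a 404 whose body was discarded) --
    // instead of throwing a NullPointerException.
    static String check(FetchResult result) {
        return Optional.ofNullable(result)
                .filter(r -> r.statusCode() == 200 && r.content() != null)
                .map(r -> "parsed: " + r.content())
                .orElse("fetch failed or non-200 status; nothing to parse");
    }

    public static void main(String[] args) {
        System.out.println(check(new FetchResult(404, null)));
        System.out.println(check(new FetchResult(200, "<title>ok</title>")));
    }
}
```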
Re: "network timeout" on 404 pages
Posted by Markus Jelsma <ma...@openindex.io>.
I cannot confirm this when parsing a local 404 page. What do you get when
fetching that page with:
bin/nutch org.apache.nutch.parse.ParserChecker
http://wiki.example.org/INTERN_WIKI:Impressum
you should get a nice 404
On Monday 01 August 2011 08:41:07 Christian Weiske wrote:
> Hello,
>
>
> I'm using the official nutch 1.3 distribution to crawl our internal
> mediawiki instance. Whenever a 404 is encountered, I get a
>
> > fetch of http://wiki.example.org/INTERN_WIKI:Impressum failed
> > with: java.net.SocketTimeoutException: Read timed out
>
> The page really does not exist:
> > $ curl -I http://wiki.example.org/INTERN_WIKI:Impressum
> > HTTP/1.1 404 Not Found
>
> So I think the error message is misleading. Is that a bug?
--
Markus Jelsma - CTO - Openindex
http://www.linkedin.com/in/markus17
050-8536620 / 06-50258350