Posted to user@nutch.apache.org by Christian Weiske <ch...@netresearch.de> on 2011/08/01 08:41:07 UTC

"network timeout" on 404 pages

Hello,


I'm using the official nutch 1.3 distribution to crawl our internal
mediawiki instance. Whenever a 404 is encountered, I get a 

> fetch of http://wiki.example.org/INTERN_WIKI:Impressum failed
> with: java.net.SocketTimeoutException: Read timed out

The page really does not exist:
> $ curl -I http://wiki.example.org/INTERN_WIKI:Impressum
> HTTP/1.1 404 Not Found

So I think the error message is misleading. Is that a bug?
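For reference, curl -I sends a HEAD request while the Nutch fetcher issues a GET, so a closer approximation of what the crawler actually sees would be a timed GET against the same URL (just a rough check, nothing more):

$ time curl -sv --max-time 15 -o /dev/null \
    http://wiki.example.org/INTERN_WIKI:Impressum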

-- 
Best regards
Christian Weiske

Re: "network timeout" on 404 pages

Posted by Christian Weiske <ch...@netresearch.de>.
Hello Markus,



> > > I cannot confirm this when parsing a local 404 page. What do you
> > > get when fetching that page with:
> > > bin/nutch org.apache.nutch.parse.ParserChecker
> > I get an error:
> > 
> > $ time bin/nutch org.apache.nutch.parse.ParserChecker http://wiki.example.org/INTERN_WIKI:Impressum
> > Exception in thread "main" java.lang.NullPointerException
> > 	at org.apache.nutch.parse.ParserChecker.main(ParserChecker.java:84)

> Strange! Can you confirm the parse checker with other 404 pages on
> the internet?
> 
> bin/nutch org.apache.nutch.parse.ParserChecker
> http://nutch.apache.org/404

This does work for me:
------------------
$ bin/nutch org.apache.nutch.parse.ParserChecker http://nutch.apache.org/404
---------
Url
---------------
http://nutch.apache.org/404
---------
ParseData
---------
Version: 5
Status: success(1,0)
Title: 404 Not Found
Outlinks: 0
Content Metadata: Date=Mon, 01 Aug 2011 11:29:46 GMT Content-Length=309 Content-Type=text/html; charset=iso-8859-1 Connection=close Server=Apache/2.3.8 (Unix) mod_ssl/2.3.8 OpenSSL/1.0.0c
Parse Metadata: CharEncodingForConversion=windows-1252 OriginalCharEncoding=windows-1252
------------------
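
If it helps, the full stack trace for the timeout on the wiki URL should be in the fetcher log (assuming the default log4j setup, which writes to logs/hadoop.log); that would show whether the read times out on the headers or on the body:

$ grep -B 2 -A 20 SocketTimeoutException logs/hadoop.log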


> Perhaps your wiki returns some funny data that protocol plugin
> doesn't understand. What do you use? Protocol-http or
> protocol-httpclient?

I use the standard settings except for three custom ones in
conf/nutch-site.xml:
> http.agent.name, fetcher.server.delay and fetcher.threads.per.host

If I understand it correctly, conf/nutch-default.xml contains
>  <name>plugin.includes</name>
>  <value>protocol-http|urlfilter-regex|parse-(html|tika)|index-(basic|anchor)|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
so it's "protocol-http".
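
If it helps to rule out the protocol plugin, I could copy plugin.includes into conf/nutch-site.xml and swap in protocol-httpclient as a test (the value below is just the default shown above with the plugin name replaced):

<property>
  <name>plugin.includes</name>
  <value>protocol-httpclient|urlfilter-regex|parse-(html|tika)|index-(basic|anchor)|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
</property>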


-- 
Best regards
Christian Weiske

Re: "network timeout" on 404 pages

Posted by Markus Jelsma <ma...@openindex.io>.
Strange! Can you confirm the parse checker with other 404 pages on the 
internet?

bin/nutch org.apache.nutch.parse.ParserChecker http://nutch.apache.org/404

Perhaps your wiki returns some funny data that protocol plugin doesn't 
understand. What do you use? Protocol-http or protocol-httpclient?

On Monday 01 August 2011 13:17:06 Christian Weiske wrote:
> Hello Markus,
> 
> > > I'm using the official nutch 1.3 distribution to crawl our internal
> > > mediawiki instance. Whenever a 404 is encountered, I get a
> > > 
> > > > fetch of http://wiki.example.org/INTERN_WIKI:Impressum failed
> > > > with: java.net.SocketTimeoutException: Read timed out
> > 
> > I cannot confirm this when parsing a local 404 page. What do you get
> > when fetching that page with:
> > 
> > bin/nutch org.apache.nutch.parse.ParserChecker
> > http://wiki.example.org/INTERN_WIKI:Impressum
> > 
> > you should get a nice 404
> 
> I get an error:
> 
> $ time bin/nutch org.apache.nutch.parse.ParserChecker http://wiki.example.org/INTERN_WIKI:Impressum
> Exception in thread "main" java.lang.NullPointerException
> 	at org.apache.nutch.parse.ParserChecker.main(ParserChecker.java:84)
> 
> real	0m13.007s
> user	0m1.530s
> sys	0m0.150s
> 
> 
> Curl does it nicely:
> 
> $ time curl -I http://wiki.example.org/INTERN_WIKI:Impressum
> HTTP/1.1 404 Not Found
> Date: Mon, 01 Aug 2011 11:14:57 GMT
> Server: Apache/2.2.16 (Debian)
> X-Powered-By: PHP/5.3.3-7+squeeze3
> Content-language: de
> Vary: Accept-Encoding,Cookie
> Expires: Thu, 01 Jan 1970 00:00:00 GMT
> Cache-Control: private, must-revalidate, max-age=0
> Content-Type: text/html; charset=UTF-8
> 
> 
> real	0m0.434s
> user	0m0.010s
> sys	0m0.000s

-- 
Markus Jelsma - CTO - Openindex
http://www.linkedin.com/in/markus17
050-8536620 / 06-50258350

Re: "network timeout" on 404 pages

Posted by Christian Weiske <ch...@netresearch.de>.
Hello Markus,


> > I'm using the official nutch 1.3 distribution to crawl our internal
> > mediawiki instance. Whenever a 404 is encountered, I get a
> > 
> > > fetch of http://wiki.example.org/INTERN_WIKI:Impressum failed
> > > with: java.net.SocketTimeoutException: Read timed out

> I cannot confirm this when parsing a local 404 page. What do you get
> when fetching that page with:
> 
> bin/nutch org.apache.nutch.parse.ParserChecker 
> http://wiki.example.org/INTERN_WIKI:Impressum
> 
> you should get a nice 404


I get an error:

$ time bin/nutch org.apache.nutch.parse.ParserChecker http://wiki.example.org/INTERN_WIKI:Impressum
Exception in thread "main" java.lang.NullPointerException
	at org.apache.nutch.parse.ParserChecker.main(ParserChecker.java:84)

real	0m13.007s
user	0m1.530s
sys	0m0.150s


Curl does it nicely:

$ time curl -I http://wiki.example.org/INTERN_WIKI:Impressum
HTTP/1.1 404 Not Found
Date: Mon, 01 Aug 2011 11:14:57 GMT
Server: Apache/2.2.16 (Debian)
X-Powered-By: PHP/5.3.3-7+squeeze3
Content-language: de
Vary: Accept-Encoding,Cookie
Expires: Thu, 01 Jan 1970 00:00:00 GMT
Cache-Control: private, must-revalidate, max-age=0
Content-Type: text/html; charset=UTF-8


real	0m0.434s
user	0m0.010s
sys	0m0.000s
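
The 13 seconds roughly matches Nutch's read timeout plus JVM startup: if I read conf/nutch-default.xml correctly, the relevant property is http.timeout with a default of 10000 ms. Raising it temporarily in conf/nutch-site.xml would at least tell a slow response apart from one that never completes (just a diagnostic override, not a fix):

<property>
  <name>http.timeout</name>
  <value>30000</value>
</property>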


-- 
Best regards
Christian Weiske

Re: "network timeout" on 404 pages

Posted by Markus Jelsma <ma...@openindex.io>.
I cannot confirm this when parsing a local 404 page. What do you get when 
fetching that page with:

bin/nutch org.apache.nutch.parse.ParserChecker 
http://wiki.example.org/INTERN_WIKI:Impressum

you should get a nice 404


On Monday 01 August 2011 08:41:07 Christian Weiske wrote:
> Hello,
> 
> 
> I'm using the official nutch 1.3 distribution to crawl our internal
> mediawiki instance. Whenever a 404 is encountered, I get a
> 
> > fetch of http://wiki.example.org/INTERN_WIKI:Impressum failed
> > with: java.net.SocketTimeoutException: Read timed out
> 
> The page really does not exist:
> > $ curl -I http://wiki.example.org/INTERN_WIKI:Impressum
> > HTTP/1.1 404 Not Found
> 
> So I think the error message is misleading. Is that a bug?

-- 
Markus Jelsma - CTO - Openindex
http://www.linkedin.com/in/markus17
050-8536620 / 06-50258350