Posted to user@nutch.apache.org by Rod Taylor <rb...@sitesell.com> on 2005/10/06 17:31:22 UTC
Network failure vs http error
Earlier we had a small network glitch which prevented us from retrieving
the robots.txt file for a site we were crawling at the time:
nutch-root-tasktracker-sbider1.sitebuildit.com.log:051005 193021
task_m_h02y5t Couldn't get robots.txt for
http://www.japanesetranslator.co.uk/portfolio/:
org.apache.commons.httpclient.ConnectTimeoutException: The host
did not accept the connection within timeout of 10000 ms
nutch-root-tasktracker-sbider1.sitebuildit.com.log:051005 193031
task_m_h02y5t Couldn't get robots.txt for
http://www.japanesetranslator.co.uk/translation/:
org.apache.commons.httpclient.ConnectTimeoutException: The host
did not accept the connection within timeout of 10000 ms
Nutch then assumed that because the file could not be retrieved due to
network issues, it didn't exist and the entire website could be crawled.
Nutch went on to successfully fetch a few pages that the robots.txt
lists as disallowed.
I think Nutch should keep retrying the robots.txt fetch until, at the
very least, a connection to the host can be established; otherwise the
host should be skipped until the next round of fetches.
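The distinction being proposed could be sketched roughly as follows. This is
illustrative only, not actual Nutch code: the class, enum, and classify()
method are hypothetical names, and a negative status is used here merely to
model a connect timeout where no HTTP response was ever received.

```java
// Hypothetical sketch: only a definitive HTTP answer should decide
// robots.txt policy; a network failure should defer the host instead.
public class RobotsFetchPolicy {

    public enum RobotsDecision {
        ALLOW_ALL,  // definitive 404: no robots.txt exists
        USE_RULES,  // definitive 2xx: parse and obey the rules
        DEFER_HOST  // network error: skip host until the next fetch round
    }

    // status < 0 models a ConnectTimeoutException (no response at all);
    // otherwise status is the HTTP status code the server returned.
    public static RobotsDecision classify(int status) {
        if (status < 0) {
            // The connection was never established, so we learned nothing
            // about robots.txt -- do NOT assume the site may be crawled.
            return RobotsDecision.DEFER_HOST;
        }
        if (status == 404) {
            return RobotsDecision.ALLOW_ALL;
        }
        return RobotsDecision.USE_RULES;
    }

    public static void main(String[] args) {
        System.out.println(classify(-1));   // DEFER_HOST
        System.out.println(classify(404));  // ALLOW_ALL
        System.out.println(classify(200));  // USE_RULES
    }
}
```

With a rule like this, the timeouts in the log above would have deferred the
host rather than silently granting permission to crawl everything.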
The webmaster of japanesetranslator.co.uk filed a complaint informing us
of the issue.
--
Rod Taylor <rb...@sitesell.com>