Posted to user@nutch.apache.org by Rod Taylor <rb...@sitesell.com> on 2005/10/06 17:31:22 UTC

Network failure vs HTTP error

Earlier we had a small network glitch which prevented us from retrieving
the robots.txt file for a site we were crawling at the time:

        nutch-root-tasktracker-sbider1.sitebuildit.com.log:051005 193021
        task_m_h02y5t  Couldn't get robots.txt for
        http://www.japanesetranslator.co.uk/portfolio/:
        org.apache.commons.httpclient.ConnectTimeoutException: The host
        did not accept the connection within timeout of 10000 ms
        nutch-root-tasktracker-sbider1.sitebuildit.com.log:051005 193031
        task_m_h02y5t  Couldn't get robots.txt for
        http://www.japanesetranslator.co.uk/translation/:
        org.apache.commons.httpclient.ConnectTimeoutException: The host
        did not accept the connection within timeout of 10000 ms

Nutch assumed that because we were unable to retrieve the file due to
network issues, it didn't exist and the entire website could be
crawled. It then went on to fetch a few pages that the robots.txt
explicitly lists as disallowed.
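
To make the failure mode concrete, here is a rough sketch (not Nutch's
actual code, just an illustration built on Commons HttpClient 3.x, with
names of my own choosing) of how a robots.txt fetch can end up treating
a connection timeout the same way as a genuinely missing file:

        import java.io.IOException;

        import org.apache.commons.httpclient.ConnectTimeoutException;
        import org.apache.commons.httpclient.HttpClient;
        import org.apache.commons.httpclient.HttpStatus;
        import org.apache.commons.httpclient.methods.GetMethod;

        public class RobotsFetchSketch {

            // Returns the robots.txt body, or null meaning "no rules, crawl freely".
            static String fetchRobots(String siteUrl) {
                HttpClient client = new HttpClient();
                client.getHttpConnectionManager().getParams()
                      .setConnectionTimeout(10000);
                GetMethod get = new GetMethod(siteUrl + "/robots.txt");
                try {
                    int status = client.executeMethod(get);
                    if (status == HttpStatus.SC_OK) {
                        return get.getResponseBodyAsString();
                    }
                    // A 404 really does mean the site has no robots.txt.
                    return null;
                } catch (ConnectTimeoutException e) {
                    // The host was never reached, yet this lands in the same
                    // "no rules" bucket, so the disallow list is silently dropped.
                    return null;
                } catch (IOException e) {
                    return null;
                } finally {
                    get.releaseConnection();
                }
            }
        }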

I think Nutch should keep trying to retrieve the robots.txt file until,
at the very least, it is able to establish a connection to the host;
otherwise the host should be ignored until the next round of fetches.
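
A minimal sketch of the decision I have in mind (the outcome names are
mine, not anything that exists in Nutch): only a definite answer from
the host should be allowed to widen access, anything else defers the
host.

        import org.apache.commons.httpclient.HttpStatus;

        public class RobotsDecisionSketch {

            enum Outcome { ALLOW_ALL, USE_RULES, DEFER_HOST }

            // statusCode is -1 when the connection itself failed
            // (timeout, refused, DNS error).
            static Outcome decide(int statusCode) {
                if (statusCode == HttpStatus.SC_OK) {
                    return Outcome.USE_RULES;   // parse and obey the file
                }
                if (statusCode == HttpStatus.SC_NOT_FOUND) {
                    return Outcome.ALLOW_ALL;   // host answered: no robots.txt exists
                }
                // Network failure or any other ambiguous answer: skip the host
                // until the next round of fetches instead of assuming ALLOW_ALL.
                return Outcome.DEFER_HOST;
            }
        }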

The webmaster of japanesetranslator.co.uk filed a complaint informing us
of the issue.
-- 
Rod Taylor <rb...@sitesell.com>