Posted to dev@nutch.apache.org by "Rod Taylor (JIRA)" <ji...@apache.org> on 2005/10/06 17:43:48 UTC

[jira] Created: (NUTCH-105) Network error during robots.txt fetch causes file to be ignored

Network error during robots.txt fetch causes file to be ignored
---------------------------------------------------------------

         Key: NUTCH-105
         URL: http://issues.apache.org/jira/browse/NUTCH-105
     Project: Nutch
        Type: Bug
    Versions: 0.8-dev    
    Reporter: Rod Taylor


Earlier we had a small network glitch which prevented us from retrieving
the robots.txt file for a site we were crawling at the time:

        nutch-root-tasktracker-sbider1.sitebuildit.com.log:051005 193021
        task_m_h02y5t  Couldn't get robots.txt for
        http://www.japanesetranslator.co.uk/portfolio/:
        org.apache.commons.httpclient.ConnectTimeoutException: The host
        did not accept the connection within timeout of 10000 ms
        nutch-root-tasktracker-sbider1.sitebuildit.com.log:051005 193031
        task_m_h02y5t  Couldn't get robots.txt for
        http://www.japanesetranslator.co.uk/translation/:
        org.apache.commons.httpclient.ConnectTimeoutException: The host
        did not accept the connection within timeout of 10000 ms

Nutch then assumed that, because we were unable to retrieve the file due
to network issues, it didn't exist and the entire website could be
crawled. Nutch went on to fetch several pages that the robots.txt
explicitly lists as disallowed.

I think Nutch should keep trying to retrieve the robots.txt file until,
at the very least, a connection to the host can be established;
otherwise the host should be skipped until the next round of fetches.

The webmaster of japanesetranslator.co.uk filed a complaint informing us
of the issue.

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators:
   http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see:
   http://www.atlassian.com/software/jira


[jira] Resolved: (NUTCH-105) Network error during robots.txt fetch causes file to be ignored

Posted by "Sami Siren (JIRA)" <ji...@apache.org>.
     [ http://issues.apache.org/jira/browse/NUTCH-105?page=all ]

Sami Siren resolved NUTCH-105.
------------------------------

    Resolution: Fixed

This is now committed, thanks!


[jira] Updated: (NUTCH-105) Network error during robots.txt fetch causes file to be ignored

Posted by "Sami Siren (JIRA)" <ji...@apache.org>.
     [ http://issues.apache.org/jira/browse/NUTCH-105?page=all ]

Sami Siren updated NUTCH-105:
-----------------------------

    Fix Version/s: 0.8.1
                   0.9.0

Looks OK to me. If there are no objections, I'll commit this before 0.8.1.


[jira] Updated: (NUTCH-105) Network error during robots.txt fetch causes file to be ignored

Posted by "Greg Kim (JIRA)" <ji...@apache.org>.
     [ http://issues.apache.org/jira/browse/NUTCH-105?page=all ]

Greg Kim updated NUTCH-105:
---------------------------

    Attachment: RobotRulesParser.patch

This patch stops caching robots.txt results on network errors or delays. Currently we cache EMPTY_RULES (which allows everything) for a host on a network error or delay, and that becomes a serious problem when the network recovers during the same crawl iteration: Nutch will then crawl *everything* on that host, because EMPTY_RULES was cached by the first failed robots.txt GET (a network failure, not a 404).
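
To make the before/after concrete, here is a minimal sketch of the caching
change, assuming a simple per-host rule cache; the class, cache field, and
download helper are hypothetical stand-ins, not the literal contents of
RobotRulesParser.patch:

import java.io.IOException;
import java.util.HashMap;
import java.util.Map;

class RobotRulesCacheSketch {

    static final String EMPTY_RULES = "ALLOW_ALL";  // stand-in for "allow everything"
    private final Map<String, String> cache = new HashMap<String, String>();

    String getRulesFor(String host) {
        String rules = cache.get(host);
        if (rules != null) {
            return rules;             // definitive answer from an earlier attempt
        }
        try {
            rules = download(host);   // returns EMPTY_RULES on a genuine 404
            cache.put(host, rules);   // cache only definitive answers
            return rules;
        } catch (IOException e) {
            // Before the patch: EMPTY_RULES was cached here, so one transient
            // network failure opened the whole host for the rest of the crawl
            // iteration. After the patch: nothing is cached, so the next URL
            // from this host retries the robots.txt fetch.
            return null;              // caller skips/defers the host
        }
    }

    String download(String host) throws IOException {
        // ... HTTP GET of http://<host>/robots.txt; parse the body into
        // rules, or return EMPTY_RULES on a 404 ...
        return EMPTY_RULES; // placeholder
    }
}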


[jira] Updated: (NUTCH-105) Network error during robots.txt fetch causes file to be ignored

Posted by "Greg Kim (JIRA)" <ji...@apache.org>.
     [ http://issues.apache.org/jira/browse/NUTCH-105?page=all ]

Greg Kim updated NUTCH-105:
---------------------------

    Component/s: fetcher
       Priority: Critical  (was: Major)


Any hope of getting this patch committed? It's a simple fix for a potentially big problem. I've seen the problem multiple times, and it evokes great anger among webmasters.


[jira] Updated: (NUTCH-105) Network error during robots.txt fetch causes file to be ignored

Posted by "Greg Kim (JIRA)" <ji...@apache.org>.
     [ http://issues.apache.org/jira/browse/NUTCH-105?page=all ]

Greg Kim updated NUTCH-105:
---------------------------

    Affects Version/s: 0.8.1
                       0.9.0


[jira] Closed: (NUTCH-105) Network error during robots.txt fetch causes file to be ignored

Posted by "Sami Siren (JIRA)" <ji...@apache.org>.
     [ http://issues.apache.org/jira/browse/NUTCH-105?page=all ]

Sami Siren closed NUTCH-105.
----------------------------

