Posted to dev@nutch.apache.org by "Enrique Berlanga (JIRA)" <ji...@apache.org> on 2010/11/23 18:52:13 UTC

[jira] Created: (NUTCH-938) Impossible to fetch sites with robots.txt

Impossible to fetch sites with robots.txt
------------------------------------------

                 Key: NUTCH-938
                 URL: https://issues.apache.org/jira/browse/NUTCH-938
             Project: Nutch
          Issue Type: Bug
          Components: fetcher
    Affects Versions: 1.2
         Environment: Red Hat, Nutch 1.2, Java 1.6
            Reporter: Enrique Berlanga


Crawling a site with a robots.txt file like this:
-------------------
User-agent: *
Disallow: /
-------------------
No links are followed. 

It doesn't matter what value is set for the "protocol.plugin.check.blocking" or "protocol.plugin.check.robots" properties, because they are overridden in the class org.apache.nutch.fetcher.Fetcher:

// set non-blocking & no-robots mode for HTTP protocol plugins.
    getConf().setBoolean(Protocol.CHECK_BLOCKING, false);
    getConf().setBoolean(Protocol.CHECK_ROBOTS, false);

False is the desired value, but in the FetcherThread inner class the robot rules are checked anyway, ignoring the configuration:
----------------
RobotRules rules = protocol.getRobotRules(fit.url, fit.datum);
if (!rules.isAllowed(fit.u)) {
 ...
LOG.debug("Denied by robots.txt: " + fit.url);
...
continue;
}
-----------------------

I suppose there is no problem in disabling that part of the code directly for the HTTP protocol. If so, I could submit a patch as soon as possible to get past this.
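
To make that concrete, the least invasive shape I can think of would be to honour the existing flag instead of deleting the check outright. This is only a rough sketch, not a finished patch: the getBoolean() read and its default value are my additions, getConf() stands for whatever Configuration handle FetcherThread already holds, and the inner block is the code quoted above:
----------------
// Sketch: run the robots.txt check only when protocol.plugin.check.robots
// (Protocol.CHECK_ROBOTS) is enabled; the default stays true, i.e. polite.
boolean checkRobots = getConf().getBoolean(Protocol.CHECK_ROBOTS, true);
if (checkRobots) {
  RobotRules rules = protocol.getRobotRules(fit.url, fit.datum);
  if (!rules.isAllowed(fit.u)) {
    LOG.debug("Denied by robots.txt: " + fit.url);
    // ... handle the denied URL exactly as the current code does ...
    continue;
  }
}
----------------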

Thanks in advance



[jira] Commented: (NUTCH-938) Impossible to fetch sites with robots.txt

Posted by "Andrzej Bialecki (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/NUTCH-938?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12935745#action_12935745 ] 

Andrzej Bialecki  commented on NUTCH-938:
-----------------------------------------

These two properties are documented in nutch-default.xml, but they are mostly for internal use by Nutch. Other implementations of Fetcher (the OldFetcher) used to delegate the robot and politeness controls to protocol plugins. The current implementation of Fetcher performs these tasks itself, although in 1.2 protocol plugins still retain the code to implement these controls per protocol. In 1.3 (unreleased) and trunk this support has been removed from protocol plugins, so these lines will have no effect.



[jira] Commented: (NUTCH-938) Impossible to fetch sites with robots.txt

Posted by "Andrzej Bialecki (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/NUTCH-938?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12935555#action_12935555 ] 

Andrzej Bialecki  commented on NUTCH-938:
-----------------------------------------

Nutch behavior in this case is correct. The goal of Nutch is to implement a well-behaved crawler that obeys robot rules and netiquette. Your patch simply disables these control mechanisms. If it works for you and you can risk the wrath of webmasters, that's fine; you are free to use this patch, but Nutch as a project cannot encourage such a practice.

Consequently I'm going to mark this issue as Won't Fix.



[jira] Updated: (NUTCH-938) Impossible to fetch sites with robots.txt

Posted by "Enrique Berlanga (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/NUTCH-938?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Enrique Berlanga updated NUTCH-938:
-----------------------------------

    Attachment: NUTCH-938.patch

Patch solving the NUTCH-938 issue with the robots.txt file on some sites



[jira] Commented: (NUTCH-938) Impossible to fetch sites with robots.txt

Posted by "Enrique Berlanga (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/NUTCH-938?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12935728#action_12935728 ] 

Enrique Berlanga commented on NUTCH-938:
----------------------------------------

Thanks for your answer. I agree that Nutch as a project cannot encourage such a practice, but maybe some code in the Protocol or Fetcher class needs to be removed from the official source. If not, it's hard to understand why these lines appear in the main method of the class ...
--------
// set non-blocking & no-robots mode for HTTP protocol plugins.
getConf().setBoolean(Protocol.CHECK_BLOCKING, false);
getConf().setBoolean(Protocol.CHECK_ROBOTS, false);
--------
... while later, in the fetcher thread, those values are ignored.
Maybe some notes in nutch-default.xml marking these properties as deprecated would be great.
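
To spell out the first half of that point, the override itself is plain Configuration behaviour: whatever the user sets is replaced before FetcherThread ever runs, and FetcherThread does not read the property at all. A throwaway example outside Nutch (the class name is invented for illustration):
----------------
import org.apache.hadoop.conf.Configuration;

// Throwaway illustration, not Nutch code: the later setBoolean() call
// simply replaces whatever the user configured, just as Fetcher does.
public class CheckRobotsOverrideDemo {
  public static void main(String[] args) {
    Configuration conf = new Configuration();
    conf.setBoolean("protocol.plugin.check.robots", true);   // user's setting
    conf.setBoolean("protocol.plugin.check.robots", false);  // Fetcher's override
    // Prints "false": the user's value never reaches the fetcher threads.
    System.out.println(conf.getBoolean("protocol.plugin.check.robots", true));
  }
}
----------------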

My question is: is there any reason to force it to false? A well-behaved crawler that obeys robot rules and netiquette should force it to true, which is what leaves me a little confused about that part of the code. I would prefer to be free to change the behaviour by setting the "protocol.plugin.check.robots" value in nutch-site.xml.
Thanks in advance



[jira] Closed: (NUTCH-938) Impossible to fetch sites with robots.txt

Posted by "Enrique Berlanga (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/NUTCH-938?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Enrique Berlanga closed NUTCH-938.
----------------------------------

    Resolution: Won't Fix

Resolved as "Won't Fix" according to Andrzej Bialecki's comments.



[jira] Updated: (NUTCH-938) Impossible to fetch sites with robots.txt

Posted by "Enrique Berlanga (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/NUTCH-938?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Enrique Berlanga updated NUTCH-938:
-----------------------------------

    Description: 
Crawling a site with a robots.txt file like this (e.g. http://www.melilla.es):
-------------------
User-agent: *
Disallow: /
-------------------
No links are followed. 

It doesn't matter what value is set for the "protocol.plugin.check.blocking" or "protocol.plugin.check.robots" properties, because they are overridden in the class org.apache.nutch.fetcher.Fetcher:

// set non-blocking & no-robots mode for HTTP protocol plugins.
    getConf().setBoolean(Protocol.CHECK_BLOCKING, false);
    getConf().setBoolean(Protocol.CHECK_ROBOTS, false);

False is the desired value, but in the FetcherThread inner class the robot rules are checked anyway, ignoring the configuration:
----------------
RobotRules rules = protocol.getRobotRules(fit.url, fit.datum);
if (!rules.isAllowed(fit.u)) {
 ...
LOG.debug("Denied by robots.txt: " + fit.url);
...
continue;
}
-----------------------

I suppose there is no problem in disabling that part of the code directly for the HTTP protocol. If so, I could submit a patch as soon as possible to get past this.

Thanks in advance


