Posted to dev@nutch.apache.org by "Enrique Berlanga (JIRA)" <ji...@apache.org> on 2010/11/23 19:13:14 UTC
[jira] Updated: (NUTCH-938) Impossible to fetch sites with robots.txt
[ https://issues.apache.org/jira/browse/NUTCH-938?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Enrique Berlanga updated NUTCH-938:
-----------------------------------
Description:
Crawling a site with a robots.txt file like this (e.g. http://www.melilla.es):
-------------------
User-agent: *
Disallow: /
-------------------
No links are followed.
It doesn't matter what value is set for the "protocol.plugin.check.blocking" or "protocol.plugin.check.robots" properties, because they are overridden in the class org.apache.nutch.fetcher.Fetcher:
// set non-blocking & no-robots mode for HTTP protocol plugins.
getConf().setBoolean(Protocol.CHECK_BLOCKING, false);
getConf().setBoolean(Protocol.CHECK_ROBOTS, false);
False is the desired value, but in the FetcherThread inner class the robot rules are checked anyway, ignoring the configuration:
----------------
RobotRules rules = protocol.getRobotRules(fit.url, fit.datum);
if (!rules.isAllowed(fit.u)) {
...
LOG.debug("Denied by robots.txt: " + fit.url);
...
continue;
}
-----------------------
I suppose there is no problem in disabling that part of the code directly for the HTTP protocol. If so, I could submit a patch as soon as possible to get past this.
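A minimal, self-contained sketch of the kind of guard the patch could add: only consult robots.txt when the configuration asks for it. Conf, shouldFetch, and RobotsCheckSketch below are illustrative stand-ins, not actual Nutch or Hadoop classes; only the property name "protocol.plugin.check.robots" comes from the issue itself.

```java
import java.util.HashMap;
import java.util.Map;

// Illustrative stand-in for Hadoop's Configuration (getBoolean/setBoolean only).
class Conf {
    private final Map<String, Boolean> flags = new HashMap<>();
    void setBoolean(String key, boolean value) { flags.put(key, value); }
    boolean getBoolean(String key, boolean defaultValue) {
        return flags.getOrDefault(key, defaultValue);
    }
}

public class RobotsCheckSketch {
    static final String CHECK_ROBOTS = "protocol.plugin.check.robots";

    // Proposed guard: the robots.txt verdict is only enforced when the
    // configuration property is true; otherwise the URL is fetched anyway.
    static boolean shouldFetch(Conf conf, boolean allowedByRobots) {
        if (conf.getBoolean(CHECK_ROBOTS, true) && !allowedByRobots) {
            return false; // denied by robots.txt
        }
        return true;
    }

    public static void main(String[] args) {
        Conf conf = new Conf();
        conf.setBoolean(CHECK_ROBOTS, false); // what Fetcher sets today
        // With the guard, a robots-disallowed URL is still fetched:
        System.out.println(shouldFetch(conf, false)); // prints "true"
        conf.setBoolean(CHECK_ROBOTS, true);
        System.out.println(shouldFetch(conf, false)); // prints "false"
    }
}
```

In Fetcher itself the same conditional would wrap the existing `rules.isAllowed(fit.u)` check, so the two `setBoolean(..., false)` calls quoted above would actually take effect.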
Thanks in advance
was: (the same description, without the example URL)
> Impossible to fetch sites with robots.txt
> -----------------------------------------
>
> Key: NUTCH-938
> URL: https://issues.apache.org/jira/browse/NUTCH-938
> Project: Nutch
> Issue Type: Bug
> Components: fetcher
> Affects Versions: 1.2
> Environment: Red Hat, Nutch 1.2, Java 1.6
> Reporter: Enrique Berlanga
>
--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.