Posted to dev@manifoldcf.apache.org by "Karl Wright (JIRA)" <ji...@apache.org> on 2011/02/02 19:15:29 UTC
[jira] Commented: (CONNECTORS-153) Crawler should follow the robots
meta tag rules
[ https://issues.apache.org/jira/browse/CONNECTORS-153?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12989734#comment-12989734 ]
Karl Wright commented on CONNECTORS-153:
----------------------------------------
Didn't quite work. Also needed code in r1066559.
> Crawler should follow the robots meta tag rules
> -----------------------------------------------
>
> Key: CONNECTORS-153
> URL: https://issues.apache.org/jira/browse/CONNECTORS-153
> Project: ManifoldCF
> Issue Type: Improvement
> Components: Web connector
> Affects Versions: ManifoldCF 0.1
> Reporter: Erlend Garåsen
> Assignee: Karl Wright
> Fix For: ManifoldCF next
>
>
> The web crawler obeys robots.txt files, but not the robots meta tag rules. If a document includes the following meta tag, the crawler ignores the tag and fetches the document anyway:
> <meta name="robots" content="noindex, nofollow" />
> I would recommend making the following changes to improve the crawler when one of the "Obey robots.txt ..." options is set:
> 1. <meta name="robots" content="noindex, nofollow" />
> - do not fetch the document at all
> 2. <meta name="robots" content="noindex, follow" />
> - only follow the other links in this document
> 3. <meta name="robots" content="index, nofollow" />
> - fetch the document, but do not follow any links in it.
> 4. Change most of the text that appears on the page for robots option settings to something like:
> "Robots.txt usage" => "Robots.txt and Robots <meta> tag usage"
> "Don't look at robots.txt" => "Ignore robots settings"
> "Obey robots.txt for data caches only" => "Follow robots rules for data caches only"
> "Obey robots.txt for all fetches" => "Follow robots rules for all fetches"
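
As a rough illustration of rules 1 to 3 in the quoted description, the sketch below maps the content value of a robots meta tag to two decisions: whether the document may be ingested and whether its links may be followed. It is not ManifoldCF code and not the change in r1066559; the class name RobotsMetaSketch and its fields are hypothetical. Because the tag only becomes visible once the page has been downloaded, the sketch assumes that "do not fetch" in rule 1 means "do not ingest". It also assumes that the standard "none" directive should behave like "noindex, nofollow" and that a restrictive directive wins over a conflicting permissive one.

```java
import java.util.Locale;

/** Hypothetical helper, not part of ManifoldCF: interprets a robots meta tag's content value. */
final class RobotsMetaSketch {
    /** True if the document itself may be ingested. */
    final boolean index;
    /** True if links found in the document may be queued for crawling. */
    final boolean follow;

    private RobotsMetaSketch(boolean index, boolean follow) {
        this.index = index;
        this.follow = follow;
    }

    /**
     * Parses a value such as "noindex, follow".
     * A missing tag (null) and unrecognized directives leave the permissive defaults in place.
     */
    static RobotsMetaSketch parse(String content) {
        boolean index = true;
        boolean follow = true;
        if (content != null) {
            for (String token : content.split(",")) {
                switch (token.trim().toLowerCase(Locale.ROOT)) {
                    case "noindex":
                        index = false;
                        break;
                    case "nofollow":
                        follow = false;
                        break;
                    case "none":
                        // Assumed equivalent to "noindex, nofollow".
                        index = false;
                        follow = false;
                        break;
                    default:
                        // "index", "follow", "all" and unknown tokens change nothing,
                        // so a restrictive directive is never undone.
                        break;
                }
            }
        }
        return new RobotsMetaSketch(index, follow);
    }

    public static void main(String[] args) {
        String[] samples = {"noindex, nofollow", "noindex, follow", "index, nofollow"};
        for (String sample : samples) {
            RobotsMetaSketch rules = parse(sample);
            System.out.println(sample + " -> index=" + rules.index + ", follow=" + rules.follow);
        }
    }
}
```

A crawler would presumably consult such a result only when one of the robots options from item 4 is enabled, and skip the check entirely under "Ignore robots settings".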
--
This message is automatically generated by JIRA.
-
For more information on JIRA, see: http://www.atlassian.com/software/jira