Posted to dev@manifoldcf.apache.org by "Erlend Garåsen (JIRA)" <ji...@apache.org> on 2011/01/28 11:29:43 UTC

[jira] Created: (CONNECTORS-153) Crawler should follow the robots meta tag rules

Crawler should follow the robots meta tag rules
-----------------------------------------------

                 Key: CONNECTORS-153
                 URL: https://issues.apache.org/jira/browse/CONNECTORS-153
             Project: ManifoldCF
          Issue Type: Improvement
          Components: Web connector
    Affects Versions: ManifoldCF 0.1
            Reporter: Erlend Garåsen
             Fix For: ManifoldCF next


The web crawler obeys robots.txt files, but not the robots meta tag rules. If a document includes the following meta tag, the crawler ignores the tag and fetches the document anyway:
<meta name="robots" content="noindex, nofollow" />

I recommend the following changes to improve the crawler when one of the "Obey robots.txt ..." options is set:

1. <meta name="robots" content="noindex, nofollow" />
- do not fetch the document at all

2. <meta name="robots" content="noindex, follow" />
- only follow the other links in this document

3. <meta name="robots" content="index, nofollow" />
- fetch the document, but do not follow any links in it.

4. Change most of the text that appears on the page for the robots option settings to something like:
"Robots.txt usage" => "Robots.txt and Robots <meta> tag usage"
"Don't look at robots.txt" => "Ignore robots settings"
"Obey robots.txt for data caches only" => "Follow robots rules for data caches only"
"Obey robots.txt for all fetches" => "Follow robots rules for all fetches"
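
The three tag cases above can be sketched as a small parser. This is only an illustration in Python using the standard html.parser module, not the web connector's actual code; the names RobotsMetaParser and robots_meta_policy are hypothetical. Note that a crawler has to fetch a page before it can read its meta tags, so the sketch treats "noindex" as "do not index the document" rather than "do not fetch". It also honors the standard "none" value, which is shorthand for "noindex, nofollow".

```python
from html.parser import HTMLParser


class RobotsMetaParser(HTMLParser):
    """Collects directives from <meta name="robots" content="..."> tags.

    Hypothetical helper for illustration; not part of ManifoldCF.
    """

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        # Also invoked for self-closing tags such as <meta ... />.
        if tag != "meta":
            return
        attr = {name: (value or "") for name, value in attrs}
        if attr.get("name", "").strip().lower() != "robots":
            return
        # The content attribute is a comma-separated, case-insensitive list.
        for token in attr.get("content", "").split(","):
            token = token.strip().lower()
            if token:
                self.directives.add(token)


def robots_meta_policy(html):
    """Return (index_allowed, follow_allowed) for an HTML document."""
    parser = RobotsMetaParser()
    parser.feed(html)
    found = parser.directives
    # "none" is equivalent to "noindex, nofollow".
    index_allowed = not ({"noindex", "none"} & found)
    follow_allowed = not ({"nofollow", "none"} & found)
    return index_allowed, follow_allowed


if __name__ == "__main__":
    page = '<html><head><meta name="robots" content="noindex, follow" /></head></html>'
    index_allowed, follow_allowed = robots_meta_policy(page)
    print("index:", index_allowed, "follow:", follow_allowed)
```

A page without a robots meta tag yields (True, True), which matches the default of indexing the document and following its links.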



-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Commented: (CONNECTORS-153) Crawler should follow the robots meta tag rules

Posted by "Karl Wright (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/CONNECTORS-153?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12988042#action_12988042 ] 

Karl Wright commented on CONNECTORS-153:
----------------------------------------

Question:  Do you see a need for interpretation of the <meta> robots tag to be controlled by configuration?  I was thinking the crawler should simply obey the tag all the time.  The reason you can override robots.txt rules is that people get them wrong a lot, but it's hard to imagine per-page tags being wrong consistently.



[jira] Assigned: (CONNECTORS-153) Crawler should follow the robots meta tag rules

Posted by "Karl Wright (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/CONNECTORS-153?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Karl Wright reassigned CONNECTORS-153:
--------------------------------------

    Assignee: Karl Wright


[jira] Resolved: (CONNECTORS-153) Crawler should follow the robots meta tag rules

Posted by "Karl Wright (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/CONNECTORS-153?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Karl Wright resolved CONNECTORS-153.
------------------------------------

    Resolution: Fixed

r1064643.  Still awaiting confirmation of fix from reporter.



[jira] Commented: (CONNECTORS-153) Crawler should follow the robots meta tag rules

Posted by "Karl Wright (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/CONNECTORS-153?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12988063#action_12988063 ] 

Karl Wright commented on CONNECTORS-153:
----------------------------------------

Also, r1064645, CHANGES.txt



[jira] Commented: (CONNECTORS-153) Crawler should follow the robots meta tag rules

Posted by "Erlend Garåsen (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/CONNECTORS-153?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12988046#action_12988046 ] 

Erlend Garåsen commented on CONNECTORS-153:
-------------------------------------------

I guess you're right. No, I don't see a particular need to control how these meta tags are interpreted by the crawler. I was just thinking that all these rules could be configured in the same place and that the text should be adjusted accordingly. I think the best thing is to obey these meta tags all the time, just as you suggest.
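
The outcome of this exchange can be summarized as a small decision function. This is a simplified sketch in Python, not the connector's implementation: the function name may_ingest and its parameters are hypothetical, and the single obey_robots_txt flag stands in for the job's robots.txt setting without modeling its individual options. The robots.txt verdict counts only when the job is configured to obey it, while the meta tag verdict always applies.

```python
def may_ingest(robots_txt_allows, meta_index_allowed, obey_robots_txt):
    """Decide whether a fetched document may be indexed.

    robots_txt_allows:  verdict from the site's robots.txt for this URL
    meta_index_allowed: False if the page carries a robots meta "noindex"
    obey_robots_txt:    hypothetical flag standing in for the job setting
    """
    # robots.txt can be overridden by configuration.
    if obey_robots_txt and not robots_txt_allows:
        return False
    # The per-page meta tag is obeyed unconditionally.
    return meta_index_allowed


if __name__ == "__main__":
    # robots.txt disallows the URL, but the job is set to ignore robots.txt.
    print(may_ingest(False, True, obey_robots_txt=False))
    # The page says noindex; no setting overrides that.
    print(may_ingest(True, False, obey_robots_txt=False))
```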


[jira] Commented: (CONNECTORS-153) Crawler should follow the robots meta tag rules

Posted by "Karl Wright (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/CONNECTORS-153?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12989734#comment-12989734 ] 

Karl Wright commented on CONNECTORS-153:
----------------------------------------

Didn't quite work.  Also needed code in r1066559.
