Posted to infrastructure-issues@apache.org by "Brett Porter (JIRA)" <ji...@apache.org> on 2007/08/29 14:51:31 UTC

[jira] Created: (INFRA-1343) setup robots.txt and/or other access rules to prevent bots from crawling Continuum pages

setup robots.txt and/or other access rules to prevent bots from crawling Continuum pages 
-----------------------------------------------------------------------------------------

                 Key: INFRA-1343
                 URL: https://issues.apache.org/jira/browse/INFRA-1343
             Project: Infrastructure
          Issue Type: Task
      Security Level: public (Regular issues)
          Components: Continuum
            Reporter: Brett Porter


We don't need search engines crawling the build pages (especially since a crawler can navigate all the way through a working copy). They are presumably picking up links from the mails sent out to mailing lists.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Resolved: (INFRA-1343) setup robots.txt and/or other access rules to prevent bots from crawling Continuum pages

Posted by "Daniel Kulp (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/INFRA-1343?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Daniel Kulp resolved INFRA-1343.
--------------------------------

    Resolution: Fixed


OK. I'm an idiot who needs more coffee before doing anything Monday mornings. This is resolved for CONTINUUM, just not Confluence. New bug logged there.

[jira] Reopened: (INFRA-1343) setup robots.txt and/or other access rules to prevent bots from crawling Continuum pages

Posted by "Daniel Kulp (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/INFRA-1343?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Daniel Kulp reopened INFRA-1343:
--------------------------------



I'm going to reopen this as the current solution is extremely problematic.

It only has:
Disallow: /confluence/


That means the "static" content for all the spaces is indexable by crawlers. For sites that copy the content to their project spaces, it gets indexed both at cwiki and in the "real" spots. In many cases, the cwiki pages show up in Google search results instead of the real pages.

Basically, we need a way for each space to "opt out" of being indexed on cwiki.

For the short term, can we add:

Disallow: /CXF/
Disallow: /CXF20DOC/
Disallow: /ACTIVEMQ/
Disallow: /CAMEL/
Disallow: /SM/
Disallow: /SMX3/
Disallow: /SMX4/
Disallow: /SMX4KNL/
Disallow: /SMX4NMR/
Disallow: /SMX4RUN/
Disallow: /SMXCOMP/
Disallow: /TUSCANY/


There are probably a bunch of others as well. I almost want to suggest that the default be disallowed, with an "opt in" per space; I'm just not sure how to accomplish that.
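
One possible way to get "disallowed by default, opt in per space" is to generate robots.txt from a list of opted-in spaces. The sketch below is only an illustration, not what was deployed on cwiki: the helper names build_robots_txt and can_crawl and the space key EXAMPLE are made up here. It relies on the Allow directive, which major crawlers honour but which was not part of the original robots.txt convention, and it uses Python's standard urllib.robotparser to check the result.

```python
from urllib.robotparser import RobotFileParser


def build_robots_txt(opted_in_spaces):
    """Return robots.txt text that blocks everything except opted-in spaces.

    `opted_in_spaces` is a hypothetical collection of space keys.
    """
    lines = ["User-agent: *"]
    # Allow lines go first: parsers that stop at the first matching rule
    # (urllib.robotparser is one) see the exception before the blanket rule.
    for space in sorted(opted_in_spaces):
        lines.append(f"Allow: /{space}/")
    lines.append("Disallow: /")
    return "\n".join(lines) + "\n"


def can_crawl(robots_txt, path, agent="Googlebot"):
    """Check whether `agent` may fetch `path` under the given rules."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent, path)


# EXAMPLE is a placeholder space key, not a real cwiki space.
robots_txt = build_robots_txt({"EXAMPLE"})
print(robots_txt)
print(can_crawl(robots_txt, "/EXAMPLE/index.html"))
print(can_crawl(robots_txt, "/CXF/index.html"))
```

Under this scheme a new space stays out of search results until someone adds its key to the opted-in list, so nothing has to be remembered when a space is created.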




[jira] Commented: (INFRA-1343) setup robots.txt and/or other access rules to prevent bots from crawling Continuum pages

Posted by "#asfinfra IRC Bot (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/INFRA-1343?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12715395#action_12715395 ] 

#asfinfra IRC Bot commented on INFRA-1343:
------------------------------------------

<gmcdonald> I don't see the need for a bot to crawl anywhere on vmbuild, so should we just set disallow=* on all of it?
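
In robots.txt terms, "disallow=*" would normally be written as a "User-agent: *" record with "Disallow: /". The sketch below is an assumption about what such a file could look like, not a copy of the file actually installed on vmbuild. The helper is_blocked and the sample path are hypothetical, and the check uses Python's standard urllib.robotparser.

```python
from urllib.robotparser import RobotFileParser

# Blanket rule: every user agent is asked to stay away from every path.
BLOCK_ALL = "User-agent: *\nDisallow: /\n"


def is_blocked(robots_txt, path, agent):
    """Return True if `agent` is asked not to fetch `path`."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return not parser.can_fetch(agent, path)


# The path is a made-up example of a build page URL.
for agent in ("Googlebot", "Slurp"):
    print(agent, is_blocked(BLOCK_ALL, "/continuum/some-build-page", agent))
```

This only helps with crawlers that respect robots.txt; anything that ignores the file would still need server-side access rules.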


[jira] Commented: (INFRA-1343) setup robots.txt and/or other access rules to prevent bots from crawling Continuum pages

Posted by "#asfinfra IRC Bot (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/INFRA-1343?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12715397#action_12715397 ] 

#asfinfra IRC Bot commented on INFRA-1343:
------------------------------------------

<brettporter> yep, that makes sense to me


[jira] Closed: (INFRA-1343) setup robots.txt and/or other access rules to prevent bots from crawling Continuum pages

Posted by "Gavin (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/INFRA-1343?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Gavin closed INFRA-1343.
------------------------

    Resolution: Fixed

Done

[jira] Reopened: (INFRA-1343) setup robots.txt and/or other access rules to prevent bots from crawling Continuum pages

Posted by "Brett Porter (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/INFRA-1343?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Brett Porter reopened INFRA-1343:
---------------------------------


We need to dig this up / recreate it (and apply it to Archiva too). Googlebot and Slurp hit some problematic pages simultaneously yesterday and spiked the load.

[jira] Closed: (INFRA-1343) setup robots.txt and/or other access rules to prevent bots from crawling Continuum pages

Posted by "Gavin (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/INFRA-1343?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Gavin closed INFRA-1343.
------------------------

    Resolution: Fixed

This has been done recently.
