Posted to infrastructure-issues@apache.org by "Paul Querna (JIRA)" <ji...@apache.org> on 2009/12/10 19:20:19 UTC

[jira] Created: (INFRA-2372) Produce sitemaps for services on Brutus

Produce sitemaps for services on Brutus
---------------------------------------

                 Key: INFRA-2372
                 URL: https://issues.apache.org/jira/browse/INFRA-2372
             Project: Infrastructure
          Issue Type: Improvement
      Security Level: public (Regular issues)
          Components: Bugzilla, Confluence, JIRA
            Reporter: Paul Querna


We are currently seeing a massive draw of bandwidth to brutus.apache.org, almost entirely from Googlebot / MSNbot.

To resolve this without blocking the robots, we should produce XML sitemaps for Bugzilla, Confluence, and Jira.

For BZ/Jira, it should be as simple as putting every issue into the sitemap, with a last modified time of the last comment.
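
As a rough illustration of the idea above, the sketch below turns (issue key, last-modified time) pairs into a sitemap document. It is not the code used on Brutus; the `BROWSE_URL` constant and the `build_urlset` name are assumptions made for this example, and the timestamps are expected to be timezone-aware.

```python
from datetime import datetime, timezone
from xml.sax.saxutils import escape

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"
# Assumed base URL for issue pages; adjust for the real deployment.
BROWSE_URL = "https://issues.apache.org/jira/browse/"


def build_urlset(issues):
    """Return sitemap XML for an iterable of (issue_key, last_modified) pairs."""
    lines = ['<?xml version="1.0" encoding="UTF-8"?>',
             f'<urlset xmlns="{SITEMAP_NS}">']
    for key, modified in issues:
        # <lastmod> takes a W3C datetime; normalise to UTC first.
        lastmod = modified.astimezone(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")
        lines.append("  <url>")
        lines.append(f"    <loc>{escape(BROWSE_URL + key)}</loc>")
        lines.append(f"    <lastmod>{lastmod}</lastmod>")
        lines.append("  </url>")
    lines.append("</urlset>")
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    sample = [("INFRA-2372", datetime(2009, 12, 10, 19, 20, 19, tzinfo=timezone.utc))]
    print(build_urlset(sample))
```

Crawlers can compare `<lastmod>` with their last visit and skip unchanged issues, which is how a sitemap could reduce the bandwidth draw without blocking the robots.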



-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Updated: (INFRA-2372) Produce sitemaps for services on Brutus

Posted by "Henri Yandell (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/INFRA-2372?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Henri Yandell updated INFRA-2372:
---------------------------------

    Component/s: Bugzilla

Adding BZ back on. I think this was lost in the security migration.

I have copies of the JIRA sitemap code, but not the bugzilla sitemap code. 

Looking on thor, the BZ sitemaps are there but haven't been updated since Apr 18. The JIRA sitemaps don't seem to have made it there.


[jira] Commented: (INFRA-2372) Produce sitemaps for services on Brutus

Posted by "Mark Thomas (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/INFRA-2372?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12788845#action_12788845 ] 

Mark Thomas commented on INFRA-2372:
------------------------------------

BZ already has sitemaps configured for SA and the main instance.


[jira] Commented: (INFRA-2372) Produce sitemaps for services on Brutus

Posted by "Henri Yandell (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/INFRA-2372?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12794393#action_12794393 ] 

Henri Yandell commented on INFRA-2372:
--------------------------------------

First attempt done for the JIRAs.

On Brutus:

/home/jira/sitemap/sitemaps

Assuming they look good, going live would be:

a) Switching the output directory so that the files are generated from a crontab into http://issues.apache.org/sitemaps/ (i.e. the bugzilla content dir)
b) Adding "Sitemap: http://issues.apache.org/sitemaps/jira_sitemap_index.xml" to the robots.txt

That assumes that we're happy having two indexes in robots.txt. 
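
To make the two steps above concrete, here is a minimal sketch of a sitemap index plus the matching robots.txt lines. The sitemap protocol allows more than one `Sitemap:` line in robots.txt, so two indexes can coexist. The function names are made up for this example, and the Bugzilla index file name is a placeholder, not taken from this thread.

```python
from xml.sax.saxutils import escape

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"


def build_sitemap_index(sitemap_urls):
    """Return a sitemap index document that points at each sitemap file."""
    lines = ['<?xml version="1.0" encoding="UTF-8"?>',
             f'<sitemapindex xmlns="{SITEMAP_NS}">']
    for url in sitemap_urls:
        lines.append(f"  <sitemap><loc>{escape(url)}</loc></sitemap>")
    lines.append("</sitemapindex>")
    return "\n".join(lines) + "\n"


def robots_sitemap_lines(index_urls):
    """Return one 'Sitemap:' line per index URL for robots.txt."""
    return [f"Sitemap: {url}" for url in index_urls]


if __name__ == "__main__":
    base = "http://issues.apache.org/sitemaps/"
    print(build_sitemap_index([base + "sitemap_jira_1.xml.gz",
                               base + "sitemap_jira_2.xml.gz"]))
    # The first name is a placeholder for whatever the Bugzilla index is called.
    print("\n".join(robots_sitemap_lines([base + "bugzilla_sitemap_index.xml",
                                          base + "jira_sitemap_index.xml"])))
```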


[jira] Resolved: (INFRA-2372) Produce sitemaps for services on Brutus

Posted by "Jeff Turner (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/INFRA-2372?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jeff Turner resolved INFRA-2372.
--------------------------------

    Resolution: Fixed

Fixed, r778959


[jira] Commented: (INFRA-2372) Produce sitemaps for services on Brutus

Posted by "Henri Yandell (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/INFRA-2372?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12794792#action_12794792 ] 

Henri Yandell commented on INFRA-2372:
--------------------------------------

JIRA sitemapping now hooked in to the bugzilla sitemaps.

Google Webmaster Tools validates the changes: the submitted URL count goes up from 38k to 187k. The indexed numbers are still pending.


[jira] Commented: (INFRA-2372) Produce sitemaps for services on Brutus

Posted by "Henri Yandell (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/INFRA-2372?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12888186#action_12888186 ] 

Henri Yandell commented on INFRA-2372:
--------------------------------------

Code for JIRA copied onto thor (jira user). Needs work as the psql cmd currently barfs. Fun Solaris I assume :)


[jira] Commented: (INFRA-2372) Produce sitemaps for services on Brutus

Posted by "Henri Yandell (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/INFRA-2372?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12795018#action_12795018 ] 

Henri Yandell commented on INFRA-2372:
--------------------------------------

Modified robots.txt

/jira/secure/attachment -> /jira/secure

This stops Google from looking at attachment pages and search result pages, which seem to be of little value to a user.

Also now disallowing roller/ and cayenne/, as those JIRAs no longer exist. Added a Disallow on /jira/browse/*?page= to stop duplicates there.
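
The effect of the two rules quoted above can be checked with a small matcher. This is a simplified sketch: it handles only Disallow rules, treats `*` as "any run of characters" and a trailing `$` as an end anchor (the wildcard extension Googlebot documents), and ignores Allow rules and rule precedence. The function names and sample paths are illustrative.

```python
import re


def rule_to_regex(rule):
    """Translate a robots.txt path rule into a compiled regular expression."""
    anchored = rule.endswith("$")
    if anchored:
        rule = rule[:-1]
    # Escape the literal parts and let each '*' match any run of characters.
    pattern = ".*".join(re.escape(part) for part in rule.split("*"))
    return re.compile(pattern + ("$" if anchored else ""))


def is_disallowed(path, disallow_rules):
    """True if any Disallow rule matches from the start of the path."""
    return any(rule_to_regex(rule).match(path) for rule in disallow_rules)


RULES = ["/jira/secure", "/jira/browse/*?page="]

if __name__ == "__main__":
    samples = [
        "/jira/secure/attachment/1/example.patch",   # illustrative path
        "/jira/secure/IssueNavigator.jspa",          # illustrative path
        "/jira/browse/INFRA-2372",
        "/jira/browse/INFRA-2372?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel",
    ]
    for path in samples:
        print(is_disallowed(path, RULES), path)
```

Because `/jira/secure` is a prefix rule, it covers the old `/jira/secure/attachment` rule as well as everything else under that path, while plain issue pages under `/jira/browse/` stay crawlable.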


[jira] Updated: (INFRA-2372) Produce sitemaps for services on Brutus

Posted by "Mark Thomas (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/INFRA-2372?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Mark Thomas updated INFRA-2372:
-------------------------------

    Component/s:     (was: Bugzilla)

Removing BZ since it already has this.


[jira] Assigned: (INFRA-2372) Produce sitemaps for services on Brutus

Posted by "Jeff Turner (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/INFRA-2372?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jeff Turner reassigned INFRA-2372:
----------------------------------

    Assignee: Jeff Turner


[jira] Commented: (INFRA-2372) Produce sitemaps for services on Brutus

Posted by "Henri Yandell (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/INFRA-2372?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12795260#action_12795260 ] 

Henri Yandell commented on INFRA-2372:
--------------------------------------

Total: 187,812, Indexed: 131,583

Sitemap                              Status  Type     Date          Submitted  Indexed
sitemaps/sitemap_activemq_1.xml.gz   OK      Sitemap  Dec 30, 2009      8,252    7,857
sitemaps/sitemap_bugs.xml.gz         OK      Sitemap  Dec 29, 2009     31,955   12,347
sitemaps/sitemap_jira_1.xml.gz       OK      Sitemap  Dec 29, 2009     50,000   32,110
sitemaps/sitemap_jira_2.xml.gz       OK      Sitemap  Dec 29, 2009     50,000   39,346
sitemaps/sitemap_jira_3.xml.gz       OK      Sitemap  Dec 29, 2009     33,634   30,583
sitemaps/sitemap_sabugs.xml.gz       OK      Sitemap  Dec 29, 2009      6,266    3,624
sitemaps/sitemap_struts_1.xml.gz     OK      Sitemap  Dec 29, 2009      7,705    5,716

Mark had suggested that the non-indexed items were the older ones; I think this makes sense. If you view the main JIRA sitemaps, the percentage covered goes up as they get newer.
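
The coverage trend can be checked with a little arithmetic on the figures reported above. The sketch assumes the last two numbers on each row are the submitted and indexed counts, which is consistent with the stated totals; the `coverage` helper is introduced only for this example.

```python
# (submitted, indexed) counts copied from the figures reported above.
SITEMAPS = {
    "sitemap_activemq_1.xml.gz": (8_252, 7_857),
    "sitemap_bugs.xml.gz": (31_955, 12_347),
    "sitemap_jira_1.xml.gz": (50_000, 32_110),
    "sitemap_jira_2.xml.gz": (50_000, 39_346),
    "sitemap_jira_3.xml.gz": (33_634, 30_583),
    "sitemap_sabugs.xml.gz": (6_266, 3_624),
    "sitemap_struts_1.xml.gz": (7_705, 5_716),
}


def coverage(submitted, indexed):
    """Percentage of submitted URLs that have been indexed."""
    return 100.0 * indexed / submitted


total_submitted = sum(submitted for submitted, _ in SITEMAPS.values())
total_indexed = sum(indexed for _, indexed in SITEMAPS.values())

if __name__ == "__main__":
    for name, (submitted, indexed) in SITEMAPS.items():
        print(f"{name}: {coverage(submitted, indexed):.1f}%")
    print(f"overall: {coverage(total_submitted, total_indexed):.1f}%")
```

For the three main JIRA files this gives roughly 64%, 79% and 91%, so coverage does rise from the first file to the third.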


[jira] Updated: (INFRA-2372) Produce sitemaps for services on Brutus

Posted by "Mark Thomas (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/INFRA-2372?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Mark Thomas updated INFRA-2372:
-------------------------------

    Component/s:     (was: Bugzilla)

BZ has been fixed for a while now.


[jira] Commented: (INFRA-2372) Produce sitemaps for services on Brutus

Posted by "Henri Yandell (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/INFRA-2372?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12794793#action_12794793 ] 

Henri Yandell commented on INFRA-2372:
--------------------------------------

I suspect we need more Disallow rules in robots.txt, for example for IssueNavigator and ConfigureReport.


[jira] Commented: (INFRA-2372) Produce sitemaps for services on Brutus

Posted by "Henri Yandell (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/INFRA-2372?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12847832#action_12847832 ] 

Henri Yandell commented on INFRA-2372:
--------------------------------------

Note, we're about to start our 4th jira sitemap file. Need to double check at some point that that rollover works happily.
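
For reference, the rollover could look something like the sketch below. The sitemap protocol caps a single file at 50,000 URLs, so a new numbered file starts once the current one is full. This is not the script running on Brutus; `write_sitemaps`, `chunked` and the default `prefix` are assumptions for this example.

```python
import gzip
import tempfile
from itertools import islice
from pathlib import Path

MAX_URLS_PER_FILE = 50_000  # per-file limit set by the sitemap protocol


def chunked(items, size):
    """Yield successive lists of at most `size` items."""
    iterator = iter(items)
    while True:
        chunk = list(islice(iterator, size))
        if not chunk:
            return
        yield chunk


def write_sitemaps(entries, out_dir, prefix="sitemap_jira", limit=MAX_URLS_PER_FILE):
    """Write gzip-compressed sitemap files and return their paths.

    `entries` is an iterable of pre-rendered <url>...</url> strings.
    """
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    paths = []
    for number, chunk in enumerate(chunked(entries, limit), start=1):
        path = out_dir / f"{prefix}_{number}.xml.gz"
        with gzip.open(path, "wt", encoding="utf-8") as handle:
            handle.write('<?xml version="1.0" encoding="UTF-8"?>\n')
            handle.write('<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">\n')
            handle.writelines(entry + "\n" for entry in chunk)
            handle.write("</urlset>\n")
        paths.append(path)
    return paths


if __name__ == "__main__":
    # Small limit so the rollover is visible with only a few entries.
    demo = [f"<url><loc>https://example.org/{n}</loc></url>" for n in range(7)]
    with tempfile.TemporaryDirectory() as tmp:
        print([p.name for p in write_sitemaps(demo, tmp, limit=3)])
```

Running the generator with a small `limit` is a cheap way to exercise the rollover before the real file count changes.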

I also just removed the struts sitemap file, given its jira is now gone.


[jira] Commented: (INFRA-2372) Produce sitemaps for services on Brutus

Posted by "Henri Yandell (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/INFRA-2372?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12793547#action_12793547 ] 

Henri Yandell commented on INFRA-2372:
--------------------------------------

For JIRA it's basically:

    select pkey, updated from jiraissue;

Need to add the text to output the timestamp correctly. Also need to build the XML, and then split the large file into multiple 10M files. Could use the updated column as an optimization to avoid rebuilding everything every time: the first time, sort by updated; after that, updates could be handled by a where clause.
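
A minimal sketch of the initial query plus the where-clause variant follows. It uses an in-memory SQLite table as a stand-in for the real JIRA database, and assumes `updated` is stored as 'YYYY-MM-DD HH:MM:SS' text in UTC; the issue keys, `fetch_issues` and `to_w3c` are made up for this example.

```python
import sqlite3
from datetime import datetime, timezone


def to_w3c(timestamp_text):
    """Convert 'YYYY-MM-DD HH:MM:SS' (assumed UTC) to a W3C datetime string."""
    parsed = datetime.strptime(timestamp_text, "%Y-%m-%d %H:%M:%S")
    return parsed.replace(tzinfo=timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")


def fetch_issues(conn, since=None):
    """Return (pkey, lastmod) rows ordered by update time.

    With `since` set, only rows updated after that timestamp are returned.
    """
    sql = "SELECT pkey, updated FROM jiraissue"
    params = ()
    if since is not None:
        sql += " WHERE updated > ?"
        params = (since,)
    sql += " ORDER BY updated"
    return [(pkey, to_w3c(updated)) for pkey, updated in conn.execute(sql, params)]


if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")  # stand-in for the real database
    conn.execute("CREATE TABLE jiraissue (pkey TEXT, updated TEXT)")
    conn.executemany("INSERT INTO jiraissue VALUES (?, ?)", [
        ("DEMO-1", "2009-12-01 08:00:00"),  # hypothetical rows
        ("DEMO-2", "2009-12-20 09:30:00"),
    ])
    print(fetch_issues(conn))                               # full build
    print(fetch_issues(conn, since="2009-12-10 00:00:00"))  # incremental run
```

Remembering the largest `updated` value from the previous run and passing it as `since` would give the incremental behaviour described above.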

Feels like there should be a tool to do this - plug in the initial SQL and the SQL with where clause and away it goes. Anyone know of such a thing?
