Posted to commits@community.apache.org by se...@apache.org on 2015/07/13 01:02:23 UTC
svn commit: r1690548 - in /comdev/projects.apache.org/scripts: README.txt
cronjobs/parsereleases.py
Author: sebb
Date: Sun Jul 12 23:02:22 2015
New Revision: 1690548
URL: http://svn.apache.org/r1690548
Log:
EOL
Modified:
comdev/projects.apache.org/scripts/README.txt (contents, props changed)
comdev/projects.apache.org/scripts/cronjobs/parsereleases.py (contents, props changed)
Modified: comdev/projects.apache.org/scripts/README.txt
URL: http://svn.apache.org/viewvc/comdev/projects.apache.org/scripts/README.txt?rev=1690548&r1=1690547&r2=1690548&view=diff
==============================================================================
--- comdev/projects.apache.org/scripts/README.txt (original)
+++ comdev/projects.apache.org/scripts/README.txt Sun Jul 12 23:02:22 2015
@@ -1,50 +1,50 @@
-This directory contains Python 3 scripts for both importing and updating data from
-various sources:
-
-1. updating data (cronjobs)
-
-- countaccounts.py: Extracts monthly statistics from LDAP on Unix accounts created
- in: site/json/foundation/accounts-evolution.json + ldapsearch
- out: site/json/foundation/accounts-evolution.json
-
-- parsechairs.py: Fetches current VPs from the foundation website.
- in: http://www.apache.org/foundation/
- out: site/json/foundation/chairs.json
-
-- parsecommitters.py: Fetches and parses the committer (LDAP) list via
- people.apache.org.
- in: http://people.apache.org/committer-index.html
- out: site/json/foundation/people.json + site/json/foundation/groups.json
- List of committers with reference to groups (people.json) and groups with corresponding committers (groups.json)
-
-- podlings.py: Reads podlings.xml from the incubator site and creates a JSON
- with history data, as well as current podling projects information.
- in: http://incubator.apache.org/podlings.xml
- out: site/json/foundation/podlings.json + site/json/foundation/podlings-history.json
- Current list of podlings (podlings.json) and ended podlings (podlings-history.json)
-
-- parsereleases.py: Parses the file listing under http://www.apache.org/dist/ to extract release data.
- in: http://www.apache.org/dist/
- out: site/json/foundation/releases.json
- + site/json/foundation/releases-files.json
-
-
-2. importing data (import)
-
-- parsecommittees.py: Parses committee-info.txt to detect new and retired committees and imports PMC data (RDF) from
- PMC data files
- in: site/json/foundation/committees.json + site/json/foundation/committees-retired.json
- + data/board/committee-info.txt (https://svn.apache.org/repos/private/committers/board/committee-info.txt)
- + data/committees.xml + PMC data data/committees/*.rdf
- out: site/json/foundation/committees.json + site/json/foundation/committees-retired.json + site/json/foundation/pmcs.json
- + site/doap/{committeeId}/pmc-doap.rdf + site/doap/{committeeId}/pmc.rdf
-
-- parseprojects.py: Parses existing projects' RDF (DOAP) files and turns them into JSON objects.
- in: data/projects.xml + projects' DOAP files
- out: site/json/projects/*.json + site/json/foundation/projects.json
- + site/doap/{committeeId}/{project}.rdf
-
-NOTICE: what prevents the import scripts from being added to cron?
-1. parsecommittees.py requires committee-info.txt, which is not available on project-vm (it requires authentication)
-2. both scripts not only update files but sometimes need to add new files (new committees or new projects) or move
- files (projects going to the Attic, or retired committees)
+This directory contains Python 3 scripts for both importing and updating data from
+various sources:
+
+1. updating data (cronjobs)
+
+- countaccounts.py: Extracts monthly statistics from LDAP on Unix accounts created
+ in: site/json/foundation/accounts-evolution.json + ldapsearch
+ out: site/json/foundation/accounts-evolution.json
+
+- parsechairs.py: Fetches current VPs from the foundation website.
+ in: http://www.apache.org/foundation/
+ out: site/json/foundation/chairs.json
+
+- parsecommitters.py: Fetches and parses the committer (LDAP) list via
+ people.apache.org.
+ in: http://people.apache.org/committer-index.html
+ out: site/json/foundation/people.json + site/json/foundation/groups.json
+ List of committers with reference to groups (people.json) and groups with corresponding committers (groups.json)
+
+- podlings.py: Reads podlings.xml from the incubator site and creates a JSON
+ with history data, as well as current podling projects information.
+ in: http://incubator.apache.org/podlings.xml
+ out: site/json/foundation/podlings.json + site/json/foundation/podlings-history.json
+ Current list of podlings (podlings.json) and ended podlings (podlings-history.json)
+
+- parsereleases.py: Parses the file listing under http://www.apache.org/dist/ to extract release data.
+ in: http://www.apache.org/dist/
+ out: site/json/foundation/releases.json
+ + site/json/foundation/releases-files.json
+
+
+2. importing data (import)
+
+- parsecommittees.py: Parses committee-info.txt to detect new and retired committees and imports PMC data (RDF) from
+ PMC data files
+ in: site/json/foundation/committees.json + site/json/foundation/committees-retired.json
+ + data/board/committee-info.txt (https://svn.apache.org/repos/private/committers/board/committee-info.txt)
+ + data/committees.xml + PMC data data/committees/*.rdf
+ out: site/json/foundation/committees.json + site/json/foundation/committees-retired.json + site/json/foundation/pmcs.json
+ + site/doap/{committeeId}/pmc-doap.rdf + site/doap/{committeeId}/pmc.rdf
+
+- parseprojects.py: Parses existing projects' RDF (DOAP) files and turns them into JSON objects.
+ in: data/projects.xml + projects' DOAP files
+ out: site/json/projects/*.json + site/json/foundation/projects.json
+ + site/doap/{committeeId}/{project}.rdf
+
+NOTICE: what prevents the import scripts from being added to cron?
+1. parsecommittees.py requires committee-info.txt, which is not available on project-vm (it requires authentication)
+2. both scripts not only update files but sometimes need to add new files (new committees or new projects) or move
+ files (projects going to the Attic, or retired committees)
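The people/groups relationship that parsecommitters.py emits (committers with their groups in people.json, groups with their members in groups.json) is one mapping plus its inverse. A minimal sketch of that inversion, assuming a simplified {committer: [group, ...]} shape (the real JSON schema may differ):

```python
def invert_people_to_groups(people):
    """Turn {committer: [group, ...]} into {group: [committer, ...]}."""
    groups = {}
    for committer, member_of in people.items():
        for group in member_of:
            groups.setdefault(group, []).append(committer)
    for members in groups.values():
        members.sort()  # deterministic output, like sort_keys for json.dumps
    return groups

# Fabricated sample data, for illustration only
people = {
    "alice": ["httpd", "tomcat"],
    "bob": ["tomcat"],
}
print(invert_people_to_groups(people))
# {'httpd': ['alice'], 'tomcat': ['alice', 'bob']}
```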
Propchange: comdev/projects.apache.org/scripts/README.txt
------------------------------------------------------------------------------
svn:eol-style = native
Modified: comdev/projects.apache.org/scripts/cronjobs/parsereleases.py
URL: http://svn.apache.org/viewvc/comdev/projects.apache.org/scripts/cronjobs/parsereleases.py?rev=1690548&r1=1690547&r2=1690548&view=diff
==============================================================================
--- comdev/projects.apache.org/scripts/cronjobs/parsereleases.py (original)
+++ comdev/projects.apache.org/scripts/cronjobs/parsereleases.py Sun Jul 12 23:02:22 2015
@@ -1,104 +1,104 @@
-import re, urllib.request
-import json
-import os
-
-"""
-Reads the list of files in http://www.apache.org/dist/
-
-Creates:
-../../site/json/foundation/releases.json
-../../site/json/foundation/releases-files.json
-
-TODO: it would probably be more efficient to parse the output of
-svn ls -R https://dist.apache.org/repos/dist/release/
-
-"""
-
-releases = {}
-files = {}
-mainurl = "http://www.apache.org/dist/"
-
-x = 0
-
-# don't try to maintain history for the moment...
-#try:
-# with open("../../site/json/foundation/releases.json") as f:
-# releases = json.loads(f.read())
-# f.close()
-#except Exception as err:
-# print("Could not read releases.json, assuming blank slate")
-
-def getDirList(url):
- try:
- data = urllib.request.urlopen(url).read().decode('utf-8')
- for entry, xd, xdate in re.findall(r"<a href=\"([^\"/]+)(/?)\">.+</a>\s+(\d\d\d\d-\d\d-\d\d)", data, re.MULTILINE | re.UNICODE):
- yield(entry, xdate, xd)
- except Exception:
- pass
-
-def cleanFilename(filename):
- for suffix in ['.tgz', '.gz', '.bz2', '.xz', '.zip', '.rar', '.tar', 'tar', '.deb', '.rpm', '.dmg', '.egg', '.gem', '.pom', '.war', '.exe',
- '-scala2.11', '-cdh4', '-hadoop1', '-hadoop2', '-hadoop2.3', '-hadoop2.4', '-all',
- '-src', '_src', '.src', '-sources', '_sources', '-source', '-bin', '-dist',
- '-source-release', '-source-relase', '-apidocs', '-javadocs', '-javadoc', '_javadoc', '-tests', '-test', '-debug', '-uber',
- '-macosx', '-distribution', '-example', '-manual', '-native', '-win', '-win32', '-linux', '-pack', '-packaged', '-lib', '-current', '-embedded',
- '-py', '-py2', '-py2.6', '-py2.7', '-no', 'unix-distro', 'windows-distro', 'with', '-dep', '-standalone', '-war', '-webapp', '-dom', '-om', '-manual', '-site',
- '-32bit', '-64bit', '-amd64', '-i386', '_i386', '.i386', '-x86_64', '-minimal', '-jettyconfig', '-py2.py3-none-any', 'newkey', 'oldkey', 'jars', '-jre13', '-hadoop1', '-hadoop2', '-project',
- '-with-dependencies', '-client', '-server', '-doc', '-docs', 'server-webapps', '-full', '-all', '-standard', '-for-javaee', '-for-tomcat',
- 'hadoop1-scala2', '-deployer', '-fulldocs', '-windows-i64', '-windows-x64', '-embed', '-apps', '-app', '-ref', '-installer', '-bundle', '-java']:
- if filename[len(filename)-len(suffix):] == suffix:
- filename = filename[0:len(filename)-len(suffix)]
- for repl in ['-assembly-', '-minimal-', '-doc-', '-src-', '-webapp-', '-standalone-', '-parent-', '-project-', '-win32-']:
- filename = filename.replace(repl, '-')
- return filename
-
-def cleanReleases(committeeId):
- if len(releases[committeeId]) == 0:
- del releases[committeeId]
- del files[committeeId]
-
-def parseDir(committeeId, path):
- print(" %s..." % path)
- if len(path) > 100:
- print("WARN too long path: recursion?")
- return
- for f, d, xd in getDirList("%s/%s" % (mainurl, path)):
- if xd:
- if ("/%s" % f) not in path and f.lower() not in ['binaries', 'repos', 'updatesite', 'current', 'stable', 'stable1', 'stable2', 'binary', 'notes', 'doc', 'eclipse', 'patches', 'docs', 'changes', 'features', 'tmp', 'cpp', 'php', 'ruby', 'py', 'py3', 'issuesfixed', 'images', 'styles', 'wikipages']:
- parseDir(committeeId, "%s/%s" % (path, f))
- elif not re.search(r"(MD5SUM|SHA1SUM|\.md5|\.mds|\.sh1|\.sh2|\.sha|\.asc|\.sig|\.bin|\.pom|\.jar|\.whl|\.pdf|\.xml|\.xsd|\.html|\.txt|\.cfg|\.ish|\.pl|RELEASE.NOTES|LICENSE|KEYS|CHANGELOG|NOTICE|MANIFEST|Changes|readme|x86|amd64|-manual\.|-docs\.|-docs-|-doc-|Announcement|current|-deps|-dependencies|binary|-bin-|-bin\.|-javadoc-|-distro|rat_report)", f, flags=re.IGNORECASE):
- filename = cleanFilename(f)
- if len(filename) > 1:
- if filename not in releases[committeeId]:
- releases[committeeId][filename] = d
- files[committeeId][filename] = []
- print(" - %s\t\t\t%s" % (filename, f))
- files[committeeId][filename].append("%s/%s" % (path, f))
-
-
-for committeeId, d, xdir in getDirList(mainurl):
- if committeeId != 'incubator':
- if committeeId not in ['xml', 'zzz', 'maven-repository']:
- print("Parsing /dist/%s content:" % committeeId)
- releases[committeeId] = releases[committeeId] if committeeId in releases else {}
- files[committeeId] = {}
- parseDir(committeeId, committeeId)
- cleanReleases(committeeId)
- else:
- for podling, d, xd in getDirList("%s/incubator/" % mainurl):
- print("Parsing /dist/incubator-%s content:" % podling)
- committeeId = "incubator-%s" % podling
- releases[committeeId] = releases[committeeId] if committeeId in releases else {}
- files[committeeId] = {}
- parseDir(committeeId, "incubator/%s" % podling)
- cleanReleases(committeeId)
-
-print("Writing releases.json")
-with open("../../site/json/foundation/releases.json", "w") as f:
- f.write(json.dumps(releases, sort_keys=True, indent=0))
- f.close()
-with open("../../site/json/foundation/releases-files.json", "w") as f:
- f.write(json.dumps(files, sort_keys=True, indent=0))
- f.close()
-
+import re, urllib.request
+import json
+import os
+
+"""
+Reads the list of files in http://www.apache.org/dist/
+
+Creates:
+../../site/json/foundation/releases.json
+../../site/json/foundation/releases-files.json
+
+TODO: it would probably be more efficient to parse the output of
+svn ls -R https://dist.apache.org/repos/dist/release/
+
+"""
+
+releases = {}
+files = {}
+mainurl = "http://www.apache.org/dist/"
+
+x = 0
+
+# don't try to maintain history for the moment...
+#try:
+# with open("../../site/json/foundation/releases.json") as f:
+# releases = json.loads(f.read())
+# f.close()
+#except Exception as err:
+# print("Could not read releases.json, assuming blank slate")
+
+def getDirList(url):
+ try:
+ data = urllib.request.urlopen(url).read().decode('utf-8')
+ for entry, xd, xdate in re.findall(r"<a href=\"([^\"/]+)(/?)\">.+</a>\s+(\d\d\d\d-\d\d-\d\d)", data, re.MULTILINE | re.UNICODE):
+ yield(entry, xdate, xd)
+ except Exception:
+ pass
+
+def cleanFilename(filename):
+ for suffix in ['.tgz', '.gz', '.bz2', '.xz', '.zip', '.rar', '.tar', 'tar', '.deb', '.rpm', '.dmg', '.egg', '.gem', '.pom', '.war', '.exe',
+ '-scala2.11', '-cdh4', '-hadoop1', '-hadoop2', '-hadoop2.3', '-hadoop2.4', '-all',
+ '-src', '_src', '.src', '-sources', '_sources', '-source', '-bin', '-dist',
+ '-source-release', '-source-relase', '-apidocs', '-javadocs', '-javadoc', '_javadoc', '-tests', '-test', '-debug', '-uber',
+ '-macosx', '-distribution', '-example', '-manual', '-native', '-win', '-win32', '-linux', '-pack', '-packaged', '-lib', '-current', '-embedded',
+ '-py', '-py2', '-py2.6', '-py2.7', '-no', 'unix-distro', 'windows-distro', 'with', '-dep', '-standalone', '-war', '-webapp', '-dom', '-om', '-manual', '-site',
+ '-32bit', '-64bit', '-amd64', '-i386', '_i386', '.i386', '-x86_64', '-minimal', '-jettyconfig', '-py2.py3-none-any', 'newkey', 'oldkey', 'jars', '-jre13', '-hadoop1', '-hadoop2', '-project',
+ '-with-dependencies', '-client', '-server', '-doc', '-docs', 'server-webapps', '-full', '-all', '-standard', '-for-javaee', '-for-tomcat',
+ 'hadoop1-scala2', '-deployer', '-fulldocs', '-windows-i64', '-windows-x64', '-embed', '-apps', '-app', '-ref', '-installer', '-bundle', '-java']:
+ if filename[len(filename)-len(suffix):] == suffix:
+ filename = filename[0:len(filename)-len(suffix)]
+ for repl in ['-assembly-', '-minimal-', '-doc-', '-src-', '-webapp-', '-standalone-', '-parent-', '-project-', '-win32-']:
+ filename = filename.replace(repl, '-')
+ return filename
+
+def cleanReleases(committeeId):
+ if len(releases[committeeId]) == 0:
+ del releases[committeeId]
+ del files[committeeId]
+
+def parseDir(committeeId, path):
+ print(" %s..." % path)
+ if len(path) > 100:
+ print("WARN too long path: recursion?")
+ return
+ for f, d, xd in getDirList("%s/%s" % (mainurl, path)):
+ if xd:
+ if ("/%s" % f) not in path and f.lower() not in ['binaries', 'repos', 'updatesite', 'current', 'stable', 'stable1', 'stable2', 'binary', 'notes', 'doc', 'eclipse', 'patches', 'docs', 'changes', 'features', 'tmp', 'cpp', 'php', 'ruby', 'py', 'py3', 'issuesfixed', 'images', 'styles', 'wikipages']:
+ parseDir(committeeId, "%s/%s" % (path, f))
+ elif not re.search(r"(MD5SUM|SHA1SUM|\.md5|\.mds|\.sh1|\.sh2|\.sha|\.asc|\.sig|\.bin|\.pom|\.jar|\.whl|\.pdf|\.xml|\.xsd|\.html|\.txt|\.cfg|\.ish|\.pl|RELEASE.NOTES|LICENSE|KEYS|CHANGELOG|NOTICE|MANIFEST|Changes|readme|x86|amd64|-manual\.|-docs\.|-docs-|-doc-|Announcement|current|-deps|-dependencies|binary|-bin-|-bin\.|-javadoc-|-distro|rat_report)", f, flags=re.IGNORECASE):
+ filename = cleanFilename(f)
+ if len(filename) > 1:
+ if filename not in releases[committeeId]:
+ releases[committeeId][filename] = d
+ files[committeeId][filename] = []
+ print(" - %s\t\t\t%s" % (filename, f))
+ files[committeeId][filename].append("%s/%s" % (path, f))
+
+
+for committeeId, d, xdir in getDirList(mainurl):
+ if committeeId != 'incubator':
+ if committeeId not in ['xml', 'zzz', 'maven-repository']:
+ print("Parsing /dist/%s content:" % committeeId)
+ releases[committeeId] = releases[committeeId] if committeeId in releases else {}
+ files[committeeId] = {}
+ parseDir(committeeId, committeeId)
+ cleanReleases(committeeId)
+ else:
+ for podling, d, xd in getDirList("%s/incubator/" % mainurl):
+ print("Parsing /dist/incubator-%s content:" % podling)
+ committeeId = "incubator-%s" % podling
+ releases[committeeId] = releases[committeeId] if committeeId in releases else {}
+ files[committeeId] = {}
+ parseDir(committeeId, "incubator/%s" % podling)
+ cleanReleases(committeeId)
+
+print("Writing releases.json")
+with open("../../site/json/foundation/releases.json", "w") as f:
+ f.write(json.dumps(releases, sort_keys=True, indent=0))
+ f.close()
+with open("../../site/json/foundation/releases-files.json", "w") as f:
+ f.write(json.dumps(files, sort_keys=True, indent=0))
+ f.close()
+
print("All done!")
\ No newline at end of file
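The regex in getDirList() can be exercised offline against a hand-written fragment of an Apache directory index page; the sample HTML below is fabricated for illustration. The second capture group (the optional trailing slash in the href) is what distinguishes directories from files:

```python
import re

# Fabricated sample of an Apache httpd auto-index listing
LISTING = '''
<a href="httpd/">httpd/</a>                 2015-07-10 12:00    -
<a href="zookeeper-3.4.6.tar.gz">zookeeper-3.4.6.tar.gz</a> 2014-02-20 09:11  17M
'''

# Same pattern as getDirList(): entry name, optional "/", and the date column
PATTERN = re.compile(
    r"<a href=\"([^\"/]+)(/?)\">.+</a>\s+(\d\d\d\d-\d\d-\d\d)",
    re.MULTILINE | re.UNICODE)

for entry, is_dir, date in PATTERN.findall(LISTING):
    kind = "dir " if is_dir else "file"
    print(kind, entry, date)
# dir  httpd 2015-07-10
# file zookeeper-3.4.6.tar.gz 2014-02-20
```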
Propchange: comdev/projects.apache.org/scripts/cronjobs/parsereleases.py
------------------------------------------------------------------------------
svn:eol-style = native
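The TODO in parsereleases.py suggests replacing per-directory HTML scraping with a single recursive listing from `svn ls -R https://dist.apache.org/repos/dist/release/`. A rough sketch, under stated assumptions, of how that output could be grouped per committee (the sample listing is fabricated; real `svn ls -R` output is one path per line, with directory entries ending in "/", and the real script would still need its filename filtering on top of this):

```python
# Fabricated stand-in for `svn ls -R` output
SAMPLE = """\
httpd/
httpd/httpd-2.4.16.tar.gz
httpd/KEYS
zookeeper/
zookeeper/zookeeper-3.4.6.tar.gz
"""

def group_by_committee(listing):
    """Map top-level directory (committee) -> list of file paths under it."""
    out = {}
    for line in listing.splitlines():
        if line.endswith("/"):        # directory entry, not a file
            continue
        committee, _, rest = line.partition("/")
        if rest:                      # skip any top-level files
            out.setdefault(committee, []).append(line)
    return out

print(group_by_committee(SAMPLE))
# {'httpd': ['httpd/httpd-2.4.16.tar.gz', 'httpd/KEYS'],
#  'zookeeper': ['zookeeper/zookeeper-3.4.6.tar.gz']}
```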