Posted to commits@community.apache.org by se...@apache.org on 2015/07/13 01:02:23 UTC
svn commit: r1690548 - in /comdev/projects.apache.org/scripts: README.txt
cronjobs/parsereleases.py
Author: sebb
Date: Sun Jul 12 23:02:22 2015
New Revision: 1690548
URL: http://svn.apache.org/r1690548
Log:
EOL
Modified:
comdev/projects.apache.org/scripts/README.txt (contents, props changed)
comdev/projects.apache.org/scripts/cronjobs/parsereleases.py (contents, props changed)
Modified: comdev/projects.apache.org/scripts/README.txt
URL: http://svn.apache.org/viewvc/comdev/projects.apache.org/scripts/README.txt?rev=1690548&r1=1690547&r2=1690548&view=diff
==============================================================================
--- comdev/projects.apache.org/scripts/README.txt (original)
+++ comdev/projects.apache.org/scripts/README.txt Sun Jul 12 23:02:22 2015
@@ -1,50 +1,50 @@
-This directory contains Python 3 scripts for both importing and updating data from
-various sources:
-
-1. updating data (cronjobs)
-
-- countaccounts.py: Extracts monthly statistics from LDAP on Unix accounts created
- in: site/json/foundation/accounts-evolution.json + ldapsearch
- out: site/json/foundation/accounts-evolution.json
-
-- parsechairs.py: Fetches current VPs from the foundation website.
- in: http://www.apache.org/foundation/
- out: site/json/foundation/chairs.json
-
-- parsecommitters.py: Fetches and parses the committer (LDAP) list via
- people.apache.org.
- in: http://people.apache.org/committer-index.html
- out: site/json/foundation/people.json + site/json/foundation/groups.json
- List of committers with reference to groups (people.json) and groups with corresponding committers (groups.json)
-
-- podlings.py: Reads podlings.xml from the incubator site and creates a JSON
- with history data, as well as current podling projects information.
- in: http://incubator.apache.org/podlings.xml
- out: site/json/foundation/podlings.json + site/json/foundation/podlings-history.json
- Current list of podlings (podlings.json) and ended podlings (podlings-history.json)
-
-- parsereleases.py: Parses the file listing under http://www.apache.org/dist/ to extract release data.
- in: http://www.apache.org/dist/
- out: site/json/foundation/releases.json
- + site/json/foundation/releases-files.json
-
-
-2. importing data (import)
-
-- parsecommittees.py: Parses committee-info.txt to detect new and retired committees and imports PMC data (RDF) from
- PMC data files
- in: site/json/foundation/committees.json + site/json/foundation/committees-retired.json
- + data/board/committee-info.txt (https://svn.apache.org/repos/private/committers/board/committee-info.txt)
- + data/committees.xml + PMC data data/committees/*.rdf
- out: site/json/foundation/committees.json + site/json/foundation/committees-retired.json + site/json/foundation/pmcs.json
- + site/doap/{committeeId}/pmc-doap.rdf + site/doap/{committeeId}/pmc.rdf
-
-- parseprojects.py: Parses existing projects' RDF (DOAP) files and turns them into JSON objects.
- in: data/projects.xml + projects' DOAP files
- out: site/json/projects/*.json + site/json/foundation/projects.json
- + site/doap/{committeeId}/{project}.rdf
-
-NOTICE: what prevents the import scripts from being added to cron?
-1. parsecommittees.py requires committee-info.txt, which is not available on project-vm (it requires authentication)
-2. both scripts not only update files but sometimes need to add new files (new committees or new projects) or move
- files (projects going to the Attic, or retired committees)
+This directory contains Python 3 scripts for both importing and updating data from
+various sources:
+
+1. updating data (cronjobs)
+
+- countaccounts.py: Extracts monthly statistics from LDAP on Unix accounts created
+ in: site/json/foundation/accounts-evolution.json + ldapsearch
+ out: site/json/foundation/accounts-evolution.json
+
+- parsechairs.py: Fetches current VPs from the foundation website.
+ in: http://www.apache.org/foundation/
+ out: site/json/foundation/chairs.json
+
+- parsecommitters.py: Fetches and parses the committer (LDAP) list via
+ people.apache.org.
+ in: http://people.apache.org/committer-index.html
+ out: site/json/foundation/people.json + site/json/foundation/groups.json
+ List of committers with reference to groups (people.json) and groups with corresponding committers (groups.json)
+
+- podlings.py: Reads podlings.xml from the incubator site and creates a JSON
+ with history data, as well as current podling projects information.
+ in: http://incubator.apache.org/podlings.xml
+ out: site/json/foundation/podlings.json + site/json/foundation/podlings-history.json
+ Current list of podlings (podlings.json) and ended podlings (podlings-history.json)
+
+- parsereleases.py: Parses the file listing under http://www.apache.org/dist/ to extract release data.
+ in: http://www.apache.org/dist/
+ out: site/json/foundation/releases.json
+ + site/json/foundation/releases-files.json
+
+
+2. importing data (import)
+
+- parsecommittees.py: Parses committee-info.txt to detect new and retired committees and imports PMC data (RDF) from
+ PMC data files
+ in: site/json/foundation/committees.json + site/json/foundation/committees-retired.json
+ + data/board/committee-info.txt (https://svn.apache.org/repos/private/committers/board/committee-info.txt)
+ + data/committees.xml + PMC data data/committees/*.rdf
+ out: site/json/foundation/committees.json + site/json/foundation/committees-retired.json + site/json/foundation/pmcs.json
+ + site/doap/{committeeId}/pmc-doap.rdf + site/doap/{committeeId}/pmc.rdf
+
+- parseprojects.py: Parses existing projects' RDF (DOAP) files and turns them into JSON objects.
+ in: data/projects.xml + projects' DOAP files
+ out: site/json/projects/*.json + site/json/foundation/projects.json
+ + site/doap/{committeeId}/{project}.rdf
+
+NOTICE: what prevents the import scripts from being added to cron?
+1. parsecommittees.py requires committee-info.txt, which is not available on project-vm (it requires authentication)
+2. both scripts not only update files but sometimes need to add new files (new committees or new projects) or move
+ files (projects going to the Attic, or retired committees)
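The people/groups relationship that parsecommitters.py emits (committers with their groups in people.json, groups with their members in groups.json) is one mapping plus its inverse. A minimal sketch of that inversion, assuming a simplified {committer: [group, ...]} shape (the real JSON schema may differ):

```python
def invert_people_to_groups(people):
    """Turn {committer: [group, ...]} into {group: [committer, ...]}."""
    groups = {}
    for committer, member_of in people.items():
        for group in member_of:
            groups.setdefault(group, []).append(committer)
    for members in groups.values():
        members.sort()  # deterministic output, like sort_keys for json.dumps
    return groups

# Fabricated sample data, for illustration only
people = {
    "alice": ["httpd", "tomcat"],
    "bob": ["tomcat"],
}
print(invert_people_to_groups(people))
# {'httpd': ['alice'], 'tomcat': ['alice', 'bob']}
```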
Propchange: comdev/projects.apache.org/scripts/README.txt
------------------------------------------------------------------------------
svn:eol-style = native
Modified: comdev/projects.apache.org/scripts/cronjobs/parsereleases.py
URL: http://svn.apache.org/viewvc/comdev/projects.apache.org/scripts/cronjobs/parsereleases.py?rev=1690548&r1=1690547&r2=1690548&view=diff
==============================================================================
--- comdev/projects.apache.org/scripts/cronjobs/parsereleases.py (original)
+++ comdev/projects.apache.org/scripts/cronjobs/parsereleases.py Sun Jul 12 23:02:22 2015
@@ -1,104 +1,104 @@
-import re, urllib.request
-import json
-import os
-
-"""
-Reads the list of files in http://www.apache.org/dist/
-
-Creates:
-../../site/json/foundation/releases.json
-../../site/json/foundation/releases-files.json
-
-TODO: it would probably be more efficient to parse the output of
-svn ls -R https://dist.apache.org/repos/dist/release/
-
-"""
-
-releases = {}
-files = {}
-mainurl = "http://www.apache.org/dist/"
-
-x = 0
-
-# don't try to maintain history for the moment...
-#try:
-# with open("../../site/json/foundation/releases.json") as f:
-# releases = json.loads(f.read())
-# f.close()
-#except Exception as err:
-# print("Could not read releases.json, assuming blank slate")
-
-def getDirList(url):
- try:
- data = urllib.request.urlopen(url).read().decode('utf-8')
- for entry, xd, xdate in re.findall(r"<a href=\"([^\"/]+)(/?)\">.+</a>\s+(\d\d\d\d-\d\d-\d\d)", data, re.MULTILINE | re.UNICODE):
- yield(entry, xdate, xd)
- except Exception:
- pass
-
-def cleanFilename(filename):
- for suffix in ['.tgz', '.gz', '.bz2', '.xz', '.zip', '.rar', '.tar', 'tar', '.deb', '.rpm', '.dmg', '.egg', '.gem', '.pom', '.war', '.exe',
- '-scala2.11', '-cdh4', '-hadoop1', '-hadoop2', '-hadoop2.3', '-hadoop2.4', '-all',
- '-src', '_src', '.src', '-sources', '_sources', '-source', '-bin', '-dist',
- '-source-release', '-source-relase', '-apidocs', '-javadocs', '-javadoc', '_javadoc', '-tests', '-test', '-debug', '-uber',
- '-macosx', '-distribution', '-example', '-manual', '-native', '-win', '-win32', '-linux', '-pack', '-packaged', '-lib', '-current', '-embedded',
- '-py', '-py2', '-py2.6', '-py2.7', '-no', 'unix-distro', 'windows-distro', 'with', '-dep', '-standalone', '-war', '-webapp', '-dom', '-om', '-manual', '-site',
- '-32bit', '-64bit', '-amd64', '-i386', '_i386', '.i386', '-x86_64', '-minimal', '-jettyconfig', '-py2.py3-none-any', 'newkey', 'oldkey', 'jars', '-jre13', '-hadoop1', '-hadoop2', '-project',
- '-with-dependencies', '-client', '-server', '-doc', '-docs', 'server-webapps', '-full', '-all', '-standard', '-for-javaee', '-for-tomcat',
- 'hadoop1-scala2', '-deployer', '-fulldocs', '-windows-i64', '-windows-x64', '-embed', '-apps', '-app', '-ref', '-installer', '-bundle', '-java']:
- if filename[len(filename)-len(suffix):] == suffix:
- filename = filename[0:len(filename)-len(suffix)]
- for repl in ['-assembly-', '-minimal-', '-doc-', '-src-', '-webapp-', '-standalone-', '-parent-', '-project-', '-win32-']:
- filename = filename.replace(repl, '-')
- return filename
-
-def cleanReleases(committeeId):
- if len(releases[committeeId]) == 0:
- del releases[committeeId]
- del files[committeeId]
-
-def parseDir(committeeId, path):
- print(" %s..." % path)
- if len(path) > 100:
- print("WARN too long path: recursion?")
- return
- for f, d, xd in getDirList("%s/%s" % (mainurl, path)):
- if xd:
- if ("/%s" % f) not in path and f.lower() not in ['binaries', 'repos', 'updatesite', 'current', 'stable', 'stable1', 'stable2', 'binary', 'notes', 'doc', 'eclipse', 'patches', 'docs', 'changes', 'features', 'tmp', 'cpp', 'php', 'ruby', 'py', 'py3', 'issuesfixed', 'images', 'styles', 'wikipages']:
- parseDir(committeeId, "%s/%s" % (path, f))
- elif not re.search(r"(MD5SUM|SHA1SUM|\.md5|\.mds|\.sh1|\.sh2|\.sha|\.asc|\.sig|\.bin|\.pom|\.jar|\.whl|\.pdf|\.xml|\.xsd|\.html|\.txt|\.cfg|\.ish|\.pl|RELEASE.NOTES|LICENSE|KEYS|CHANGELOG|NOTICE|MANIFEST|Changes|readme|x86|amd64|-manual\.|-docs\.|-docs-|-doc-|Announcement|current|-deps|-dependencies|binary|-bin-|-bin\.|-javadoc-|-distro|rat_report)", f, flags=re.IGNORECASE):
- filename = cleanFilename(f)
- if len(filename) > 1:
- if filename not in releases[committeeId]:
- releases[committeeId][filename] = d
- files[committeeId][filename] = []
- print(" - %s\t\t\t%s" % (filename, f))
- files[committeeId][filename].append("%s/%s" % (path, f))
-
-
-for committeeId, d, xdir in getDirList(mainurl):
- if committeeId != 'incubator':
- if committeeId not in ['xml', 'zzz', 'maven-repository']:
- print("Parsing /dist/%s content:" % committeeId)
- releases[committeeId] = releases[committeeId] if committeeId in releases else {}
- files[committeeId] = {}
- parseDir(committeeId, committeeId)
- cleanReleases(committeeId)
- else:
- for podling, d, xd in getDirList("%s/incubator/" % mainurl):
- print("Parsing /dist/incubator-%s content:" % podling)
- committeeId = "incubator-%s" % podling
- releases[committeeId] = releases[committeeId] if committeeId in releases else {}
- files[committeeId] = {}
- parseDir(committeeId, "incubator/%s" % podling)
- cleanReleases(committeeId)
-
-print("Writing releases.json")
-with open("../../site/json/foundation/releases.json", "w") as f:
- f.write(json.dumps(releases, sort_keys=True, indent=0))
- f.close()
-with open("../../site/json/foundation/releases-files.json", "w") as f:
- f.write(json.dumps(files, sort_keys=True, indent=0))
- f.close()
-
+import re, urllib.request
+import json
+import os
+
+"""
+Reads the list of files in http://www.apache.org/dist/
+
+Creates:
+../../site/json/foundation/releases.json
+../../site/json/foundation/releases-files.json
+
+TODO: it would probably be more efficient to parse the output of
+svn ls -R https://dist.apache.org/repos/dist/release/
+
+"""
+
+releases = {}
+files = {}
+mainurl = "http://www.apache.org/dist/"
+
+x = 0
+
+# don't try to maintain history for the moment...
+#try:
+# with open("../../site/json/foundation/releases.json") as f:
+# releases = json.loads(f.read())
+# f.close()
+#except Exception as err:
+# print("Could not read releases.json, assuming blank slate")
+
+def getDirList(url):
+ try:
+ data = urllib.request.urlopen(url).read().decode('utf-8')
+ for entry, xd, xdate in re.findall(r"<a href=\"([^\"/]+)(/?)\">.+</a>\s+(\d\d\d\d-\d\d-\d\d)", data, re.MULTILINE | re.UNICODE):
+ yield(entry, xdate, xd)
+ except Exception:
+ pass
+
+def cleanFilename(filename):
+ for suffix in ['.tgz', '.gz', '.bz2', '.xz', '.zip', '.rar', '.tar', 'tar', '.deb', '.rpm', '.dmg', '.egg', '.gem', '.pom', '.war', '.exe',
+ '-scala2.11', '-cdh4', '-hadoop1', '-hadoop2', '-hadoop2.3', '-hadoop2.4', '-all',
+ '-src', '_src', '.src', '-sources', '_sources', '-source', '-bin', '-dist',
+ '-source-release', '-source-relase', '-apidocs', '-javadocs', '-javadoc', '_javadoc', '-tests', '-test', '-debug', '-uber',
+ '-macosx', '-distribution', '-example', '-manual', '-native', '-win', '-win32', '-linux', '-pack', '-packaged', '-lib', '-current', '-embedded',
+ '-py', '-py2', '-py2.6', '-py2.7', '-no', 'unix-distro', 'windows-distro', 'with', '-dep', '-standalone', '-war', '-webapp', '-dom', '-om', '-manual', '-site',
+ '-32bit', '-64bit', '-amd64', '-i386', '_i386', '.i386', '-x86_64', '-minimal', '-jettyconfig', '-py2.py3-none-any', 'newkey', 'oldkey', 'jars', '-jre13', '-hadoop1', '-hadoop2', '-project',
+ '-with-dependencies', '-client', '-server', '-doc', '-docs', 'server-webapps', '-full', '-all', '-standard', '-for-javaee', '-for-tomcat',
+ 'hadoop1-scala2', '-deployer', '-fulldocs', '-windows-i64', '-windows-x64', '-embed', '-apps', '-app', '-ref', '-installer', '-bundle', '-java']:
+ if filename[len(filename)-len(suffix):] == suffix:
+ filename = filename[0:len(filename)-len(suffix)]
+ for repl in ['-assembly-', '-minimal-', '-doc-', '-src-', '-webapp-', '-standalone-', '-parent-', '-project-', '-win32-']:
+ filename = filename.replace(repl, '-')
+ return filename
+
+def cleanReleases(committeeId):
+ if len(releases[committeeId]) == 0:
+ del releases[committeeId]
+ del files[committeeId]
+
+def parseDir(committeeId, path):
+ print(" %s..." % path)
+ if len(path) > 100:
+ print("WARN too long path: recursion?")
+ return
+ for f, d, xd in getDirList("%s/%s" % (mainurl, path)):
+ if xd:
+ if ("/%s" % f) not in path and f.lower() not in ['binaries', 'repos', 'updatesite', 'current', 'stable', 'stable1', 'stable2', 'binary', 'notes', 'doc', 'eclipse', 'patches', 'docs', 'changes', 'features', 'tmp', 'cpp', 'php', 'ruby', 'py', 'py3', 'issuesfixed', 'images', 'styles', 'wikipages']:
+ parseDir(committeeId, "%s/%s" % (path, f))
+ elif not re.search(r"(MD5SUM|SHA1SUM|\.md5|\.mds|\.sh1|\.sh2|\.sha|\.asc|\.sig|\.bin|\.pom|\.jar|\.whl|\.pdf|\.xml|\.xsd|\.html|\.txt|\.cfg|\.ish|\.pl|RELEASE.NOTES|LICENSE|KEYS|CHANGELOG|NOTICE|MANIFEST|Changes|readme|x86|amd64|-manual\.|-docs\.|-docs-|-doc-|Announcement|current|-deps|-dependencies|binary|-bin-|-bin\.|-javadoc-|-distro|rat_report)", f, flags=re.IGNORECASE):
+ filename = cleanFilename(f)
+ if len(filename) > 1:
+ if filename not in releases[committeeId]:
+ releases[committeeId][filename] = d
+ files[committeeId][filename] = []
+ print(" - %s\t\t\t%s" % (filename, f))
+ files[committeeId][filename].append("%s/%s" % (path, f))
+
+
+for committeeId, d, xdir in getDirList(mainurl):
+ if committeeId != 'incubator':
+ if committeeId not in ['xml', 'zzz', 'maven-repository']:
+ print("Parsing /dist/%s content:" % committeeId)
+ releases[committeeId] = releases[committeeId] if committeeId in releases else {}
+ files[committeeId] = {}
+ parseDir(committeeId, committeeId)
+ cleanReleases(committeeId)
+ else:
+ for podling, d, xd in getDirList("%s/incubator/" % mainurl):
+ print("Parsing /dist/incubator-%s content:" % podling)
+ committeeId = "incubator-%s" % podling
+ releases[committeeId] = releases[committeeId] if committeeId in releases else {}
+ files[committeeId] = {}
+ parseDir(committeeId, "incubator/%s" % podling)
+ cleanReleases(committeeId)
+
+print("Writing releases.json")
+with open("../../site/json/foundation/releases.json", "w") as f:
+ f.write(json.dumps(releases, sort_keys=True, indent=0))
+ f.close()
+with open("../../site/json/foundation/releases-files.json", "w") as f:
+ f.write(json.dumps(files, sort_keys=True, indent=0))
+ f.close()
+
print("All done!")
\ No newline at end of file
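The regex in getDirList() can be exercised offline against a hand-written fragment of an Apache directory index page; the sample HTML below is fabricated for illustration. The second capture group (the optional trailing slash in the href) is what distinguishes directories from files:

```python
import re

# Fabricated sample of an Apache httpd auto-index listing
LISTING = '''
<a href="httpd/">httpd/</a>                 2015-07-10 12:00    -
<a href="zookeeper-3.4.6.tar.gz">zookeeper-3.4.6.tar.gz</a> 2014-02-20 09:11  17M
'''

# Same pattern as getDirList(): entry name, optional "/", and the date column
PATTERN = re.compile(
    r"<a href=\"([^\"/]+)(/?)\">.+</a>\s+(\d\d\d\d-\d\d-\d\d)",
    re.MULTILINE | re.UNICODE)

for entry, is_dir, date in PATTERN.findall(LISTING):
    kind = "dir " if is_dir else "file"
    print(kind, entry, date)
# dir  httpd 2015-07-10
# file zookeeper-3.4.6.tar.gz 2014-02-20
```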
Propchange: comdev/projects.apache.org/scripts/cronjobs/parsereleases.py
------------------------------------------------------------------------------
svn:eol-style = native
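The TODO in parsereleases.py suggests replacing per-directory HTML scraping with a single recursive listing from `svn ls -R https://dist.apache.org/repos/dist/release/`. A rough sketch, under stated assumptions, of how that output could be grouped per committee (the sample listing is fabricated; real `svn ls -R` output is one path per line, with directory entries ending in "/", and the real script would still need its filename filtering on top of this):

```python
# Fabricated stand-in for `svn ls -R` output
SAMPLE = """\
httpd/
httpd/httpd-2.4.16.tar.gz
httpd/KEYS
zookeeper/
zookeeper/zookeeper-3.4.6.tar.gz
"""

def group_by_committee(listing):
    """Map top-level directory (committee) -> list of file paths under it."""
    out = {}
    for line in listing.splitlines():
        if line.endswith("/"):        # directory entry, not a file
            continue
        committee, _, rest = line.partition("/")
        if rest:                      # skip any top-level files
            out.setdefault(committee, []).append(line)
    return out

print(group_by_committee(SAMPLE))
# {'httpd': ['httpd/httpd-2.4.16.tar.gz', 'httpd/KEYS'],
#  'zookeeper': ['zookeeper/zookeeper-3.4.6.tar.gz']}
```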