Posted to dev@geronimo.apache.org by Kevan Miller <ke...@gmail.com> on 2008/10/01 15:56:01 UTC

Re: Continuous TCK Testing

Not seeing too much progress here.  Has anyone dug up the Anthill- 
based code? I'll have a look.

--kevan

Re: Continuous TCK Testing

Posted by Jason Dillon <ja...@gmail.com>.
Sorry, been busy. I will write up some details tomorrow about this...
but it's late now and the sleep fairy is calling me.

--jason


On Oct 1, 2008, at 8:56 PM, Kevan Miller wrote:

>
> Not seeing too much progress here.  Has anyone dug up the Anthill- 
> based code? I'll have a look.
>
> --kevan


Re: Continuous TCK Testing

Posted by Jason Warner <ja...@gmail.com>.
Hi Jason,

Now that I've got my hands on anthill to play with, I'm diving into setting
up things.  I was hoping you'd be able to clarify for me a little bit how
the build-support stuff was integrated with the anthill stuff.
Specifically, I'm working on setting up a project in anthill to build the
geronimo server from the trunk in the repository.  This seems like a good
first step to me.  From what you specified in your explanation, it seems
that every step of this automated testing had an anthill project and a
corresponding groovy-based controller.  So for the geronimo build, I would
have an anthill project devoted to building the server which would help with
shuffling artifacts around, cleaning working directories, and other such
pre/post build things, but the actual build would be handled by the
controller?  So the anthill project would actually be launching the
controller rather than the build?

Thanks!

On Thu, Oct 2, 2008 at 2:34 PM, Jason Dillon <ja...@gmail.com> wrote:

> On Oct 1, 2008, at 11:20 PM, Jason Warner wrote:
>
>  Is the GBuild stuff in svn the same as the anthill-based code or is that
>> something different?  GBuild seems to have scripts for running tck and that
>> leads me to think they're the same thing, but I see no mention of anthill in
>> the code.
>>
>
> The Anthill stuff is completely different than the GBuild stuff.  I started
> out trying to get the TCK automated using GBuild, but decided that the
> system lacked too many features to perform as I desired, and went ahead with
> Anthill as it did pretty much everything, though had some stability
> problems.
>
> One of the main reasons why I chose Anthill (AHP, Anthill Pro that is) was
> its build agent and code repository systems.  This allowed me to ensure that
> each build used exactly the desired artifacts.  Another was the configurable
> workflow, which allowed me to create a custom chain of events to handle
> running builds on remote agents and control what data gets sent to them, what
> it will collect and what logic to execute once all distributed work has been
> completed for a particular build.  And the kicker which helped facilitate
> bringing it all together was its concept of a build life.
>
> At the time I could find *no other* build tool which could meet all of
> these needs, and so I went with AHP instead of spending months
> building/testing features in GBuild.
>
> While AHP supports configuring a lot of stuff via its web-interface, I
> found that it was very cumbersome, so I opted to write some glue, which was
> stored in svn here:
>
>
> https://svn.apache.org/viewvc/geronimo/sandbox/build-support/?pathrev=632245
>
> It's been a while, so I have to refresh my memory on how this stuff actually
> worked.  First let me explain about the code repository (what it calls
> codestation) and why it was critical to the TCK testing IMO.  When we use
> Maven normally, it pulls data from a set of external repositories, picks up
> more repositories from the stuff it downloads and quickly we lose control of
> where stuff comes from.  After it pulls down all that stuff, it churns
> through a build and spits out the stuff we care about, normally stuffing them
> (via mvn install) into the local repository.
>
> AHP supports by default tasks to publish artifacts (really just a set of
> files controlled by an Ant-like include/exclude path) from a build agent
> into Codestation, as well as tasks to resolve artifacts (ie. download them
> from Codestation to the local working directory on the build agents system).
>  Each top-level build in AHP gets assigned a new (empty) build life.
>  Artifacts are always published to/resolved from a build life, either that
> of the current build, or of a dependency build.
>
> So what I did was I setup builds for Geronimo Server (the normal
> server/trunk stuff), which did the normal mvn install thingy, but I always
> gave it a custom -Dmaven.local.repository which resolved to something inside
> the working directory for the running build.  The build was still online, so
> it pulled down a bunch of stuff into an empty local repository (so it was a
> clean build wrt the repository, as well as the source code, which was always
> fetched for each new build).  Once the build had finished, I used the
> artifact publisher task to push *all* of the stuff in the local repository
> into Codestation, labeled as something like "Maven repository artifacts" for
> the current build life.
>
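> Just to make the repository trick concrete, the idea was roughly the
> following (a made-up sketch, not the actual build logic; the paths and
> property handling here are illustrative):
>
>    // Sketch: run the server build against a Maven repository kept inside
>    // this build's own working directory, so every build starts from an
>    // empty repo and nothing leaks in from other builds on the agent.
>    def workDir   = new File('work').canonicalFile
>    def localRepo = new File(workDir, 'repository')
>
>    // Property name as described above; stock Maven 2 also accepts
>    // -Dmaven.repo.local for the same purpose.
>    def pb = new ProcessBuilder('mvn', 'install',
>            "-Dmaven.local.repository=${localRepo.absolutePath}" as String)
>    pb.directory(workDir)
>    pb.inheritIO()
>    def proc = pb.start()
>    assert proc.waitFor() == 0 : 'server build failed'
>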
> Then I setup another build for Apache Geronimo CTS Server (the
> porting/branches/* stuff).  This build was dependent upon the "Maven
> repository artifacts" of the Geronimo Server build, and I configured those
> artifacts to get installed on the build agents system in the same directory
> that I configured the CTS Server build to use for its local maven
> repository.  So again the repo started out empty, then got populated with
> all of the outputs from the normal G build, and then the cts-server build
> was started.  The build of the components and assemblies is normally fairly
> quick and aside from some stuff in the private tck repo won't download much
> more stuff, because it already had most of its dependencies installed via
> the Codestation dependency resolution.   Once the build finished, I
> published the cts-server assembly artifacts back to Codestation under like
> "CTS Server Assemblies" or something.
>
> Up until this point it's normal builds, but now we have built the G server,
> then built the CTS server (using the *exact* artifacts from the G server
> build, even though each might have happened on a different build agent).
>  And now we need to go and run a bunch of tests, using the *exact* CTS
> server assemblies, produce some output, collect it, and once all of the
> tests are done render some nice reports, etc.
>
> AHP supports setting up builds which contain "parallel" tasks, each of
> those tasks is then performed by a build agent, they have fancy build agent
> selection stuff, but for my needs I had basically 2 groups, one group for
> running the server builds, and then another for running the tests.  I only
> set aside like 2 agents for builds and the rest for tests.  Oh, I forgot to
> mention that I had 2 16x 16g AMD beasts all running CentOS 5, each with
> about 10-12 Xen virtual machines running internally to run build agents.
>  Each system also had a RAID-0 array setup over 4 disks to help reduce disk
> io wait, which was as I found out the limiting factor when trying to run a
> ton of builds that all checkout and download artifacts and such.
>
> I helped the AHP team add a new feature which was a parallel iterator
> task, so you define *one* task that internally fires off n parallel tasks,
> which would set the iteration number, and leave it up to the build logic to
> pick what to do based on that index.  The alternative was an unwieldy set of
> like 200 tasks in their UI which simply didn't work at all.  You might have
> noticed an "iterations.xml" file in the tck-testsuite directory; this was
> used to take an iteration number and turn it into what tests we actually
> run.  The <iteration> bits are order sensitive in that file.
>
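> To give a feel for it, the mapping from iteration number to tests was
> conceptually something like this (a rough sketch; the element and attribute
> names below are invented, not the real iterations.xml format):
>
>    // Pick the tests for iteration N.  Order matters: the n-th <iteration>
>    // element in the file *is* iteration n.
>    def pickTests(File iterationsFile, int index) {
>        def iterations = new XmlSlurper().parse(iterationsFile).iteration
>        iterations[index].test.collect { it.@name.text() }
>    }
>
>    def tests = pickTests(new File('iterations.xml'), 3)
>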
> Soooo, after we have a CTS Server for a particular G Server build, we can
> now go and do "runtests" for a specific set of tests (defined by an
> iteration)... this differed from the other builds above a little, but still
> pulled down artifacts, the CTS Server assemblies (only the assemblies and
> the required bits to run the geronimo-maven-plugin, which was used to
> geronimo:install, as well as used by the tck itself to fire up the server
> and so on).  The key thing here, with regards to the maven configuration
> (besides using that custom Codestation populated repository) was that the
> builds were run *offline*.
>
> After runtests completed, the results are then soaked up (the stuff that
> javatest pukes out with icky details, as well as the full log files and
> other stuff I can recall) and then pushed back into Codestation.
>
> Once all of the iterations were finished, another task fires off which
> generates a report.  It does this by downloading from Codestation all of the
> runtests outputs (each was zipped I think), unzips them one by one, runs some
> custom goo I wrote (based on some of the concepts from original stuff from the
> GBuild-based TCK automation), and generates a nice Javadoc-like report that
> includes all of the gory details.
>
> I can't remember how long I spent working on this... too long (not the
> reports I mean, the whole system).  But in the end I recall something like
> running an entire TCK testsuite for a single server configuration (like
> jetty) in about 4-6 hours... I sent mail to the list with the results, so if
> you are curious what the real number is, instead of my guess, you can look
> for it there.  But anyway it was damn quick running on just those 2
> machines.  And I *knew* exactly that each of the distributed tests was
> actually testing a known build that I could trace back to its artifacts and
> then back to its SVN revision, without worrying about mvn downloading
> something new when midnight rolled over or that a new G server or CTS server
> build that might be in progress has compromised the testing by polluting
> the local repository.
>
>  * * *
>
> So, about the sandbox/build-support stuff...
>
> First there is the 'harness' project, which is rather small, but contains
> the basic stuff, like a version of ant and maven which all of these builds
> would use, some other internal glue, a  fix for an evil Maven problem
> causing erroneous build failures due to some internal thread state
> corruption or gremlins, not sure which.  I kinda used this project to help
> manage the software needed by normal builds, which is why Ant and Maven were
> in there... ie. so I didn't have to go install it on each agent each time it
> changed, just let the AHP system deal with it for me.
>
> This was setup as a normal AHP project, built using its internal Ant
> builder (though having that builder configured still to use the local
> version it pulled from SVN to ensure it always works).
>
> Each other build was setup to depend on the output artifacts from the build
> harness build, using the latest in a range, like say using "3.*" for the
> latest 3.x build (which looks like that was 3.7).  This let me work on new
> stuff w/o breaking the current builds as I hacked things up.
>
> So, in addition to all of the stuff I mentioned above wrt the G and CTS
> builds, each also had this step which resolved the build harness artifacts
> to that working directory, and the Maven builds were always run via the
> version of Maven included from the harness.  But, AHP didn't actually run
> that version of Maven directly, it used its internal Ant task to execute the
> version of Ant from the harness *and* use the harness.xml buildfile.
>
> The harness.xml stuff is some more goo which I wrote to help manage AHP
> configurations.  With AHP (at that time, not sure if it has changed) you had
> to do most everything via the web UI, which sucked, and it was hard to
> refactor sets of projects and so on.  So I came up with a standard set of
> tasks to execute for a project, then put all of the custom muck I needed
> into what I called a _library_ and then had the AHP via harness.xml invoke
> it with some configuration about what project it was and other build
> details.
>
> The actual harness.xml is not very big, it simply makes sure that */bin/*
> is executable (codestation couldn't preserve execute bits), uses the
> Codestation command-line client (invoking the Java class directly though) to
> ask the repository to resolve artifacts from the "Build Library" to the
> local repository.  I had this artifact resolution separate from the normal
> dependency (or harness) artifact resolution so that it was easier for me to
> fix problems with the library while a huge set of TCK iterations were still
> queued up to run.  Basically, if I noticed a problem due to a code or
> configuration issue in an early build, I could fix it, and use the existing
> builds to verify the fix, instead of wasting an hour (sometimes more
> depending on networking problems accessing remote repos while building the
> servers) to rebuild and start over.
>
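> The execute-bit fixup, for example, is nothing more than Ant's chmod task;
> in Groovy terms it is roughly (sketch only, not the actual buildfile, and
> the directory name is illustrative):
>
>    // Restore the execute permissions that Codestation drops on the way down.
>    new AntBuilder().chmod(dir: 'harness', perm: 'ugo+rx', includes: '**/bin/*')
>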
> This brings us to the 'libraries' project.  In general the idea of a
> _library_ was just a named/versioned collection of files which could be
> used by a project.  The main (er only) library defined in this SVN is
> system/.  This is the groovy glue which made everything work.  This is where
> the entry-point class is located (the guy who gets invoked via harness.xml
> via:
>
>    <target name="harness" depends="init">
>        <groovy>
>            <classpath>
>                <pathelement location="${library.basedir}/groovy"/>
>            </classpath>
>
>            gbuild.system.BuildHarness.bootstrap(this)
>        </groovy>
>    </target>
>
> I won't go into too much detail on this stuff now, take a look at it and
> ask questions.  But, basically there is stuff in gbuild.system.* which is
> harness support muck, and stuff in gbuild.config.* which contains
> configuration.  I was kinda mid-refactoring of some things, starting to add
> new features, not sure where I left off actually. But the key bits are in
> gbuild.config.project.*  This contains a package for each project, with the
> package name being the same as the AHP project (with " " -> "_"). And then
> in each of those packages is at least a Controller.groovy class (or other
> classes if special muck was needed, like for the report generation in
> Geronimo_CTS, etc).
>
> The controller defines a set of actions, implemented as Groovy closures
> bound to properties of the Controller class.  One of the properties passed
> in from the AHP configuration (configured via the Web UI, passed to the
> harness.xml build, and then on to the Groovy harness) was the name of the
> _action_ to execute.  Most of that stuff should be fairly straightforward.
>
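> As a (made-up) illustration of the shape of a controller, not the real code:
>
>    // Illustrative only -- the real ones live under gbuild.config in svn.
>    class Controller
>    {
>        // Each action is a closure bound to a property; the harness looks
>        // one up by the action name passed down from the AHP configuration
>        // and calls it.
>        def build = {
>            // ... check out the source, run the Maven build, publish ...
>        }
>
>        def runtests = {
>            // ... install the CTS assembly and run one TCK iteration ...
>        }
>    }
>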
> So after a build is started (maybe from a Web UI click, or SVN change
> detection, or a TCK runtests iteration) the following happens (in simplified
> terms):
>
>  * Agent starts build
>  * Agent cleans its working directory
>  * Agent downloads the build harness
>  * Agent downloads any dependencies
>  * Agent invokes Ant on harness.xml passing in some details
>  * Harness.xml downloads the system/1 library
>  * Harness.xml runs gbuild.system.BuildHarness
>  * BuildHarness tries to construct a Controller instance for the project
>  * BuildHarness tries to find Controller action to execute
>  * BuildHarness executes the Controller action
>  * Agent publishes output artifacts
>  * Agent completes build
>
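> In other words, the harness entry point boils down to something like this
> (simplified pseudo-code, not the actual class; the property names are
> illustrative):
>
>    class BuildHarness {
>        // 'script' is the Groovy script object handed over from the
>        // <groovy> task in harness.xml via bootstrap(this).
>        static void bootstrap(script) {
>            // Project/action names arrive as Ant properties set by the AHP
>            // project configuration (names here are made up).
>            def projectName = script.properties['gbuild.project']
>            def actionName  = script.properties['gbuild.action']
>
>            // AHP project name maps to a package, with " " -> "_"
>            def pkg = projectName.replace(' ', '_')
>            def controller = Class.forName(
>                "gbuild.config.projects.${pkg}.Controller").newInstance()
>
>            // Actions are Groovy closures bound to Controller properties
>            controller."${actionName}"()
>        }
>    }
>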
> A few extra notes on libraries: the JavaEE TCK requires a bunch of stuff we
> get from Sun to execute.  This stuff isn't small, but is for the most part
> read-only.  So I setup a location on each build agent where these files were
> installed to.  I created AHP projects to manage them and treated them like a
> special "library" one which tried really hard not to go fetch its content
> unless the local content was out of date.  This helped speed up the entire
> build process... cause that delete/download of all that muck really slows
> down 20 agents running in parallel on 2 big machines with striped arrays.
>  For legal reasons this stuff was not kept in svn.apache.org's main
> repository, and for logistical reasons wasn't kept in the private tck repo
> on svn.apache.org either.  Because there were so many files, and because
> the httpd configuration on svn.apache.org kicks out requests that it
> thinks are *bunk* to help save the resources for the community, I had setup
> a private SSL-secured svn repository on the old gbuild.org machines to put
> in the full muck required, then setup some goo in the
> harness to resolve them.  This goo is all in gbuild.system.library.*  See
> the gbuild.config.projects.Geronimo_CTS.Controller for more of how it was
> actually used.
>
>  * * *
>
> Okay, that is about all the brain-dump for TCK muck I have in me for
> tonight.  Reply with questions if you have any.
>
> Cheers,
>
> --jason
>
>
>


-- 
~Jason Warner

Re: Continuous TCK Testing

Posted by Jason Dillon <ja...@gmail.com>.
Yup, it was manually installed on each machine ;-)

--jason


On Oct 9, 2008, at 6:43 PM, Jason Warner wrote:

> My apologies.  I didn't phrase my question properly.  Most of the  
> software necessary was pulled down via svn, but I saw no such  
> behaviour for AHP.  After looking at it some more, I imagine the  
> software was just manually installed on the machine.  It was kind of  
> a silly question to begin with, I suppose.


Re: Continuous TCK Testing

Posted by Jason Warner <ja...@gmail.com>.
My apologies.  I didn't phrase my question properly.  Most of the software
necessary was pulled down via svn, but I saw no such behaviour for AHP.
After looking at it some more, I imagine the software was just manually
installed on the machine.  It was kind of a silly question to begin with, I
suppose.

On Thu, Oct 9, 2008 at 4:16 AM, Jason Dillon <ja...@gmail.com> wrote:

> On Oct 8, 2008, at 11:05 PM, Jason Warner wrote:
>
> Here's a quick question.  Where does AHP come from?
>
>
> http://www.anthillpro.com
>
> (ever heard of google :-P)
>
> --jason


-- 
~Jason Warner

Re: Continuous TCK Testing

Posted by Jason Dillon <ja...@gmail.com>.
On Oct 8, 2008, at 11:05 PM, Jason Warner wrote:
> Here's a quick question.  Where does AHP come from?

http://www.anthillpro.com

(ever heard of google :-P)

--jason


>
> On Mon, Oct 6, 2008 at 1:18 PM, Jason Dillon  
> <ja...@gmail.com> wrote:
> Sure np, took me a while to get around to writing it too ;-)
>
> --jason
>
>
> On Oct 6, 2008, at 10:24 PM, Jason Warner wrote:
>
>> Just got around to reading this.  Thanks for the brain dump,  
>> Jason.  No questions as of yet, but I'm sure I'll need a few more  
>> reads before I understand it all.
>>
>> On Thu, Oct 2, 2008 at 2:34 PM, Jason Dillon  
>> <ja...@gmail.com> wrote:
>> On Oct 1, 2008, at 11:20 PM, Jason Warner wrote:
>>
>> Is the GBuild stuff in svn the same as the anthill-based code or is  
>> that something different?  GBuild seems to have scripts for running  
>> tck and that leads me to think they're the same thing, but I see no  
>> mention of anthill in the code.
>>
>> The Anthill stuff is completely different than the GBuild stuff.  I  
>> started out trying to get the TCK automated using GBuild, but  
>> decided that the system lacked too many features to perform as I  
>> desired, and went ahead with Anthill as it did pretty much  
>> everything, though had some stability problems.
>>
>> One of the main reasons why I choose Anthill (AHP, Anthill Pro that  
>> is) was its build agent and code repository systems.  This allowed  
>> me to ensure that each build used exactly the desired artifacts.   
>> Another was the configurable workflow, which allowed me to create a  
>> custom chain of events to handle running builds on remote agents  
>> and control what data gets set to them, what it will collect and  
>> what logic to execute once all distributed work has been completed  
>> for a particular build.  And the kicker which help facilitate  
>> bringing it all together was its concept of a build life.
>>
>> At the time I could find *no other* build tool which could meet all  
>> of these needs, and so I went with AHP instead of spending months  
>> building/testing features in GBuild.
>>
>> While AHP supports configuring a lot of stuff via its web- 
>> interface, I found that it was very cumbersome, so I opted to write  
>> some glue, which was stored in svn here:
>>
>>    https://svn.apache.org/viewvc/geronimo/sandbox/build-support/?pathrev=632245
>>
>> Its been a while, so I have to refresh my memory on how this stuff  
>> actually worked.  First let me explain about the code repository  
>> (what it calls codestation) and why it was critical to the TCK  
>> testing IMO.  When we use Maven normally, it pulls data from a set  
>> of external repositories, picks up more repositories from the stuff  
>> it downloads and quickly we loose control where stuff comes from.   
>> After it pulls down all that stuff, it churns though a build and  
>> spits out the stuff we care about, normally stuffing them (via mvn  
>> install) into the local repository.
>>
>> AHP supports by default tasks to publish artifacts (really just a  
>> set of files controlled by an Ant-like include/exclude path) from a  
>> build agent into Codestation, as well as tasks to resolve artifacts  
>> (ie. download them from Codestation to the local working directory  
>> on the build agents system).  Each top-level build in AHP gets  
>> assigned a new (empty) build life.  Artifacts are always published  
>> to/resolved from a build life, either that of the current build, or  
>> of a dependency build.
>>
>> So what I did was I setup builds for Geronimo Server (the normal  
>> server/trunk stuff), which did the normal mvn install thingy, but I  
>> always gave it a custom -Dmaven.local.repository which resolved to  
>> something inside the working directory for the running build.  The  
>> build was still online, so it pulled down a bunch of stuff into an  
>> empty local repository (so it was a clean build wrt the repository,  
>> as well as the source code, which was always fetched for each new  
>> build).  Once the build had finished, I used the artifact publisher  
>> task to push *all* of the stuff in the local repository into  
>> Codestation, labled as something like "Maven repository artifacts"  
>> for the current build life.
>>
>> Then I setup another build for Apache Geronimo CTS Server (the  
>> porting/branches/* stuff).  This build was dependent upon the  
>> "Maven repository artifacts" of the Geronimo Server build, and I  
>> configured those artifacts to get installed on the build agents  
>> system in the same directory that I configured the CTS Server build  
>> to use for its local maven repository.  So again the repo started  
>> out empty, then got populated with all of the outputs from the  
>> normal G build, and then the cts-server build was started.  The  
>> build of the components and assemblies is normally fairly quick and  
>> aside from some stuff in the private tck repo won't download muck  
>> more stuff, because it already had most of its dependencies  
>> installed via the Codestation dependency resolution.   Once the  
>> build finished, I published to cts-server assembly artifacts back  
>> to Codestation under like "CTS Server Assemblies" or something.
>>
>> Up until this point its normal builds, but now we have built the G  
>> server, then built the CTS server (using the *exact* artifacts from  
>> the G server build, even though each might have happened on a  
>> different build agent).  And now we need to go and run a bunch of  
>> tests, using the *exact* CTS server assemblies, produce some  
>> output, collect it, and once all of the tests are done render some  
>> nice reports, etc.
>>
>> AHP supports setting up builds which contain "parallel" tasks, each  
>> of those tasks is then performed by a build agent, they have fancy  
>> build agent selection stuff, but for my needs I had basically 2  
>> groups, one group for running the server builds, and then another  
>> for running the tests.  I only set aside like 2 agents for builds  
>> and the rest for tests.  Oh, I forgot to mention that I had 2 16x  
>> 16g AMD beasts all running CentOS 5, each with about 10-12 Xen  
>> virtual machines running internally to run build agents.  Each  
>> system also had a RAID-0 array set up over 4 disks to help reduce  
>> disk io wait, which, as I found out, was the limiting factor when  
>> trying to run a ton of builds that all checkout and download  
>> artifacts and such.
>>
>> I helped the AHP team add a new feature which was a parallel  
>> iterator task, so you define *one* task that internally fires off n  
>> parallel tasks, which would set the iteration number, and leave it  
>> up to the build logic to pick what to do based on that index.  The  
>> alternative was an unwieldy set of like 200 tasks in their UI which  
>> simply didn't work at all.  You might have noticed an  
>> "iterations.xml" file in the tck-testsuite directory, this was  
>> used to take an iteration number and turn it into what tests we  
>> actually run.  The <iteration> bits are order sensitive in that file.
>>
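>> (Just to make that concrete -- the real file is in the tck-testsuite  
>> directory and I'm not swearing to its exact schema here -- the idea is  
>> simply "index in, list of tests out", roughly like this Groovy sketch  
>> against a made-up <iteration tests="..."/> layout.)
>>
>>    // sketch only: map an iteration index to a set of tests; the element
>>    // and attribute names here are illustrative, not the real schema
>>    def doc = new XmlParser().parse(new File('iterations.xml'))
>>    def index = 3                         // would come from the AHP iterator task
>>    def iteration = doc.iteration[index]  // order in the file matters
>>    def tests = iteration.'@tests'.split(',')*.trim()
>>    println "iteration ${index} -> ${tests}"
>>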
>> Soooo, after we have a CTS Server for a particular G Server build,  
>> we can now go and do "runtests" for a specific set of tests (defined  
>> by an iteration)... this differed from the other builds above a  
>> little, but still pulled down artifacts, the CTS Server assemblies  
>> (only the assemblies and the required bits to run the geronimo- 
>> maven-plugin, which was used to geronimo:install, as well as used  
>> by the tck itself to fire up the server and so on).  The key thing  
>> here, with regards to the maven configuration (besides using that  
>> custom Codestation populated repository) was that the builds were  
>> run *offline*.
>>
>> After runtests completed, the results are then soaked up (the stuff  
>> that javatest pukes out with icky details, as well as the full log  
>> files and other stuff I can recall) and then pushed back into  
>> Codestation.
>>
>> Once all of the iterations were finished, another task fires off  
>> which generates a report.  It does this by downloading from  
>> Codestation all of the runtests outputs (each was zipped I think),  
>> unzips them one by one, runs some custom goo I wrote (based on some of  
>> the concepts from original stuff from the GBuild-based TCK  
>> automation), and generates a nice Javadoc-like report that includes  
>> all of the gory details.
>>
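>> (The unzip-and-walk part is mundane; with Groovy's AntBuilder it is  
>> along these lines -- the report rendering was the custom part.)
>>
>>    // sketch only: unpack each collected runtests result archive into a
>>    // working area before feeding it to the report generator
>>    def ant = new AntBuilder()
>>    def workDir = new File('work')
>>    workDir.mkdirs()
>>    new File('results').listFiles().findAll { it.name.endsWith('.zip') }.each { zip ->
>>        ant.unzip(src: zip.path, dest: new File(workDir, zip.name - '.zip').path)
>>    }
>>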
>> I can't remember how long I spent working on this... too long (not  
>> the reports I mean, the whole system).  But in the end I recall  
>> something like running an entire TCK testsuite for a single server  
>> configuration (like jetty) in about 4-6 hours... I sent mail to the  
>> list with the results, so if you are curious what the real number  
>> is, instead of my guess, you can look for it there.  But anyway it  
>> was damn quick running on just those 2 machines.  And I *knew*  
>> exactly that each of the distributed tests was actually testing a  
>> known build that I could trace back to its artifacts and then back  
>> to its SVN revision, without worrying about mvn downloading  
>> something new when midnight rolled over or that a new G server or  
>> CTS server build that might be in progress hasn't compromised the  
>> testing by polluting the local repository.
>>
>>  * * *
>>
>> So, about the sandbox/build-support stuff...
>>
>> First there is the 'harness' project, which is rather small, but  
>> contains the basic stuff, like a version of ant and maven which all  
>> of these builds would use, some other internal glue, a  fix for an  
>> evil Maven problem causing erroneous build failures due to some  
>> internal thread state corruption or gremlins, not sure which.  I  
>> kinda used this project to help manage the software needed by  
>> normal builds, which is why Ant and Maven were in there... ie. so I  
>> didn't have to go install it on each agent each time it changed,  
>> just let the AHP system deal with it for me.
>>
>> This was set up as a normal AHP project, built using its internal  
>> Ant builder (though having that builder configured still to use the  
>> local version it pulled from SVN to ensure it always works).
>>
>> Each other build was setup to depend on the output artifacts from  
>> the build harness build, using the latest in a range, like say  
>> using "3.*" for the latest 3.x build (which looks like that was  
>> 3.7).  This let me work on new stuff w/o breaking the current  
>> builds as I hacked things up.
>>
>> So, in addition to all of the stuff I mentioned above wrt the G and  
>> CTS builds, each also had this step which resolved the build  
>> harness artifacts to that working directory, and the Maven builds  
>> were always run via the version of Maven included from the  
>> harness.  But, AHP didn't actually run that version of Maven  
>> directly, it used its internal Ant task to execute the version of  
>> Ant from the harness *and* use the harness.xml buildfile.
>>
>> The harness.xml stuff is some more goo which I wrote to help manage  
>> AHP configurations.  With AHP (at that time, not sure if it has  
>> changed) you had to do most everything via the web UI, which  
>> sucked, and it was hard to refactor sets of projects and so on.  So  
>> I came up with a standard set of tasks to execute for a project,  
>> then put all of the custom muck I needed into what I called a  
>> _library_ and then had the AHP via harness.xml invoke it with some  
>> configuration about what project it was and other build details.
>>
>> The actual harness.xml is not very big, it simply makes sure that */ 
>> bin/* is executable (codestation couldn't preserve execute bits),  
>> uses the Codestation command-line client (invoking the javaclass  
>> directly though) to ask the repository to resolve artifacts from  
>> the "Build Library" to the local repository.  I had this artifact  
>> resolution separate from the normal dependency (or harness)  
>> artifact resolution so that it was easier for me to fix problems  
>> with the library while a huge set of TCK iterations were still  
>> queued up to run.  Basically, if I noticed a problem due to a code  
>> or configuration issue in an early build, I could fix it, and use  
>> the existing builds to verify the fix, instead of wasting an hour  
>> (sometimes more depending on networking problems accessing remote  
>> repos while building the servers) to rebuild and start over.
>>
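>> (The execute-bit fixup is the sort of thing that's a one-liner with  
>> Groovy's AntBuilder; something in this spirit, not a copy of the real  
>> harness code.)
>>
>>    // sketch only: re-apply execute permissions that codestation dropped
>>    new AntBuilder().chmod(dir: new File('.').canonicalPath,
>>                           includes: '*/bin/*', perm: 'ugo+rx')
>>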
>> This brings us to the 'libraries' project.  In general the idea of  
>> a _library_ was just a named/versioned collection of files, which  
>> could be used by a project.  The main (er only) library defined  
>> in this SVN is system/.  This is the groovy glue which made  
>> everything work.  This is where the entry-point class is located  
>> (the guy who gets invoked via harness.xml via:
>>
>>    <target name="harness" depends="init">
>>        <groovy>
>>            <classpath>
>>                <pathelement location="${library.basedir}/groovy"/>
>>            </classpath>
>>
>>            gbuild.system.BuildHarness.bootstrap(this)
>>        </groovy>
>>    </target>
>>
>> I won't go into too much detail on this stuff now, take a look at  
>> it and ask questions.  But, basically there is stuff in  
>> gbuild.system.* which is harness support muck, and stuff in  
>> gbuild.config.* which contains configuration.  I was kinda mid- 
>> refactoring of some things, starting to add new features, not sure  
>> where I left off actually. But the key bits are in  
>> gbuild.config.project.*  This contains a package for each project,  
>> with the package name being the same as the AHP project (with  
>> " " -> "_"). And then in each of those packages is at least a  
>> Controller.groovy class (or other classes if special muck was  
>> needed, like for the report generation in Geronimo_CTS, etc).
>>
>> The controller defines a set of actions, implemented as Groovy  
>> closures bound to properties of the Controller class.  One of the  
>> properties passed in from the AHP configuration (configured via the  
>> Web UI, passed to the harness.xml build, and then on to the Groovy  
>> harness) was the name of the _action_ to execute.  Most of that  
>> stuff should be fairly straightforward.
>>
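>> (Concretely, a toy version of that shape -- not the real gbuild.config  
>> code -- looks about like this; the real controllers lived in packages  
>> named after the AHP project, e.g. gbuild.config.projects.Geronimo_CTS.)
>>
>>    // sketch only: actions are closures hung off properties of a Controller
>>    class Controller {
>>        def build = { println 'build: run maven for this project' }
>>        def runtests = { println 'runtests: run one TCK iteration' }
>>    }
>>
>>    // the harness resolves the action name passed down from the AHP config
>>    def actionName = 'runtests'
>>    def controller = new Controller()
>>    def action = controller."$actionName"
>>    action()
>>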
>> So after a build is started (maybe from a Web UI click, or SVN  
>> change detection, or a TCK runtests iteration) the following  
>> happens (in simplified terms):
>>
>>  * Agent starts build
>>  * Agent cleans its working directory
>>  * Agent downloads the build harness
>>  * Agent downloads any dependencies
>>  * Agent invoke Ant on harness.xml passing in some details
>>  * Harness.xml downloads the system/1 library
>>  * Harness.xml runs gbuild.system.BuildHarness
>>  * BuildHarness tries to construct a Controller instance for the  
>> project
>>  * BuildHarness tries to find Controller action to execute
>>  * BuildHarness executes the Controller action
>>  * Agent publishes output artifacts
>>  * Agent completes build
>>
>> A few extra notes on libraries, the JavaEE TCK requires a bunch of  
>> stuff we get from Sun to execute.  This stuff isn't small, but is  
>> for the most part read-only.  So I setup a location on each build  
>> agent where these files were installed to.  I created AHP projects  
>> to manage them and treated them like a special "library" one which  
>> tried really hard not to go fetch its content unless the local  
>> content was out of date.  This helped speed up the entire build  
>> process... cause that delete/download of all that muck really slows  
>> down 20 agents running in parallel on 2 big machines with stripped  
>> array.  For legal reasons this stuff was not kept in  
>> svn.apache.org's main repository, and for logistical reasons wasn't  
>> kept in the private tck repo on svn.apache.org either.  Because  
>> there were so many files, and be case the httpd configuration on  
>> svn.apache.org kicks out requests that it thinks are *bunk* to help  
>> save the resources for the community, I had setup a private ssl  
>> secured private svn repository on the old gbuild.org machines to  
>> put in the full muck required, then setup some goo in the harness  
>> to resolve them.  This goo is all in gbuild.system.library.*  See  
>> the gbuild.config.projects.Geronimo_CTS.Controller for more of how  
>> it was actually used.
>>
>>  * * *
>>
>> Okay, that is about all the brain-dump for TCK muck I have in me  
>> for tonight.  Reply with questions if you have any.
>>
>> Cheers,
>>
>> --jason
>>
>>
>>
>>
>>
>> -- 
>> ~Jason Warner
>
>
>
>
> -- 
> ~Jason Warner


Re: Continuous TCK Testing

Posted by Jason Dillon <ja...@gmail.com>.
I'd imagine we need to ask the AHP folks for  a new license.

--jason


On Oct 9, 2008, at 10:56 AM, Kevan Miller wrote:

>
> On Oct 8, 2008, at 4:31 PM, Jason Warner wrote:
>
>> We had some suggestions earlier for some alternate means of  
>> implementing this (Hudson, Continuum, etc...).  Now that we've had  
>> Jason Dillon provide an overview of what we had in place before,  
>> does anyone have thoughts on what we should go with?  I'm thinking  
>> we should stick with the AHP based solution.  It will need to be  
>> updated most likely, but it's been tried and tested and shown to  
>> meet our needs.  I'm wondering, though, why we stopped using it  
>> before.  Was there a specific issue we're going to have to deal  
>> with again?
>
> IIRC, the overwhelming reason we stopped using it before was because  
> of hosting issues -- spotty networking, hardware failures, poor colo  
> support, etc. We shouldn't have any of these problems, now. If we do  
> run into problems, they should now be fixable. I have no reason to  
> favor Hudson/Continuum over AHP. So, if we can get AHP running  
> easily, I'm all for it. There's only one potential issue, that I'm  
> aware of.
>
> We previously had an Open Source License issued for our use of  
> Anthill. Here's some of the old discussion -- http://www.nabble.com/Geronimo-build-automation-status-(longish)-tt7649902.html#a7649902
>
> Although the board was aware of our usage of AntHill, since we  
> weren't running AntHill on ASF hardware, I'm not sure the license  
> was fully vetted by Infra. I don't see any issues, but I'll want to  
> run this by Infra.
>
> Jason D, will the existing license cover the version of AntHill that  
> we'll want to use? I'll run the license by Infra and will also  
> describe the issue for review by the Board, in our quarterly report.
>
> IMO, I'd proceed with the assumption that we'll be using AHP. Just  
> don't install it on Apache hardware, yet.
>
> --kevan


Re: Continuous TCK Testing

Posted by Jason Dillon <ja...@gmail.com>.
Yup, might need to resurrect that stuff if we plan on using it again.

--jason


On Oct 16, 2008, at 10:39 PM, Jason Warner wrote:

> Whoops... just realized that this was actually removed and I was  
> looking at a stickied revision of viewVC.  Nevermind.
>
> On Thu, Oct 16, 2008 at 11:15 AM, Jason Warner <ja...@gmail.com>  
> wrote:
> While we wait to hear back in regards to the license, I'm going to  
> update the maven used in build-support.  The server now requires  
> 2.0.9 and the version currently used by build support is 2.0.5.  I  
> suppose we'll need to update ant, as well.  What version of ant  
> should we be using?  1.7.1?
>
>
> On Fri, Oct 10, 2008 at 11:25 AM, Kevan Miller  
> <ke...@gmail.com> wrote:
>
> On Oct 8, 2008, at 11:56 PM, Kevan Miller wrote:
>
>>
>> On Oct 8, 2008, at 4:31 PM, Jason Warner wrote:
>>
>>> We had some suggestions earlier for some alternate means of  
>>> implementing this (Hudson, Continuum, etc...).  Now that we've had  
>>> Jason Dillon provide an overview of what we had in place before,  
>>> does anyone have thoughts on what we should go with?  I'm thinking  
>>> we should stick with the AHP based solution.  It will need to be  
>>> updated most likely, but it's been tried and tested and shown to  
>>> meet our needs.  I'm wondering, though, why we stopped using it  
>>> before.  Was there a specific issue we're going to have to deal  
>>> with again?
>>
>> IIRC, the overwhelming reason we stopped using it before was  
>> because of hosting issues -- spotty networking, hardware failures,  
>> poor colo support, etc. We shouldn't have any of these problems,  
>> now. If we do run into problems, they should now be fixable. I have  
>> no reason to favor Hudson/Continuum over AHP. So, if we can get AHP  
>> running easily, I'm all for it. There's only one potential issue,  
>> that I'm aware of.
>>
>> We previously had an Open Source License issued for our use of  
>> Anthill. Here's some of the old discussion -- http://www.nabble.com/Geronimo-build-automation-status-(longish)-tt7649902.html#a7649902
>>
>> Although the board was aware of our usage of AntHill, since we  
>> weren't running AntHill on ASF hardware, I'm not sure the license  
>> was fully vetted by Infra. I don't see any issues, but I'll want to  
>> run this by Infra.
>>
>> Jason D, will the existing license cover the version of AntHill  
>> that we'll want to use? I'll run the license by Infra and will also  
>> describe the issue for review by the Board, in our quarterly report.
>>
>> IMO, I'd proceed with the assumption that we'll be using AHP. Just  
>> don't install it on Apache hardware, yet.
>
> I've requested a new license from Anthill. Will let you know when I  
> get it.
>
> --kevan
>
>
>
>
> -- 
> ~Jason Warner
>
>
>
> -- 
> ~Jason Warner


Re: Continuous TCK Testing

Posted by Jason Warner <ja...@gmail.com>.
Whoops... just realized that this was actually removed and I was looking at
a stickied revision of viewVC.  Nevermind.

On Thu, Oct 16, 2008 at 11:15 AM, Jason Warner <ja...@gmail.com> wrote:

> While we wait to hear back in regards to the license, I'm going to update
> the maven used in build-support.  The server now requires 2.0.9 and the
> version currently used by build support is 2.0.5.  I suppose we'll need to
> update ant, as well.  What version of ant should we be using?  1.7.1?
>
>
> On Fri, Oct 10, 2008 at 11:25 AM, Kevan Miller <ke...@gmail.com>wrote:
>
>>
>> On Oct 8, 2008, at 11:56 PM, Kevan Miller wrote:
>>
>>
>> On Oct 8, 2008, at 4:31 PM, Jason Warner wrote:
>>
>> We had some suggestions earlier for some alternate means of implementing
>> this (Hudson, Continuum, etc...).  Now that we've had Jason Dillon provide
>> an overview of what we had in place before, does anyone have thoughts on
>> what we should go with?  I'm thinking we should stick with the AHP based
>> solution.  It will need to be updated most likely, but it's been tried and
>> tested and shown to meet our needs.  I'm wondering, though, why we stopped
>> using it before.  Was there a specific issue we're going to have to deal
>> with again?
>>
>>
>> IIRC, the overwhelming reason we stopped using it before was because of
>> hosting issues -- spotty networking, hardware failures, poor colo support,
>> etc. We shouldn't have any of these problems, now. If we do run into
>> problems, they should now be fixable. I have no reason to favor
>> Hudson/Continuum over AHP. So, if we can get AHP running easily, I'm all for
>> it. There's only one potential issue, that I'm aware of.
>>
>> We previously had an Open Source License issued for our use of Anthill.
>> Here's some of the old discussion --
>> http://www.nabble.com/Geronimo-build-automation-status-(longish)-tt7649902.html#a7649902
>>
>> Although the board was aware of our usage of AntHill, since we weren't
>> running AntHill on ASF hardware, I'm not sure the license was fully vetted
>> by Infra. I don't see any issues, but I'll want to run this by Infra.
>>
>> Jason D, will the existing license cover the version of AntHill that we'll
>> want to use? I'll run the license by Infra and will also describe the issue
>> for review by the Board, in our quarterly report.
>>
>> IMO, I'd proceed with the assumption that we'll be using AHP. Just don't
>> install it on Apache hardware, yet.
>>
>>
>> I've requested a new license from Anthill. Will let you know when I get
>> it.
>>
>> --kevan
>>
>>
>
>
> --
> ~Jason Warner
>



-- 
~Jason Warner

Re: Continuous TCK Testing

Posted by Jason Warner <ja...@gmail.com>.
While we wait to hear back in regards to the license, I'm going to update
the maven used in build-support.  The server now requires 2.0.9 and the
version currently used by build support is 2.0.5.  I suppose we'll need to
update ant, as well.  What version of ant should we be using?  1.7.1?

On Fri, Oct 10, 2008 at 11:25 AM, Kevan Miller <ke...@gmail.com>wrote:

>
> On Oct 8, 2008, at 11:56 PM, Kevan Miller wrote:
>
>
> On Oct 8, 2008, at 4:31 PM, Jason Warner wrote:
>
> We had some suggestions earlier for some alternate means of implementing
> this (Hudson, Continuum, etc...).  Now that we've had Jason Dillon provide
> an overview of what we had in place before, does anyone have thoughts on
> what we should go with?  I'm thinking we should stick with the AHP based
> solution.  It will need to be updated most likely, but it's been tried and
> tested and shown to meet our needs.  I'm wondering, though, why we stopped
> using it before.  Was there a specific issue we're going to have to deal
> with again?
>
>
> IIRC, the overwhelming reason we stopped using it before was because of
> hosting issues -- spotty networking, hardware failures, poor colo support,
> etc. We shouldn't have any of these problems, now. If we do run into
> problems, they should now be fixable. I have no reason to favor
> Hudson/Continuum over AHP. So, if we can get AHP running easily, I'm all for
> it. There's only one potential issue, that I'm aware of.
>
> We previously had an Open Source License issued for our use of Anthill.
> Here's some of the old discussion --
> http://www.nabble.com/Geronimo-build-automation-status-(longish)-tt7649902.html#a7649902
>
> Although the board was aware of our usage of AntHill, since we weren't
> running AntHill on ASF hardware, I'm not sure the license was fully vetted
> by Infra. I don't see any issues, but I'll want to run this by Infra.
>
> Jason D, will the existing license cover the version of AntHill that we'll
> want to use? I'll run the license by Infra and will also describe the issue
> for review by the Board, in our quarterly report.
>
> IMO, I'd proceed with the assumption that we'll be using AHP. Just don't
> install it on Apache hardware, yet.
>
>
> I've requested a new license from Anthill. Will let you know when I get it.
>
> --kevan
>
>


-- 
~Jason Warner

Re: Continuous TCK Testing

Posted by Jason Dillon <ja...@gmail.com>.
Before when I had those 2 build machines running in my apartment in  
Berkeley, I set up one xen domain specifically for running monitoring  
tools, installed cacti on it, and then set up snmpd on each of the  
other machines, configured to allow access from the xen monitoring  
domain.  This provided a very detailed, easy-to-grok monitoring console  
for the build agents.

--jason


On Oct 18, 2008, at 5:58 AM, Jay D. McHugh wrote:

> Hey Kevan,
>
> Regarding monitoring...
>
> I managed to run into xenmon.py.
>
> It appears to log the system utilization for the whole box as well  
> as each
> VM to log files in 'your' home directory if you specify the '-n' flag.
>
> Here is the help page for xenmon.py:
> jaydm@phoebe:~$ sudo python /usr/sbin/xenmon.py -h
> Usage: xenmon.py [options]
>
> Options:
>  -h, --help            show this help message and exit
>  -l, --live            show the ncurses live monitoring frontend  
> (default)
>  -n, --notlive         write to file instead of live monitoring
>  -p PREFIX, --prefix=PREFIX
>                        prefix to use for output files
>  -t DURATION, --time=DURATION
>                        stop logging to file after this much time has  
> elapsed
>                        (in seconds). set to 0 to keep logging  
> indefinitely
>  -i INTERVAL, --interval=INTERVAL
>                        interval for logging (in ms)
>  --ms_per_sample=MSPERSAMPLE
>                        determines how many ms worth of data goes in  
> a sample
>  --cpu=CPU             specifies which cpu to display data for
>  --allocated           Display allocated time for each domain
>  --noallocated         Don't display allocated time for each domain
>  --blocked             Display blocked time for each domain
>  --noblocked           Don't display blocked time for each domain
>  --waited              Display waiting time for each domain
>  --nowaited            Don't display waiting time for each domain
>  --excount             Display execution count for each domain
>  --noexcount           Don't display execution count for each domain
>  --iocount             Display I/O count for each domain
>  --noiocount           Don't display I/O count for each domain
>
> And here is some sample output:
>
> jaydm@phoebe:~$ cat log-dom0.log
> # passed cpu dom cpu(tot) cpu(%) cpu/ex allocated/ex blocked(tot)  
> blocked(%) blocked/io waited(tot) waited(%) waited/ex ex/s io(tot)  
> io/ex
> 0.000 0 0 2.086 0.000 38863.798 30000000.000 154.177 0.000 0.000  
> 0.504 0.000 9383.278 0.000 0.000 0.000
> 2.750 1 0 2.512 0.000 53804.925 30000000.000 153.217 0.000 0.000  
> 0.316 0.000 6774.813 0.000 0.000 0.000
> 4.063 2 0 2.625 0.000 59959.942 30000000.000 153.886 0.000 0.000  
> 0.173 0.000 3939.987 0.000 0.000 0.000
> 5.203 3 0 3.020 0.000 47522.430 30000000.000 171.834 0.000 0.000  
> 0.701 0.000 11031.759 0.000 0.000 0.000
> 6.403 4 0 2.130 0.000 39256.871 30000000.000 171.870 0.000 0.000  
> 0.617 0.000 11378.014 0.000 0.000 0.000
> 9.230 6 0 0.836 0.000 53962.875 30000000.000 57.287 0.000 0.000  
> 0.038 0.000 2450.488 0.000 0.000 0.000
> 10.305 7 0 2.171 0.000 46119.247 30000000.000 154.008 0.000 0.000  
> 0.367 0.000 7804.444 0.000 0.000 0.000
> 11.518 0 0 15931680.822 1.593 54019.023 30000000.000 889706824.191  
> 88.971 0.000 2630292.436 0.263 8918.446 294.927 0.000 0.000
> 1009.216 1 0 7687035.544 0.769 53822.548 30000000.000 473101345.004  
> 47.310 0.000 864964.568 0.086 6056.248 142.822 0.000 0.000
> 1010.199 2 0 20502235.224 2.050 61655.293 30000000.000 979188763.754  
> 97.919 0.000 4279443600.516 427.944 12869345.608 332.530 0.000 0.000
> 1011.239 3 0 13634865.766 1.363 45934.870 30000000.000 985479796.363  
> 98.548 0.000 1593248.596 0.159 5367.538 296.830 0.000 0.000
> 1012.312 4 0 18228049.181 1.823 61242.790 30000000.000 979822521.396  
> 97.982 0.000 2593364.560 0.259 8713.213 297.636 0.000 0.000
> 1013.338 5 0 9891757.872 0.989 65386.046 30000000.000 571275802.794  
> 57.128 0.000 357431.539 0.036 2362.678 151.282 0.000 0.000
>
> We could probably add a cron job to grab a single sample every X  
> minutes
> and append them together to build up a utilization history (rather  
> than
> simply running it all of the time).
>
> I just tried to get a single sample and the smallest run I could get  
> was
> about three seconds with four samples taken.
>
> Or, I also tried xentop in batch mode:
>
> jaydm@phoebe:~$ sudo xentop -b -i 1
>      NAME  STATE   CPU(sec) CPU(%)     MEM(k) MEM(%)  MAXMEM(k)  
> MAXMEM(%) VCPUS NETS NETTX(k) NETRX(k) VBDS   VBD_OO   VBD_RD    
> VBD_WR SSID
>  Domain-0 -----r     430567    0.0    3939328   23.5   no  
> limit       n/a     8    4        0        0    0        0         
> 0        0 2149631536
>     tck01 --b---     750449    0.0    3145728   18.8    3145728       
> 18.8     2    1   483054  1855493    1       15   655667  8445829  
> 2149631536
>     tck02 --b---    1101273    0.0    3145728   18.8    3145728       
> 18.8     2    1   367792  1773407    1       83  1131709  9030663  
> 2149631536
>     tck03 -----r     144552    0.0    3145728   18.8    3145728       
> 18.8     2    1   188115  2370069    1        6   370431  1290683  
> 2149631536
>     tck04 --b---     103742    0.0    3145728   18.8    3145728       
> 18.8     2    1   286936  2341941    1        7   381523  1484476  
> 2149631536
>
> It looks to me like having a cron job that periodically ran xentop and
> built up a history would be the best option (without digging through
> a ton of different specialized monitor packages).
>
>
> Jay
>
> Kevan Miller wrote:
>>
>> On Oct 10, 2008, at 11:29 AM, Kevan Miller wrote:
>>
>>>
>>> On Oct 10, 2008, at 11:25 AM, Kevan Miller wrote:
>>>
>>>>
>>>> On Oct 8, 2008, at 11:56 PM, Kevan Miller wrote:
>>>>
>>>>>
>>>>> On Oct 8, 2008, at 4:31 PM, Jason Warner wrote:
>>>>>
>>>>>> We had some suggestions earlier for some alternate means of
>>>>>> implementing this (Hudson, Continuum, etc...).  Now that we've  
>>>>>> had
>>>>>> Jason Dillon provide an overview of what we had in place before,
>>>>>> does anyone have thoughts on what we should go with?  I'm  
>>>>>> thinking
>>>>>> we should stick with the AHP based solution.  It will need to be
>>>>>> updated most likely, but it's been tried and tested and shown to
>>>>>> meet our needs.  I'm wondering, though, why we stopped using it
>>>>>> before.  Was there a specific issue we're going to have to deal
>>>>>> with again?
>>>>>
>>>>> IIRC, the overwhelming reason we stopped using it before was  
>>>>> because
>>>>> of hosting issues -- spotty networking, hardware failures, poor  
>>>>> colo
>>>>> support, etc. We shouldn't have any of these problems, now. If  
>>>>> we do
>>>>> run into problems, they should now be fixable. I have no reason to
>>>>> favor Hudson/Continuum over AHP. So, if we can get AHP running
>>>>> easily, I'm all for it. There's only one potential issue, that I'm
>>>>> aware of.
>>>>>
>>>>> We previously had an Open Source License issued for our use of
>>>>> Anthill. Here's some of the old discussion --
>>>>> http://www.nabble.com/Geronimo-build-automation-status-(longish)-tt7649902.html#a7649902
>>>>>
>>>>>
>>>>> Although the board was aware of our usage of AntHill, since we
>>>>> weren't running AntHill on ASF hardware, I'm not sure the license
>>>>> was fully vetted by Infra. I don't see any issues, but I'll want  
>>>>> to
>>>>> run this by Infra.
>>>>>
>>>>> Jason D, will the existing license cover the version of AntHill  
>>>>> that
>>>>> we'll want to use? I'll run the license by Infra and will also
>>>>> describe the issue for review by the Board, in our quarterly  
>>>>> report.
>>
>> Heh. Oops. Just noticed that I sent the following to myself and not  
>> the
>> dev list. I hate when I do that...
>>
>>>
>>> One more thing... from emails on infrastructure@apache.org looks  
>>> like
>>> Infra is cool with us running Anthill on selene and phoebe.
>>>
>>> BTW, am planning on installing monitoring software over the  
>>> weekend on
>>> selene and phoebe. The board is interested in monitoring our  
>>> usage...
>>
>>
>> Also, we now have a new AntHill license for our use. I've placed the
>> license in ~kevan/License2.txt on phoebe and selene. This license  
>> should
>> only be used for Apache use. So, should not be placed in a public
>> location (e.g.  our public svn tree).
>>
>> Regarding monitoring software -- I haven't been able to get it to  
>> work
>> yet. vmstat/iostat don't work, unless you run on every virtual  
>> machine.
>> 'xm top' gathers data on all domains, however, doesn't make the data
>> easy to tuck away in a log file/available to snmp... Advice  
>> welcome...
>>
>> --kevan
>>


Re: Continuous TCK Testing

Posted by "Jay D. McHugh" <ja...@gmail.com>.
Hey Kevan,

Regarding monitoring...

I managed to run into xenmon.py.

It appears to log the system utilization for the whole box as well as each
VM to log files in 'your' home directory if you specify the '-n' flag.

Here is the help page for xenmon.py:
jaydm@phoebe:~$ sudo python /usr/sbin/xenmon.py -h
Usage: xenmon.py [options]

Options:
  -h, --help            show this help message and exit
  -l, --live            show the ncurses live monitoring frontend (default)
  -n, --notlive         write to file instead of live monitoring
  -p PREFIX, --prefix=PREFIX
                        prefix to use for output files
  -t DURATION, --time=DURATION
                        stop logging to file after this much time has elapsed
                        (in seconds). set to 0 to keep logging indefinitely
  -i INTERVAL, --interval=INTERVAL
                        interval for logging (in ms)
  --ms_per_sample=MSPERSAMPLE
                        determines how many ms worth of data goes in a sample
  --cpu=CPU             specifies which cpu to display data for
  --allocated           Display allocated time for each domain
  --noallocated         Don't display allocated time for each domain
  --blocked             Display blocked time for each domain
  --noblocked           Don't display blocked time for each domain
  --waited              Display waiting time for each domain
  --nowaited            Don't display waiting time for each domain
  --excount             Display execution count for each domain
  --noexcount           Don't display execution count for each domain
  --iocount             Display I/O count for each domain
  --noiocount           Don't display I/O count for each domain

And here is some sample output:

jaydm@phoebe:~$ cat log-dom0.log
# passed cpu dom cpu(tot) cpu(%) cpu/ex allocated/ex blocked(tot) blocked(%) blocked/io waited(tot) waited(%) waited/ex ex/s io(tot) io/ex
0.000 0 0 2.086 0.000 38863.798 30000000.000 154.177 0.000 0.000 0.504 0.000 9383.278 0.000 0.000 0.000
2.750 1 0 2.512 0.000 53804.925 30000000.000 153.217 0.000 0.000 0.316 0.000 6774.813 0.000 0.000 0.000
4.063 2 0 2.625 0.000 59959.942 30000000.000 153.886 0.000 0.000 0.173 0.000 3939.987 0.000 0.000 0.000
5.203 3 0 3.020 0.000 47522.430 30000000.000 171.834 0.000 0.000 0.701 0.000 11031.759 0.000 0.000 0.000
6.403 4 0 2.130 0.000 39256.871 30000000.000 171.870 0.000 0.000 0.617 0.000 11378.014 0.000 0.000 0.000
9.230 6 0 0.836 0.000 53962.875 30000000.000 57.287 0.000 0.000 0.038 0.000 2450.488 0.000 0.000 0.000
10.305 7 0 2.171 0.000 46119.247 30000000.000 154.008 0.000 0.000 0.367 0.000 7804.444 0.000 0.000 0.000
11.518 0 0 15931680.822 1.593 54019.023 30000000.000 889706824.191 88.971 0.000 2630292.436 0.263 8918.446 294.927 0.000 0.000
1009.216 1 0 7687035.544 0.769 53822.548 30000000.000 473101345.004 47.310 0.000 864964.568 0.086 6056.248 142.822 0.000 0.000
1010.199 2 0 20502235.224 2.050 61655.293 30000000.000 979188763.754 97.919 0.000 4279443600.516 427.944 12869345.608 332.530 0.000 0.000
1011.239 3 0 13634865.766 1.363 45934.870 30000000.000 985479796.363 98.548 0.000 1593248.596 0.159 5367.538 296.830 0.000 0.000
1012.312 4 0 18228049.181 1.823 61242.790 30000000.000 979822521.396 97.982 0.000 2593364.560 0.259 8713.213 297.636 0.000 0.000
1013.338 5 0 9891757.872 0.989 65386.046 30000000.000 571275802.794 57.128 0.000 357431.539 0.036 2362.678 151.282 0.000 0.000

We could probably add a cron job to grab a single sample every X minutes
and append them together to build up a utilization history (rather than
simply running it all of the time).

I just tried to get a single sample and the smallest run I could get was
about three seconds with four samples taken.

Or, I also tried xentop in batch mode:

jaydm@phoebe:~$ sudo xentop -b -i 1
      NAME  STATE   CPU(sec) CPU(%)     MEM(k) MEM(%)  MAXMEM(k) MAXMEM(%) VCPUS NETS NETTX(k) NETRX(k) VBDS   VBD_OO   VBD_RD   VBD_WR SSID
  Domain-0 -----r     430567    0.0    3939328   23.5   no limit       n/a     8    4        0        0    0        0        0        0 2149631536
     tck01 --b---     750449    0.0    3145728   18.8    3145728      18.8     2    1   483054  1855493    1       15   655667  8445829 2149631536
     tck02 --b---    1101273    0.0    3145728   18.8    3145728      18.8     2    1   367792  1773407    1       83  1131709  9030663 2149631536
     tck03 -----r     144552    0.0    3145728   18.8    3145728      18.8     2    1   188115  2370069    1        6   370431  1290683 2149631536
     tck04 --b---     103742    0.0    3145728   18.8    3145728      18.8     2    1   286936  2341941    1        7   381523  1484476 2149631536

It looks to me like having a cron job that periodically ran xentop and
built up a history would be the best option (without digging through
a ton of different specialized monitor packages).
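
For what it's worth, the "append a sample" part is tiny -- for instance a
little Groovy script kicked off from cron, something like this (log path
illustrative, and it assumes whatever runs it has the privileges xentop
needs):

// sketch only: take one xentop batch sample and append it to a history log
def sample = ['xentop', '-b', '-i', '1'].execute().text
def history = new File('/var/log/xentop-history.log')
history << "=== ${new Date()} ===\n"
history << sample
history << '\n'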


Jay

Kevan Miller wrote:
> 
> On Oct 10, 2008, at 11:29 AM, Kevan Miller wrote:
> 
>>
>> On Oct 10, 2008, at 11:25 AM, Kevan Miller wrote:
>>
>>>
>>> On Oct 8, 2008, at 11:56 PM, Kevan Miller wrote:
>>>
>>>>
>>>> On Oct 8, 2008, at 4:31 PM, Jason Warner wrote:
>>>>
>>>>> We had some suggestions earlier for some alternate means of
>>>>> implementing this (Hudson, Continuum, etc...).  Now that we've had
>>>>> Jason Dillon provide an overview of what we had in place before,
>>>>> does anyone have thoughts on what we should go with?  I'm thinking
>>>>> we should stick with the AHP based solution.  It will need to be
>>>>> updated most likely, but it's been tried and tested and shown to
>>>>> meet our needs.  I'm wondering, though, why we stopped using it
>>>>> before.  Was there a specific issue we're going to have to deal
>>>>> with again?
>>>>
>>>> IIRC, the overwhelming reason we stopped using it before was because
>>>> of hosting issues -- spotty networking, hardware failures, poor colo
>>>> support, etc. We shouldn't have any of these problems, now. If we do
>>>> run into problems, they should now be fixable. I have no reason to
>>>> favor Hudson/Continuum over AHP. So, if we can get AHP running
>>>> easily, I'm all for it. There's only one potential issue, that I'm
>>>> aware of.
>>>>
>>>> We previously had an Open Source License issued for our use of
>>>> Anthill. Here's some of the old discussion --
>>>> http://www.nabble.com/Geronimo-build-automation-status-(longish)-tt7649902.html#a7649902
>>>>
>>>>
>>>> Although the board was aware of our usage of AntHill, since we
>>>> weren't running AntHill on ASF hardware, I'm not sure the license
>>>> was fully vetted by Infra. I don't see any issues, but I'll want to
>>>> run this by Infra.
>>>>
>>>> Jason D, will the existing license cover the version of AntHill that
>>>> we'll want to use? I'll run the license by Infra and will also
>>>> describe the issue for review by the Board, in our quarterly report.
> 
> Heh. Oops. Just noticed that I sent the following to myself and not the
> dev list. I hate when I do that...
> 
>>
>> One more thing... from emails on infrastructure@apache.org looks like
>> Infra is cool with us running Anthill on selene and phoebe.
>>
>> BTW, am planning on installing monitoring software over the weekend on
>> selene and phoebe. The board is interested in monitoring our usage...
> 
> 
> Also, we now have a new AntHill license for our use. I've placed the
> license in ~kevan/License2.txt on phoebe and selene. This license should
> only be used for Apache use. So, should not be placed in a public
> location (e.g.  our public svn tree).
> 
> Regarding monitoring software -- I haven't been able to get it to work
> yet. vmstat/iostat don't work, unless you run on every virtual machine.
> 'xm top' gathers data on all domains, however, doesn't make the data
> easy to tuck away in a log file/available to snmp... Advice welcome...
> 
> --kevan
> 

Re: Continuous TCK Testing

Posted by Kevan Miller <ke...@gmail.com>.
On Oct 10, 2008, at 11:29 AM, Kevan Miller wrote:

>
> On Oct 10, 2008, at 11:25 AM, Kevan Miller wrote:
>
>>
>> On Oct 8, 2008, at 11:56 PM, Kevan Miller wrote:
>>
>>>
>>> On Oct 8, 2008, at 4:31 PM, Jason Warner wrote:
>>>
>>>> We had some suggestions earlier for some alternate means of  
>>>> implementing this (Hudson, Continuum, etc...).  Now that we've  
>>>> had Jason Dillon provide an overview of what we had in place  
>>>> before, does anyone have thoughts on what we should go with?  I'm  
>>>> thinking we should stick with the AHP based solution.  It will  
>>>> need to be updated most likely, but it's been tried and tested  
>>>> and shown to meet our needs.  I'm wondering, though, why we  
>>>> stopped using it before.  Was there a specific issue we're going  
>>>> to have to deal with again?
>>>
>>> IIRC, the overwhelming reason we stopped using it before was  
>>> because of hosting issues -- spotty networking, hardware failures,  
>>> poor colo support, etc. We shouldn't have any of these problems,  
>>> now. If we do run into problems, they should now be fixable. I  
>>> have no reason to favor Hudson/Continuum over AHP. So, if we can  
>>> get AHP running easily, I'm all for it. There's only one potential  
>>> issue, that I'm aware of.
>>>
>>> We previously had an Open Source License issued for our use of  
>>> Anthill. Here's some of the old discussion -- http://www.nabble.com/Geronimo-build-automation-status-(longish)-tt7649902.html#a7649902
>>>
>>> Although the board was aware of our usage of AntHill, since we  
>>> weren't running AntHill on ASF hardware, I'm not sure the license  
>>> was fully vetted by Infra. I don't see any issues, but I'll want  
>>> to run this by Infra.
>>>
>>> Jason D, will the existing license cover the version of AntHill  
>>> that we'll want to use? I'll run the license by Infra and will  
>>> also describe the issue for review by the Board, in our quarterly  
>>> report.

Heh. Oops. Just noticed that I sent the following to myself and not  
the dev list. I hate when I do that...

>
> One more thing... from emails on infrastructure@apache.org looks  
> like Infra is cool with us running Anthill on selene and phoebe.
>
> BTW, am planning on installing monitoring software over the weekend  
> on selene and phoebe. The board is interested in monitoring our  
> usage...


Also, we now have a new AntHill license for our use. I've placed the  
license in ~kevan/License2.txt on phoebe and selene. This license  
should only be used for Apache use. So, should not be placed in a  
public location (e.g.  our public svn tree).

Regarding monitoring software -- I haven't been able to get it to work  
yet. vmstat/iostat don't work, unless you run on every virtual  
machine. 'xm top' gathers data on all domains, however, doesn't make  
the data easy to tuck away in a log file/available to snmp... Advice  
welcome...

--kevan

Re: Continuous TCK Testing

Posted by Kevan Miller <ke...@gmail.com>.
On Oct 8, 2008, at 11:56 PM, Kevan Miller wrote:

>
> On Oct 8, 2008, at 4:31 PM, Jason Warner wrote:
>
>> We had some suggestions earlier for some alternate means of  
>> implementing this (Hudson, Continuum, etc...).  Now that we've had  
>> Jason Dillon provide an overview of what we had in place before,  
>> does anyone have thoughts on what we should go with?  I'm thinking  
>> we should stick with the AHP based solution.  It will need to be  
>> updated most likely, but it's been tried and tested and shown to  
>> meet our needs.  I'm wondering, though, why we stopped using it  
>> before.  Was there a specific issue we're going to have to deal  
>> with again?
>
> IIRC, the overwhelming reason we stopped using it before was because  
> of hosting issues -- spotty networking, hardware failures, poor colo  
> support, etc. We shouldn't have any of these problems, now. If we do  
> run into problems, they should now be fixable. I have no reason to  
> favor Hudson/Continuum over AHP. So, if we can get AHP running  
> easily, I'm all for it. There's only one potential issue, that I'm  
> aware of.
>
> We previously had an Open Source License issued for our use of  
> Anthill. Here's some of the old discussion -- http://www.nabble.com/Geronimo-build-automation-status-(longish)-tt7649902.html#a7649902
>
> Although the board was aware of our usage of AntHill, since we  
> weren't running AntHill on ASF hardware, I'm not sure the license  
> was fully vetted by Infra. I don't see any issues, but I'll want to  
> run this by Infra.
>
> Jason D, will the existing license cover the version of AntHill that  
> we'll want to use? I'll run the license by Infra and will also  
> describe the issue for review by the Board, in our quarterly report.
>
> IMO, I'd proceed with the assumption that we'll be using AHP. Just  
> don't install it on Apache hardware, yet.

I've requested a new license from Anthill. Will let you know when I  
get it.

--kevan


Re: Continuous TCK Testing

Posted by Kevan Miller <ke...@gmail.com>.
On Oct 8, 2008, at 4:31 PM, Jason Warner wrote:

> We had some suggestions earlier for some alternate means of  
> implementing this (Hudson, Continuum, etc...).  Now that we've had  
> Jason Dillon provide an overview of what we had in place before,  
> does anyone have thoughts on what we should go with?  I'm thinking  
> we should stick with the AHP based solution.  It will need to be  
> updated most likely, but it's been tried and tested and shown to  
> meet our needs.  I'm wondering, though, why we stopped using it  
> before.  Was there a specific issue we're going to have to deal with  
> again?

IIRC, the overwhelming reason we stopped using it before was because  
of hosting issues -- spotty networking, hardware failures, poor colo  
support, etc. We shouldn't have any of these problems, now. If we do  
run into problems, they should now be fixable. I have no reason to  
favor Hudson/Continuum over AHP. So, if we can get AHP running easily,  
I'm all for it. There's only one potential issue, that I'm aware of.

We previously had an Open Source License issued for our use of  
Anthill. Here's some of the old discussion -- http://www.nabble.com/Geronimo-build-automation-status-(longish)-tt7649902.html#a7649902

Although the board was aware of our usage of AntHill, since we weren't  
running AntHill on ASF hardware, I'm not sure the license was fully  
vetted by Infra. I don't see any issues, but I'll want to run this by  
Infra.

Jason D, will the existing license cover the version of AntHill that  
we'll want to use? I'll run the license by Infra and will also  
describe the issue for review by the Board, in our quarterly report.

IMO, I'd proceed with the assumption that we'll be using AHP. Just  
don't install it on Apache hardware, yet.

--kevan

Re: Continuous TCK Testing

Posted by Jason Warner <ja...@gmail.com>.
We had some suggestions earlier for some alternate means of implementing
this (Hudson, Continuum, etc...).  Now that we've had Jason Dillon provide
an overview of what we had in place before, does anyone have thoughts on
what we should go with?  I'm thinking we should stick with the AHP based
solution.  It will need to be updated most likely, but it's been tried and
tested and shown to meet our needs.  I'm wondering, though, why we stopped
using it before.  Was there a specific issue we're going to have to deal
with again?

Thanks,

On Wed, Oct 8, 2008 at 12:05 PM, Jason Warner <ja...@gmail.com> wrote:

> Here's a quick question.  Where does AHP come from?
>
> On Mon, Oct 6, 2008 at 1:18 PM, Jason Dillon <ja...@gmail.com>wrote:
>
>> Sure np, took me a while to get around to writing it too ;-)
>> --jason
>>
>>
>> On Oct 6, 2008, at 10:24 PM, Jason Warner wrote:
>>
>> Just got around to reading this.  Thanks for the brain dump, Jason.  No
>> questions as of yet, but I'm sure I'll need a few more reads before I
>> understand it all.
>>
>> On Thu, Oct 2, 2008 at 2:34 PM, Jason Dillon <ja...@gmail.com>wrote:
>>
>>> On Oct 1, 2008, at 11:20 PM, Jason Warner wrote:
>>>
>>>  Is the GBuild stuff in svn the same as the anthill-based code or is that
>>>> something different?  GBuild seems to have scripts for running tck and that
>>>> leads me to think they're the same thing, but I see no mention of anthill in
>>>> the code.
>>>>
>>>
>>> The Anthill stuff is completely different than the GBuild stuff.  I
>>> started out trying to get the TCK automated using GBuild, but decided that
>>> the system lacked too many features to perform as I desired, and went ahead
>>> with Anthill as it did pretty much everything, though had some stability
>>> problems.
>>>
>>> One of the main reasons why I choose Anthill (AHP, Anthill Pro that is)
>>> was its build agent and code repository systems.  This allowed me to ensure
>>> that each build used exactly the desired artifacts.  Another was the
>>> configurable workflow, which allowed me to create a custom chain of events
>>> to handle running builds on remote agents and control what data gets set to
>>> them, what it will collect and what logic to execute once all distributed
>>> work has been completed for a particular build.  And the kicker which help
>>> facilitate bringing it all together was its concept of a build life.
>>>
>>> At the time I could find *no other* build tool which could meet all of
>>> these needs, and so I went with AHP instead of spending months
>>> building/testing features in GBuild.
>>>
>>> While AHP supports configuring a lot of stuff via its web-interface, I
>>> found that it was very cumbersome, so I opted to write some glue, which was
>>> stored in svn here:
>>>
>>>
>>> https://svn.apache.org/viewvc/geronimo/sandbox/build-support/?pathrev=632245
>>>
>>> Its been a while, so I have to refresh my memory on how this stuff
>>> actually worked.  First let me explain about the code repository (what it
>>> calls codestation) and why it was critical to the TCK testing IMO.  When we
>>> use Maven normally, it pulls data from a set of external repositories, picks
>>> up more repositories from the stuff it downloads and quickly we lose
>>> control of where stuff comes from.  After it pulls down all that stuff, it
>>> churns through a build and spits out the stuff we care about, normally
>>> stuffing them (via mvn install) into the local repository.
>>>
>>> AHP supports by default tasks to publish artifacts (really just a set of
>>> files controlled by an Ant-like include/exclude path) from a build agent
>>> into Codestation, as well as tasks to resolve artifacts (ie. download them
>>> from Codestation to the local working directory on the build agents system).
>>>  Each top-level build in AHP gets assigned a new (empty) build life.
>>>  Artifacts are always published to/resolved from a build life, either that
>>> of the current build, or of a dependency build.
>>>
>>> So what I did was I setup builds for Geronimo Server (the normal
>>> server/trunk stuff), which did the normal mvn install thingy, but I always
>>> gave it a custom -Dmaven.local.repository which resolved to something inside
>>> the working directory for the running build.  The build was still online, so
>>> it pulled down a bunch of stuff into an empty local repository (so it was a
>>> clean build wrt the repository, as well as the source code, which was always
>>> fetched for each new build).  Once the build had finished, I used the
>>> artifact publisher task to push *all* of the stuff in the local repository
>>> into Codestation, labeled as something like "Maven repository artifacts" for
>>> the current build life.
>>>
>>> Then I setup another build for Apache Geronimo CTS Server (the
>>> porting/branches/* stuff).  This build was dependent upon the "Maven
>>> repository artifacts" of the Geronimo Server build, and I configured those
>>> artifacts to get installed on the build agents system in the same directory
>>> that I configured the CTS Server build to use for its local maven
>>> repository.  So again the repo started out empty, then got populated with
>>> all of the outputs from the normal G build, and then the cts-server build
>>> was started.  The build of the components and assemblies is normally fairly
>>> quick and aside from some stuff in the private tck repo won't download much
>>> more stuff, because it already had most of its dependencies installed via
>>> the Codestation dependency resolution.   Once the build finished, I
>>> published the cts-server assembly artifacts back to Codestation under like
>>> "CTS Server Assemblies" or something.
>>>
>>> Up until this point its normal builds, but now we have built the G
>>> server, then built the CTS server (using the *exact* artifacts from the G
>>> server build, even though each might have happened on a different build
>>> agent).  And now we need to go and run a bunch of tests, using the *exact*
>>> CTS server assemblies, produce some output, collect it, and once all of the
>>> tests are done render some nice reports, etc.
>>>
>>> AHP supports setting up builds which contain "parallel" tasks, each of
>>> those tasks is then performed by a build agent, they have fancy build agent
>>> selection stuff, but for my needs I had basically 2 groups, one group for
>>> running the server builds, and then another for running the tests.  I only
>>> set aside like 2 agents for builds and the rest for tests.  Oh, I forgot to
>>> mention that I had 2 16x 16g AMD beasts all running CentOS 5, each with
>>> about 10-12 Xen virtual machines running internally to run build agents.
>>>  Each system also had a RAID-0 array setup over 4 disks to help reduce disk
>>> io wait, which was as I found out the limiting factor when trying to run a
>>> ton of builds that all checkout and download artifacts and such.
>>>
>>> I helped the AHP team add a new feature which was a parallel iterator
>>> task, so you define *one* task that internally fires off n parallel tasks,
>>> which would set the iteration number, and leave it up to the build logic to
>>> pick what to do based on that index.  The alternative was an unwieldy set of
>>> like 200 tasks in their UI which simply didn't work at all.  You might have
>>> noticed an "iterations.xml" file in the tck-testsuite directory, this was
>>> used to take an iteration number and turn it into what tests we actually
>>> run.  The <iteration> bits are order sensitive in that file.
>>>
>>> Soooo, after we have a CTS Server for a particular G Server build, we can
>>> now go and do "runtests" for a specific set of tests (defined by an
>>> iteration)... this differed from the other builds above a little, but still
>>> pulled down artifacts, the CTS Server assemblies (only the assemblies and
>>> the required bits to run the geronimo-maven-plugin, which was used to
>>> geronimo:install, as well as used by the tck itself to fire up the server
>>> and so on).  The key thing here, with regards to the maven configuration
>>> (besides using that custom Codestation populated repository) was that the
>>> builds were run *offline*.
>>>
>>> After runtests completed, the results are then soaked up (the stuff that
>>> javatest pukes out with icky details, as well as the full log files and
>>> other stuff I can recall) and then pushed back into Codestation.
>>>
>>> Once all of the iterations were finished, another task fires off which
>>> generates a report.  It does this by downloading from Codestation all of the
>>> runtests outputs (each was zipped I think), unzips them one by one, runs some
>>> custom goo I wrote (based on some of the concepts from original stuff from the
>>> GBuild-based TCK automation), and generates a nice Javadoc-like report that
>>> includes all of the gory details.
>>>
>>> I can't remember how long I spent working on this... too long (not the
>>> reports I mean, the whole system).  But in the end I recall something like
>>> running an entire TCK testsuite for a single server configuration (like
>>> jetty) in about 4-6 hours... I sent mail to the list with the results, so if
>>> you are curious what the real number is, instead of my guess, you can look
>>> for it there.  But anyway it was damn quick running on just those 2
>>> machines.  And I *knew* exactly that each of the distributed tests was
>>> actually testing a known build that I could trace back to its artifacts and
>>> then back to its SVN revision, without worrying about mvn downloading
>>> something new when midnight rolled over or that a new G server or CTS server
>>> build that might be in progress hasn't compromised the testing by polluting
>>> the local repository.
>>>
>>>  * * *
>>>
>>> So, about the sandbox/build-support stuff...
>>>
>>> First there is the 'harness' project, which is rather small, but contains
>>> the basic stuff, like a version of ant and maven which all of these builds
>>> would use, some other internal glue, a  fix for an evil Maven problem
>>> causing erroneous build failures due to some internal thread state
>>> corruption or gremlins, not sure which.  I kinda used this project to help
>>> manage the software needed by normal builds, which is why Ant and Maven were
>>> in there... ie. so I didn't have to go install it on each agent each time it
>>> changed, just let the AHP system deal with it for me.
>>>
>>> This was set up as a normal AHP project, built using its internal Ant
>>> builder (though having that builder configured still to use the local
>>> version it pulled from SVN to ensure it always works).
>>>
>>> Each other build was setup to depend on the output artifacts from the
>>> build harness build, using the latest in a range, like say using "3.*" for
>>> the latest 3.x build (which looks like that was 3.7).  This let me work on
>>> new stuff w/o breaking the current builds as I hacked things up.
>>>
>>> So, in addition to all of the stuff I mentioned above wrt the G and CTS
>>> builds, each also had this step which resolved the build harness artifacts
>>> to that working directory, and the Maven builds were always run via the
>>> version of Maven included from the harness.  But, AHP didn't actually run
>>> that version of Maven directly, it used its internal Ant task to execute the
>>> version of Ant from the harness *and* use the harness.xml buildfile.
>>>
>>> The harness.xml stuff is some more goo which I wrote to help manage AHP
>>> configurations.  With AHP (at that time, not sure if it has changed) you had
>>> to do most everything via the web UI, which sucked, and it was hard to
>>> refactor sets of projects and so on.  So I came up with a standard set of
>>> tasks to execute for a project, then put all of the custom muck I needed
>>> into what I called a _library_ and then had the AHP via harness.xml invoke
>>> it with some configuration about what project it was and other build
>>> details.
>>>
>>> The actual harness.xml is not very big, it simply makes sure that */bin/*
>>> is executable (codestation couldn't preserve execute bits), uses the
>>> Codestation command-line client (invoking the javaclass directly though) to
>>> ask the repository to resolve artifacts from the "Build Library" to the
>>> local repository.  I had this artifact resolution separate from the normal
>>> dependency (or harness) artifact resolution so that it was easier for me to
>>> fix problems with the library while a huge set of TCK iterations were still
>>> queued up to run.  Basically, if I noticed a problem due to a code or
>>> configuration issue in an early build, I could fix it, and use the existing
>>> builds to verify the fix, instead of wasting an hour (sometimes more
>>> depending on networking problems accessing remote repos while building the
>>> servers) to rebuild and start over.
>>>
>>> This brings us to the 'libraries' project.  In general the idea of a
>>> _library_ was just a named/versioned collection of files, which could be
>>> used by a project.  The main (er only) library defined in this SVN is
>>> system/.  This is the groovy glue which made everything work.  This is where
>>> the entry-point class is located (the guy who gets invoked via harness.xml
>>> via:
>>>
>>>    <target name="harness" depends="init">
>>>        <groovy>
>>>            <classpath>
>>>                <pathelement location="${library.basedir}/groovy"/>
>>>            </classpath>
>>>
>>>            gbuild.system.BuildHarness.bootstrap(this)
>>>        </groovy>
>>>    </target>
>>>
>>> I won't go into too much detail on this stuff now, take a look at it and
>>> ask questions.  But, basically there is stuff in gbuild.system.* which is
>>> harness support muck, and stuff in gbuild.config.* which contains
>>> configuration.  I was kinda mid-refactoring of some things, starting to add
>>> new features, not sure where I left off actually. But the key bits are in
>>> gbuild.config.project.*  This contains a package for each project, with the
>>> package name being the same as the AHP project (with " " -> "_"). And then
>>> in each of those packages is at least a Controller.groovy class (or other
>>> classes if special muck was needed, like for the report generation in
>>> Geronimo_CTS, etc).
>>>
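>>> The lookup on the harness side amounts to roughly this (a sketch only;
>>> the real BuildHarness does its own class loading, and the project name
>>> literal is just an example):
>>>
>>>    def projectName = 'Geronimo CTS'                 // handed in from the AHP build properties
>>>    def packageName = projectName.replace(' ', '_')  // "Geronimo CTS" -> "Geronimo_CTS"
>>>    def controller = Class.forName("gbuild.config.project.${packageName}.Controller").newInstance()
>>>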
>>> The controller defines a set of actions, implemented as Groovy closures
>>> bound to properties of the Controller class.  One of the properties passed
>>> in from the AHP configuration (configured via the Web UI, passed to the
>>> harness.xml build, and then on to the Groovy harness) was the name of the
>>> _action_ to execute.  Most of that stuff should be fairly straightforward.
>>>
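>>> A stripped-down controller, just to show the shape (the action names and
>>> the params property below are illustrative, not the real contract):
>>>
>>>    class Controller {
>>>        def ant = new AntBuilder()
>>>
>>>        // settings handed down from the AHP web UI via harness.xml (illustrative)
>>>        def params = [:]
>>>
>>>        // actions are plain closures bound to properties
>>>        def build = {
>>>            ant.exec(executable: 'mvn', failonerror: true) {
>>>                arg(value: 'install')
>>>            }
>>>        }
>>>
>>>        def runtests = {
>>>            ant.echo(message: "running TCK iteration ${params.iteration}")
>>>        }
>>>    }
>>>
>>>    // the harness invokes whichever action name was passed in
>>>    def c = new Controller(params: [action: 'runtests', iteration: '12'])
>>>    c."${c.params.action}"()
>>>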
>>> So after a build is started (maybe from a Web UI click, or SVN change
>>> detection, or a TCK runtests iteration) the following happens (in simplified
>>> terms):
>>>
>>>  * Agent starts build
>>>  * Agent cleans its working directory
>>>  * Agent downloads the build harness
>>>  * Agent downloads any dependencies
>>>  * Agent invokes Ant on harness.xml, passing in some details
>>>  * Harness.xml downloads the system/1 library
>>>  * Harness.xml runs gbuild.system.BuildHarness
>>>  * BuildHarness tries to construct a Controller instance for the project
>>>  * BuildHarness tries to find Controller action to execute
>>>  * BuildHarness executes the Controller action
>>>  * Agent publishes output artifacts
>>>  * Agent completes build
>>>
>>> A few extra notes on libraries, the JavaEE TCK requires a bunch of stuff
>>> we get from Sun to execute.  This stuff isn't small, but is for the most
>>> part read-only.  So I setup a location on each build agent where these files
>>> were installed to.  I created AHP projects to manage them and treated them
>>> like a special "library" one which tried really hard not to go fetch its
>>> content unless the local content was out of date.  This helped speed up the
>>> entire build process... cause that delete/download of all that muck really
>>> slows down 20 agents running in parallel on 2 big machines with striped
>>> arrays.  For legal reasons this stuff was not kept in svn.apache.org's
>>> main repository, and for logistical reasons wasn't kept in the private tck
>>> repo on svn.apache.org either.  Because there were so many files, and
>>> because the httpd configuration on svn.apache.org kicks out requests that
>>> it thinks are *bunk* to help save the resources for the community, I had
>>> setup a private ssl-secured svn repository on the old gbuild.org
>>> machines to put in the full muck required, then setup some goo in the
>>> harness to resolve them.  This goo is all in gbuild.system.library.*  See
>>> the gbuild.config.projects.Geronimo_CTS.Controller for more of how it was
>>> actually used.
>>>
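>>> The freshness check for those installs was nothing fancy, think along
>>> these lines (the paths and the marker file are illustrative):
>>>
>>>    def wanted = '5.0-some-update'              // version the build asks for (made up)
>>>    def root = new File('/var/build/tck-libs')
>>>    def marker = new File(root, '.version')
>>>
>>>    if (marker.exists() && marker.text.trim() == wanted) {
>>>        println "TCK library ${wanted} already installed, skipping the fetch"
>>>    }
>>>    else {
>>>        println "Fetching TCK library ${wanted} from the private svn repo..."
>>>        // svn export / unpack into root would go here
>>>        marker.write(wanted)
>>>    }
>>>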
>>>  * * *
>>>
>>> Okay, that is about all the brain-dump for TCK muck I have in me for
>>> tonight.  Reply with questions if you have any.
>>>
>>> Cheers,
>>>
>>> --jason
>>>
>>>
>>>
>>
>>
>> --
>> ~Jason Warner
>>
>>
>>
>
>
> --
> ~Jason Warner
>



-- 
~Jason Warner

Re: Continuous TCK Testing

Posted by Jason Warner <ja...@gmail.com>.
Here's a quick question.  Where does AHP come from?

On Mon, Oct 6, 2008 at 1:18 PM, Jason Dillon <ja...@gmail.com> wrote:

> Sure np, took me a while to get around to writing it too ;-)
> --jason
>
>
> On Oct 6, 2008, at 10:24 PM, Jason Warner wrote:
>
> Just got around to reading this.  Thanks for the brain dump, Jason.  No
> questions as of yet, but I'm sure I'll need a few more reads before I
> understand it all.
>
> --
> ~Jason Warner
>
>
>


-- 
~Jason Warner

Re: Continuous TCK Testing

Posted by Jason Dillon <ja...@gmail.com>.
Sure np, took me a while to get around to writing it too ;-)

--jason


On Oct 6, 2008, at 10:24 PM, Jason Warner wrote:

> Just got around to reading this.  Thanks for the brain dump, Jason.   
> No questions as of yet, but I'm sure I'll need a few more reads  
> before I understand it all.
>
> -- 
> ~Jason Warner


Re: Continuous TCK Testing

Posted by Jason Warner <ja...@gmail.com>.
Just got around to reading this.  Thanks for the brain dump, Jason.  No
questions as of yet, but I'm sure I'll need a few more reads before I
understand it all.



-- 
~Jason Warner

Re: Continuous TCK Testing

Posted by Jason Dillon <ja...@gmail.com>.
On Oct 1, 2008, at 11:20 PM, Jason Warner wrote:

> Is the GBuild stuff in svn the same as the anthill-based code or is  
> that something different?  GBuild seems to have scripts for running  
> tck and that leads me to think they're the same thing, but I see no  
> mention of anthill in the code.

The Anthill stuff is completely different than the GBuild stuff.  I  
started out trying to get the TCK automated using GBuild, but decided  
that the system lacked too many features to perform as I desired, and  
went ahead with Anthill as it did pretty much everything, though had  
some stability problems.

One of the main reasons why I chose Anthill (AHP, Anthill Pro that
is) was its build agent and code repository systems.  This allowed me  
to ensure that each build used exactly the desired artifacts.  Another  
was the configurable workflow, which allowed me to create a custom  
chain of events to handle running builds on remote agents and control  
what data gets sent to them, what it will collect and what logic to
execute once all distributed work has been completed for a particular  
build.  And the kicker which helped facilitate bringing it all together
was its concept of a build life.

At the time I could find *no other* build tool which could meet all of  
these needs, and so I went with AHP instead of spending months  
building/testing features in GBuild.

While AHP supports configuring a lot of stuff via its web-interface, I  
found that it was very cumbersome, so I opted to write some glue,  
which was stored in svn here:

     https://svn.apache.org/viewvc/geronimo/sandbox/build-support/?pathrev=632245

It's been a while, so I have to refresh my memory on how this stuff
actually worked.  First let me explain about the code repository (what  
it calls codestation) and why it was critical to the TCK testing IMO.   
When we use Maven normally, it pulls data from a set of external  
repositories, picks up more repositories from the stuff it downloads  
and quickly we loose control where stuff comes from.  After it pulls  
down all that stuff, it churns though a build and spits out the stuff  
we care about, normally stuffing them (via mvn install) into the local  
repository.

AHP supports by default tasks to publish artifacts (really just a set  
of files controlled by an Ant-like include/exclude path) from a build  
agent into Codestation, as well as tasks to resolve artifacts (ie.  
download them from Codestation to the local working directory on the  
build agents system).  Each top-level build in AHP gets assigned a new  
(empty) build life.  Artifacts are always published to/resolved from a  
build life, either that of the current build, or of a dependency build.

So what I did was I setup builds for Geronimo Server (the normal  
server/trunk stuff), which did the normal mvn install thingy, but I  
always gave it a custom -Dmaven.local.repository which resolved to  
something inside the working directory for the running build.  The  
build was still online, so it pulled down a bunch of stuff into an  
empty local repository (so it was a clean build wrt the repository, as  
well as the source code, which was always fetched for each new  
build).  Once the build had finished, I used the artifact publisher  
task to push *all* of the stuff in the local repository into  
Codestation, labeled as something like "Maven repository artifacts"
the current build life.
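
In other words, per top-level build it was roughly the following (a sketch;
the paths are made up, and the publish itself was an AHP artifact set rather
than a hand-rolled zip):

    def ant = new AntBuilder()
    def localRepo = 'work/build-1234/.repository'   // throwaway repo, starts out empty

    ant.mkdir(dir: localRepo)
    ant.exec(executable: 'mvn', failonerror: true) {
        arg(value: "-Dmaven.local.repository=${localRepo}")
        arg(value: 'install')
    }

    // the "Maven repository artifacts" set was then just everything under
    // that directory, selected with Ant-style include/exclude patterns
    ant.zip(destfile: 'maven-repository-artifacts.zip') {
        fileset(dir: localRepo) {
            include(name: '**/*')
        }
    }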

Then I set up another build for Apache Geronimo CTS Server (the porting/ 
branches/* stuff).  This build was dependent upon the "Maven  
repository artifacts" of the Geronimo Server build, and I configured  
those artifacts to get installed on the build agent's system in the  
same directory that I configured the CTS Server build to use for its  
local Maven repository.  So again the repo started out empty, then got  
populated with all of the outputs from the normal G build, and then  
the cts-server build was started.  The build of the components and  
assemblies is normally fairly quick and, aside from some stuff in the  
private tck repo, won't download much more stuff, because it already  
had most of its dependencies installed via the Codestation dependency  
resolution.   Once the build finished, I published the cts-server  
assembly artifacts back to Codestation under something like "CTS  
Server Assemblies".

Up to this point these are normal builds, but now we have built the G  
server, then built the CTS server (using the *exact* artifacts from  
the G server build, even though each might have happened on a  
different build agent).  And now we need to go and run a bunch of  
tests, using the *exact* CTS server assemblies, produce some output,  
collect it, and once all of the tests are done render some nice  
reports, etc.

AHP supports setting up builds which contain "parallel" tasks, each of  
which is then performed by a build agent.  They have fancy build  
agent selection stuff, but for my needs I had basically 2 groups, one  
group for running the server builds, and then another for running the  
tests.  I only set aside like 2 agents for builds and the rest for  
tests.  Oh, I forgot to mention that I had two 16-way, 16GB AMD beasts,  
both running CentOS 5, each with about 10-12 Xen virtual machines  
running internally to host build agents.  Each system also had a RAID-0  
array set up over 4 disks to help reduce disk I/O wait, which, as I  
found out, was the limiting factor when trying to run a ton of builds  
that all check out and download artifacts and such.

I helped the AHP team add a new feature which was a parallel iterator  
task, so you define *one* task that internally fires off n parallel  
tasks, which would set the iteration number and leave it up to the  
build logic to pick what to do based on that index.  The alternative  
was an unwieldy set of like 200 tasks in their UI which simply didn't  
work at all.  You might have noticed an "iterations.xml" file in the  
tck-testsuite directory; this was used to take an iteration  
number and turn it into the tests we actually run.  The <iteration>  
bits are order-sensitive in that file.
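
I don't remember the exact format of that file off-hand, but the gist  
of the mapping was something like this (hypothetical element and  
attribute names, just to show the index -> tests lookup):

    // Hypothetical sketch only -- the real iterations.xml format may differ.
    // The parallel iterator task hands the build an index; the build logic
    // turns that index into the set of TCK tests to run, relying on the
    // order of the <iteration> elements.
    def iterationIndex = 3   // would be passed in by the AHP iterator task

    def doc = new XmlSlurper().parse(new File('iterations.xml'))
    def iterations = doc.iteration        // order-sensitive list
    def selected = iterations[iterationIndex]

    println "Iteration ${iterationIndex} runs: ${selected.@tests}"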

Soooo, after we have a CTS Server for a particular G Server build, we  
can now go and do "runtests" for a specific set of tests (defined by an  
iteration)... this differed from the other builds above a little, but  
still pulled down artifacts: the CTS Server assemblies (only the  
assemblies and the required bits to run the geronimo-maven-plugin,  
which was used to geronimo:install, as well as used by the tck itself  
to fire up the server and so on).  The key thing here, with regard to  
the Maven configuration (besides using that custom Codestation-populated  
repository), was that the builds were run *offline*.

After runtests completed, the results were then soaked up (the stuff  
that javatest pukes out with icky details, as well as the full log  
files and other stuff as I recall) and then pushed back into  
Codestation.

Once all of the iterations were finished, another task fired off which  
generated a report.  It did this by downloading from Codestation all  
of the runtests outputs (each was zipped I think), unzipping them one  
by one, running some custom goo I wrote (based on some of the concepts  
from the original GBuild-based TCK automation), and generating a  
nice Javadoc-like report that includes all of the gory details.
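
The actual report goo lives with the Geronimo_CTS config, but the  
overall shape was roughly like this (illustrative only; the file names  
and directory layout here are made up):

    // Rough sketch, not the actual report code: unzip each iteration's
    // runtests output and tally the javatest (.jtr) results before
    // rendering the report.
    def ant = new AntBuilder()
    def resultsDir = new File('runtests-results')   // hypothetical layout
    def workDir = new File('report-work')
    workDir.mkdirs()

    resultsDir.listFiles().findAll { it.name.endsWith('.zip') }.each { zip ->
        ant.unzip(src: zip.path, dest: new File(workDir, zip.name - '.zip').path)
    }

    def summary = [pass: 0, fail: 0]
    workDir.eachFileRecurse { f ->
        if (f.file && f.name.endsWith('.jtr')) {
            def text = f.text
            if (text.contains('Passed.')) summary.pass++
            else if (text.contains('Failed.')) summary.fail++
        }
    }
    println "TCK results: ${summary.pass} passed, ${summary.fail} failed"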

I can't remember how long I spent working on this... too long (not the  
reports I mean, the whole system).  But in the end I recall something  
like running an entire TCK testsuite for a single server configuration  
(like jetty) in about 4-6 hours... I sent mail to the list with the  
results, so if you are curious what the real number is, instead of my  
guess, you can look for it there.  But anyway it was damn quick  
running on just those 2 machines.  And I *knew* exactly that each of  
the distributed tests was actually testing a known build that I could  
trace back to its artifacts and then back to its SVN revision, without  
worrying about mvn downloading something new when midnight rolled over,  
or about a new G server or CTS server build that might be in progress  
compromising the testing by polluting the local repository.

  * * *

So, about the sandbox/build-support stuff...

First there is the 'harness' project, which is rather small, but  
contains the basic stuff, like a version of Ant and Maven which all of  
these builds would use, some other internal glue, and a fix for an evil  
Maven problem causing erroneous build failures due to some internal  
thread state corruption or gremlins, not sure which.  I kinda used  
this project to help manage the software needed by normal builds,  
which is why Ant and Maven were in there... i.e. so I didn't have to go  
install them on each agent each time they changed, just let the AHP  
system deal with it for me.

This was set up as a normal AHP project, built using its internal Ant  
builder (though that builder was still configured to use the local  
version pulled from SVN, to ensure it always works).

Each other build was set up to depend on the output artifacts from the  
build harness build, using the latest in a range, like say using "3.*"  
for the latest 3.x build (which looks like that was 3.7).  This let me  
work on new stuff w/o breaking the current builds as I hacked things up.

So, in addition to all of the stuff I mentioned above wrt the G and  
CTS builds, each also had this step which resolved the build harness  
artifacts to that working directory, and the Maven builds were always  
run via the version of Maven included from the harness.  But AHP  
didn't actually run that version of Maven directly; it used its  
internal Ant task to execute the version of Ant from the harness *and*  
use the harness.xml buildfile.

The harness.xml stuff is some more goo which I wrote to help manage AHP  
configurations.  With AHP (at that time, not sure if it has changed)  
you had to do most everything via the web UI, which sucked, and it was  
hard to refactor sets of projects and so on.  So I came up with a  
standard set of tasks to execute for a project, then put all of the  
custom muck I needed into what I called a _library_, and then had  
AHP, via harness.xml, invoke it with some configuration about what  
project it was and other build details.

The actual harness.xml is not very big; it simply makes sure that */ 
bin/* is executable (Codestation couldn't preserve execute bits), and  
uses the Codestation command-line client (invoking the Java class  
directly though) to ask the repository to resolve artifacts from the  
"Build Library" to the local repository.  I had this artifact resolution  
separate from the normal dependency (or harness) artifact resolution  
so that it was easier for me to fix problems with the library while a  
huge set of TCK iterations was still queued up to run.  Basically, if  
I noticed a problem due to a code or configuration issue in an early  
build, I could fix it and use the existing builds to verify the fix,  
instead of wasting an hour (sometimes more, depending on networking  
problems accessing remote repos while building the servers) to rebuild  
and start over.

This brings us to the 'libraries' project.  In general the idea of a  
_library_ was just a named/versioned collection of files which  
could be used by a project.  The main (er, only) library defined in  
this SVN is system/.  This is the Groovy glue which made everything  
work.  This is where the entry-point class is located (the guy who  
gets invoked via harness.xml via:

     <target name="harness" depends="init">
         <groovy>
             <classpath>
                 <pathelement location="${library.basedir}/groovy"/>
             </classpath>

             gbuild.system.BuildHarness.bootstrap(this)
         </groovy>
     </target>

I won't go into too much detail on this stuff now; take a look at it  
and ask questions.  But basically there is stuff in gbuild.system.*  
which is harness support muck, and stuff in gbuild.config.* which  
contains configuration.  I was kinda mid-refactoring of some things,  
starting to add new features, not sure where I left off actually. But  
the key bits are in gbuild.config.project.*  This contains a package  
for each project, with the package name being the same as the AHP  
project (with " " -> "_"). And then in each of those packages is at  
least a Controller.groovy class (or other classes if special muck was  
needed, like for the report generation in Geronimo_CTS, etc).

The controller defines a set of actions, implemented as Groovy  
closures bound to properties of the Controller class.  One of the  
properties passed in from the AHP configuration (configured via the  
Web UI, passed to the harness.xml build, and then on to the Groovy  
harness) was the name of the _action_ to execute.  Most of that stuff  
should be fairly straightforward.
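
To give you an idea of the shape (names assumed here, this isn't a  
copy of the actual source), a Controller looks something like:

    // Sketch of a per-project Controller: each action is a closure bound
    // to a property, and the AHP configuration picks which one to run.
    class Controller {
        def ant = new AntBuilder()

        // "build" action: what a normal server build might do
        def build = {
            ant.echo(message: 'resolve sources, run Maven, publish artifacts...')
        }

        // "runtests" action: what a single TCK iteration might do
        def runtests = {
            ant.echo(message: 'install the CTS assembly, start the server, run javatest...')
        }
    }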

So after a build is started (maybe from a Web UI click, or SVN change  
detection, or a TCK runtests iteration) the following happens (in  
simplified terms; a sketch of the dispatch step follows the list):

  * Agent starts build
  * Agent cleans its working directory
  * Agent downloads the build harness
  * Agent downloads any dependencies
  * Agent invokes Ant on harness.xml, passing in some details
  * Harness.xml downloads the system/1 library
  * Harness.xml runs gbuild.system.BuildHarness
  * BuildHarness tries to construct a Controller instance for the  
project
  * BuildHarness tries to find Controller action to execute
  * BuildHarness executes the Controller action
  * Agent publishes output artifacts
  * Agent completes build
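
The dispatch step in the middle of that list boils down to something  
like this (a self-contained sketch, not the real gbuild.system.BuildHarness;  
the project and action names are just illustrative):

    // Sketch of the BuildHarness dispatch idea: pick the controller for
    // the project, look up the requested action by name, and invoke it.
    class DemoController {
        def build    = { println 'building the server...' }
        def runtests = { println 'running one TCK iteration...' }
    }

    def actionName = 'runtests'            // passed down from the AHP configuration
    def controller = new DemoController()  // the real harness resolved this from the
                                           // project name, roughly
                                           // gbuild.config.projects.<project>.Controller

    def action = controller."${actionName}"   // actions are just closure properties
    assert action instanceof Closure
    action()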

A few extra notes on libraries: the JavaEE TCK requires a bunch of  
stuff we get from Sun to execute.  This stuff isn't small, but is for  
the most part read-only.  So I set up a location on each build agent  
where these files were installed.  I created AHP projects to manage  
them and treated them like a special "library", one which tried really  
hard not to go fetch its content unless the local content was out of  
date.  This helped speed up the entire build process... cause that  
delete/download of all that muck really slows down 20 agents running  
in parallel on 2 big machines with striped arrays.  For legal reasons  
this stuff was not kept in svn.apache.org's main repository, and for  
logistical reasons wasn't kept in the private tck repo on  
svn.apache.org either.  Because there were so many files, and because  
the httpd configuration on svn.apache.org kicks out requests that it  
thinks are *bunk* to help save resources for the community, I had  
set up a private SSL-secured svn repository on the old  
gbuild.org machines to hold the full muck required, then set up some  
goo in the harness to resolve them.  This goo is all in  
gbuild.system.library.*  See  
gbuild.config.projects.Geronimo_CTS.Controller for more of how it was  
actually used.

  * * *

Okay, that is about all the brain-dump for TCK muck I have in me for  
tonight.  Reply with questions if you have any.

Cheers,

--jason



Re: Continuous TCK Testing

Posted by Jason Warner <ja...@gmail.com>.
Is the GBuild stuff in svn the same as the anthill-based code or is that
something different?  GBuild seems to have scripts for running tck and that
leads me to think they're the same thing, but I see no mention of anthill in
the code.

On Wed, Oct 1, 2008 at 9:56 AM, Kevan Miller <ke...@gmail.com> wrote:

>
> Not seeing too much progress here.  Has anyone dug up the Anthill-based
> code? I'll have a look.
>
> --kevan
>



-- 
~Jason Warner