Posted to general@gump.apache.org by Stefano Mazzocchi <st...@apache.org> on 2004/12/07 00:13:28 UTC

[RT] Gump 3.0

I've been working for a while to describe an improved architecture for 
Gump and I have decided to "go public" with the discussion because I 
want this to be a community effort.

                               - o -

First and foremost, I believe that gump is one of the most exciting 
things to have happened in the software space over the last few years, 
but I also think that technical, architectural and social limitations 
are stopping it from exhibiting its real potential.

The biggest problem I have is the fact that gump is such an integrated 
system: it tries to do too much in one single stage.

Don't get me wrong: the internals of gump 2.x are rather modular and 
well architected, but the overall system architecture is too monolithic.

So, here is my first suggestion: split gump in three stages.

  1) metadata aggregation
  2) build
  3) build data use

                                - o -

Stage 1: Metadata aggregation
-----------------------------

Gump will socially scale only when the metadata about the problem is 
taken care of by the people who administer the project rather than by a 
few gump meisters.

In this regard, I believe Maven to be far superior to ant in terms of 
gump-friendliness because of its completely declarative nature (ant 
builds are a functional language, from which project metadata cannot be 
transparently inferred).

In a perfect world, all projects would *need* a metadata representation 
of their structure, so that a build tool can parse it and understand 
what the project needs.

In the real world, there are two camps:

  1) procedural: make, configure, sh, ant
  2) declarative: maven, apt-get, ports

and the second normally builds on the first.

The absolute need for gump (or apt-get, or BSD ports) is to have a 
"declarative" layer on top of the "procedural" one for every project, a 
'semantic' layer that the system can understand and work on.

Debian shows that it's possible to socially scale the concept of adding 
a semantic layer on top of existing project efforts, in a completely 
independent fashion.

Maven shows that it's possible for the projects themselves to make good 
use of this information (also calling ant when special needs require it).

The important point for gump is that having maven generate gump 
descriptors is both stupid and inefficient: gump should be able to 
digest the maven POM directly, without requiring any effort from the 
project.
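
To make "digest the POM directly" concrete, here is a rough sketch (in 
Python, since that's what gump 2.x is written in) of an adapter that 
pulls the declared dependencies out of a Maven 1 project.xml. The 
element names and the fallback to <id> are assumptions about the 
descriptor layout, not existing gump code:

    import xml.etree.ElementTree as ET

    def _text(element, tag):
        node = element.find(tag)
        return node.text.strip() if node is not None and node.text else None

    def dependencies_from_pom(path):
        """Return (groupId, artifactId, version) tuples declared in a POM."""
        # NOTE: element names below are assumptions about Maven 1's layout
        root = ET.parse(path).getroot()
        deps = []
        for dep in root.findall("./dependencies/dependency"):
            deps.append((_text(dep, "groupId") or _text(dep, "id"),
                         _text(dep, "artifactId") or _text(dep, "id"),
                         _text(dep, "version")))
        return deps

    if __name__ == "__main__":
        for group, artifact, version in dependencies_from_pom("project.xml"):
            print(group, artifact, version)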

We should be maintaining the metadata representation only for the projects 
that don't have that data integrated in their build system (like pure 
ant projects or make/configure projects).

So, what is a metadata aggregation layer?

It's a crawler for project metadata: it crawls projects and their 
descriptors and aggregates them in a service that can be queried to 
obtain that information.

In short

    [bunch of locations] --> crawler --> metadata database
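
A rough sketch of what such a crawler could look like; the URL list, the 
table layout and the use of sqlite are purely illustrative assumptions, 
not a design decision:

    import sqlite3, time, urllib.request

    # project name -> where its descriptor lives (placeholder URLs)
    DESCRIPTOR_URLS = {
        "ant":    "https://example.org/ant/gump.xml",
        "cocoon": "https://example.org/cocoon/project.xml",
    }

    def crawl(db_path="metadata.db"):
        db = sqlite3.connect(db_path)
        db.execute("""CREATE TABLE IF NOT EXISTS project_metadata (
                        name TEXT PRIMARY KEY,
                        descriptor TEXT,
                        fetched_at REAL)""")
        for name, url in DESCRIPTOR_URLS.items():
            # fetch the raw descriptor; parsing/normalising can happen later
            with urllib.request.urlopen(url) as response:
                xml = response.read().decode("utf-8", "replace")
            db.execute("REPLACE INTO project_metadata VALUES (?, ?, ?)",
                       (name, xml, time.time()))
        db.commit()
        db.close()

    if __name__ == "__main__":
        crawl()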

                              - o -

Stage 2: Build
--------------

This is what we today think of as "gump". In short, it's the service that 
uses the project metadata, does the fetching, preparing and building, and 
generates a bunch of data as a result.

The difference from today's gump is that this "build-only gump" outputs 
data into a database, not into HTML pages or RSS feeds. The build 
stage and the data-use stage are separated.

In short:

    metadata database --> gump --> build data database
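
Sketched in the same illustrative terms (table names, sqlite and the 
hard-coded "ant" command are assumptions, not how the real builder 
would work):

    import sqlite3, subprocess, time

    def build_all(metadata_db="metadata.db", results_db="builddata.db"):
        meta = sqlite3.connect(metadata_db)
        results = sqlite3.connect(results_db)
        results.execute("""CREATE TABLE IF NOT EXISTS build_results (
                             project TEXT, state TEXT, log TEXT, started_at REAL)""")
        for (name,) in meta.execute("SELECT name FROM project_metadata"):
            started = time.time()
            # a real builder would derive the command (and the dependency
            # order) from the project's descriptor; "ant" is a placeholder
            proc = subprocess.run(["ant", "-f", "%s/build.xml" % name],
                                  capture_output=True, text=True)
            state = "success" if proc.returncode == 0 else "failed"
            results.execute("INSERT INTO build_results VALUES (?, ?, ?, ?)",
                            (name, state, proc.stdout + proc.stderr, started))
        results.commit()

    if __name__ == "__main__":
        build_all()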

                               - o -

Stage 3: Build Data Use
-----------------------

This is what is performed today by the 'actors' inside Gump 2.x; the 
current actors are:

  1) document
  2) repository
  3) notify
  4) stats
  5) syndication
  6) timing
  7) rdf
  8) mysql
  9) results

We could aggregate them into the following taxonomy:

  [web]
    [html]
     document -> creates the forrest output
     results -> creates the XHTML output
      stats -> does the stats part
      timing -> does the timing part
    [others]
     syndication -> does the RSS feeds
     RDF -> does the RDF descriptors
  [email]
    notify -> notifies the mail lists
  [history]
    mysql -> saves historical data
    repository -> saves the built jar files

My suggestion is to move all of those out of stage 2, keeping only the 
"historical" actors there (basically pumping all the data into the 
historical database), and let the others reside in stage 3.

So, for stage 3 I see two possible services:

  1) the web service, taking care of things like:
      - web pages
      - historical graphs
      - syndication of results

  2) the notification service, taking care of sending emails to the 
various projects

In short:

    metadata database   --+  +--> email notifier
                          +--+
    build data database --+  +--> webapp
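
A stage-3 consumer then only *reads* the databases. A minimal sketch, 
with a plain-text summary standing in for what the webapp or the 
notifier would actually produce (table names as in the sketches above):

    import sqlite3

    def latest_results(results_db="builddata.db"):
        # table/column names are the illustrative ones from the earlier sketches
        db = sqlite3.connect(results_db)
        # sqlite returns the bare columns from the row matching MAX()
        rows = db.execute("""SELECT project, state, MAX(started_at)
                               FROM build_results GROUP BY project""")
        return [(project, state) for project, state, _ in rows]

    if __name__ == "__main__":
        for project, state in latest_results():
            print("%-20s %s" % (project, state))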

                          - o -

Advantages
----------

This new architecture has several advantages:

  1) the concerns are more easily separated, which also means that different 
stages can be built using different languages. The webapp that I'm working 
on, for example (codename 'dynagump', located in 
http://svn.apache.org/repos/asf/gump/dynagump/trunk), is a Cocoon 
application.

  2) by decoupling the architecture, it's easier to have multiple 
machines running the second stage in parallel (either controlled by us or 
simply donated by users), for example:

                         --- Debian on x86 ---
                        /                     \
                       /                       v
      metadata database ---- MacOSX on PPC ---> build data database
                       \                       ^
                        \                     /
                         --- WinXP on x86 ----

  *and* it is also easier to install a "build stage" on a given machine, 
since the metadata bootstrap phase should be done automatically. For 
example, it should be sufficient to say "gump build asf:cocoon" for the 
whole system to be prepared, packaged and ready to go.

  3) also, by allowing gump to adapt the existing descriptors into a 
database form, it's easier to empower users by either allowing them to 
maintain their data in the original form (i.e. Maven descriptors) or to 
adapt/modify the data in the database directly (for example, through a web 
application).

  4) the contracts between the stages are databases; once these models 
are codified, it's possible for the three stages to work in complete 
isolation, without affecting one another.
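
To make the "databases are the contract" point tangible, here is a tiny 
sketch of a shared schema module that every stage would apply before 
reading or writing. The tables just mirror the illustrative ones above; 
they are not a proposed final model:

    import sqlite3

    # assumed, illustrative schema -- not an existing or agreed gump model
    SCHEMA = """
    CREATE TABLE IF NOT EXISTS project_metadata (
        name        TEXT PRIMARY KEY,  -- project identifier, e.g. 'asf:cocoon'
        descriptor  TEXT,              -- raw gump/maven descriptor as fetched
        fetched_at  REAL
    );
    CREATE TABLE IF NOT EXISTS build_results (
        project     TEXT,
        state       TEXT,              -- 'success' / 'failed' / ...
        log         TEXT,
        started_at  REAL
    );
    """

    def ensure_schema(db_path):
        """Apply the shared schema before a stage reads or writes the database."""
        db = sqlite3.connect(db_path)
        db.executescript(SCHEMA)
        db.commit()
        db.close()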

Comments?

-- 
Stefano.




Re: [RT] Gump 3.0

Posted by Stefano Mazzocchi <st...@apache.org>.
Adam R. B. Jack wrote:
> Ok, here is my thinking on how we proceed towards Gump 3.0, i.e.:
> 
>     1) Metadata Gathering
>     2) Processing (Build/Sync/Update)
>     3) Results/Presentation/History Query/Analysis
> 
>  ------------------------------------------------------------------------
> For *now* ...
> 
> 1) Phase One (Metadata Gathering) is simply the way to get XML documentation
> into a local file system for Gump to process. Eventually this could be
> crawlers (etc.) that parse GOMs and POMs, but (for now) the CVS update &
> HTTP gets are tolerable. [If anybody has an itch to tackle this first, speak
> up, but I think it is a reasonable/significant amount of work and (IMHO) can
> wait a little while longer.]

+1

> 2) Phase Two (Building) is what we currently have as core, but that outputs
> to an historical database (plus some files for those w/o huge databases). It
> will not do RDF/RSS/Atom/Notification/XHTML Presentation (or XDOCS). It will
> not do Stats (neither XHTML presentation nor internal to DBM) nor will it do
> XRef (XHTML).

+1

> 3) Phase Three (Analysis/Communication) is a whole new world; re-writing
> the 'will not do' list from above from the results database. This could be
> Python code, or Cocoon, or ...
> 
> I'd like to focus my time on (2) and request that others help with (3).

I'm game. I can take ownership of #3.

> Question: We currently run JDK1.5 and Kaffe off TRUNK not LIVE. Ought we
> change this? 

yeah, it makes sense.

> Alternatively, ought we perform this Gump work in a separate
> branch? I think I can add to the current w/o too much instability, then
> remove stuff when needed. I'm game to listen to others' opinions/concerns
> though.

Currently, Dynagump is the code name for "#3" and does not depend on any 
code from Gump (only on a common database schema).

I think we should keep it the way it is for now; we can move stuff back and 
forth later on, thanks to SVN.

> [FWIW: Personally, I'd love to get back to NAnt building except that Mono
> is still my roadblock. I think Gump 3.0 ought to be far less resource bound, and
> it ought to help us simplify running/operating Gump. As such, I hope it leads
> to more users and hence more hands to help with NAnt, etc.]

I personally would love to see Mono stuff being gumped as well, but it's 
a low priority for me ATM.

-- 
Stefano.




Re: [RT] Gump 3.0

Posted by Stefan Bodewig <bo...@apache.org>.
On Fri, 10 Dec 2004, Adam R. B. Jack <aj...@apache.org> wrote:

> [FWIW: Personally, I'd love to get back to NAnt building except
> that Mono is still my roadblock.

I still don't quite understand why it works far better on my oldish
RedHat box either.  Hmm, have we tried Mono 1.0.4 or even 1.0.5
(released today 8-) yet?

Anyway.  Once I merge my last commit to the live branch we will build
apr-util against apr and everything should be there to support
configure/make based projects (we may need env variable support).  My
next prio will be documenting the stuff so that others like Graham can
get their feet wet - and then head towards NAnt and Mono.

This is what I expect to be able to do; I'll probably never dive into
Python (lack of time - and admittedly it hasn't been fun yet, either)
deeply enough to scratch more than the surface.

Stefan



Re: [RT] Gump 3.0

Posted by "Adam R. B. Jack" <aj...@apache.org>.
Ok, here is my thinking on how we proceed towards Gump 3.0, i.e.:

    1) Metadata Gathering
    2) Processing (Build/Sync/Update)
    3) Results/Presentation/History Query/Analysis

 ------------------------------------------------------------------------
For *now* ...

1) Phase One (Metadata Gathering) is simply the way to get XML documentation
into a local file system for Gump to process. Eventually this could be
crawlers (etc.) that parse GOMs and POMs, but (for now) the CVS update &
HTTP gets are tolerable. [If anybody has an itch to tackle this first, speak
up, but I think it is a reasonable/significant amount of work and (IMHO) can
wait a little while longer.]

2) Phase Two (Building) is what we currently have as core, but that outputs
to an historical database (plus some files for those w/o huge databases). It
will not do RDF/RSS/Atom/Notification/XHTML Presentation (or XDOCS). It will
not do Stats (neither XHTML presentation nor internal to DBM) nor will it do
XRef (XHTML).

3) Phase Three (Analysis/Communication) is a whole new world; re-writing
the 'will not do' list from above from the results database. This could be
Python code, or Cocoon, or ...

I'd like to focus my time on (2) and request that others help with (3).

Question: We currently run JDK1.5 and Kaffe off TRUNK not LIVE. Ought we
change this? Alternatively, ought we perform this Gump work in a separate
branch? I think I can add to the current w/o too much instability, then
remove stuff when needed. I'm game to listen to others' opinions/concerns
though.

[FWIW: Personally, I'd love to get back to NAnt building except that Mono
is still my roadblock. I think Gump 3.0 ought to be far less resource bound, and
it ought to help us simplify running/operating Gump. As such, I hope it leads
to more users and hence more hands to help with NAnt, etc.]

regards,

Adam




Re: [RT] Gump 3.0

Posted by "Adam R. B. Jack" <aj...@apache.org>.
> 2a) SCM update
> 2b) syncing updated working copy with workspace
> 2c) building

We do actually have 2a and 2c already, in bin/build.py and bin/update.py;
they just never got the usage/fixing they might need.

regards

Adam




Re: [RT] Gump 3.0

Posted by Stefan Bodewig <bo...@apache.org>.
On Wed, 08 Dec 2004, Stefan Bodewig <bo...@apache.org> wrote:
> On Mon, 06 Dec 2004, Stefano Mazzocchi <st...@apache.org> wrote:
> 
>> So, here is my first suggestion: split gump in three stages.
>> 
>>   1) metadata aggregation
>>   2) build
>>   3) build data use
> 
> Sounds good.

One additional thing.

I'd love to have part 2 separated into at least three steps that can
get invoked individually (a rough sketch follows the list):

2a) SCM update
2b) syncing updated working copy with workspace
2c) building
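
Just to illustrate what I mean, something like this (the function names
are placeholders, not existing Gump entry points):

    import argparse

    def scm_update(project):
        print("would run cvs/svn update for", project)

    def sync_workspace(project):
        print("would sync the updated working copy of", project, "into the workspace")

    def build(project):
        print("would build", project, "(and nothing else)")

    def main():
        parser = argparse.ArgumentParser(prog="gump")
        parser.add_argument("step", choices=["update", "sync", "build"])
        parser.add_argument("project")
        args = parser.parse_args()
        {"update": scm_update,
         "sync": sync_workspace,
         "build": build}[args.step](args.project)

    if __name__ == "__main__":
        main()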

With "traditional Gump" it has been possible to modify classes in the
workspace and rebuild using Gump.  This has been very useful in
resolving Gump problems in the past.  Right now I don't see an easy
way to do this.

For example, I "fixed" the commons-jelly-tags-ant build by patching
the jelly-util taglib.  I verified it would fix the Gump build by
applying my patch locally and only building commons-jelly-tags-util
and after that commons-jelly-tags-ant.

Using current Gump my local patch would have been blown away by CVS
updates or syncs - unless I applied it in what is supposed to be a
clean checkout and disconnected from the network.

Also, just building commons-jelly-tags-util and commons-jelly-tags-ant
without rebuilding Ant and all that seems to be impossible right now
(I may be wrong, though).

Stefan



Re: [RT] Gump 3.0

Posted by Stefan Bodewig <bo...@apache.org>.
On Mon, 06 Dec 2004, Stefano Mazzocchi <st...@apache.org> wrote:

> So, here is my first suggestion: split gump in three stages.
> 
>   1) metadata aggregation
>   2) build
>   3) build data use

Sounds good.

> We should be maintaining the metadata representation only for the
> projects that don't have that data integrated in their build system
> (like pure ant projects or make/configure projects).

Even the latter may have them in some form, like RPM spec files; it may
be worth looking into them (some time later) as well.

Stefan



Re: [RT] Gump 3.0

Posted by David Crossley <cr...@apache.org>.
Stefano Mazzocchi wrote:
> I've been working for a while to describe an improved architecture for 
> Gump and I have decided to "go public" with the discussion because I 
> want this to be a community effort.

It was great to participate with you and Leo IRL at ApacheCon
over some of this - another aspect of the community effort.

Thanks for going the next step.

[snip]
> 
> Comments?

Stefano hits the nails on the head again.

--David




Re: [RT] Gump 3.0

Posted by "Adam R. B. Jack" <aj...@apache.org>.
> Comments?

This says it all for me:

    > The biggest problem I have is the fact that gump is such an integrated
    > system: it tries to do too much in one single stage.

I don't mind if the "contract"/communication between phases is some RDF
store, or database, or whatever, but I do want to have this separation. We
also need to ensure that (this time) we have the commandline run (Random Joe
running Gump) figured out. It needs to be as easy to do each/any stage
manually as Sam used to find it. Smaller steps might just make that easier
to achieve.

regards,

Adam




Re: [RT] Gump 3.0

Posted by Leo Simons <ls...@jicarilla.org>.
Stefano Mazzocchi wrote:
> Comments?

Not really. Most of it sounds obvious by now, actually :-D

More images related to this architecture are at:

	http://svn.apache.org/repos/asf/gump/trunk/src/xdocs/gump.pdf

though I'm afraid some of the comments in the gump.ppt alongside there 
didn't make it into the PDF.

I'll also point out that your RT (probably on purpose) leaves out a 
*lot* of talk about (lifting) social limitations. The fun bit about the 
thinking there is that it tends to span all those stages and databases. 
That really needs to be written down as well at some point so some of 
the design decisions make more sense :-D

Finally I'll point out (just to keep this e-mail short, really - there's 
a lot to say) one other thing to realize: this DB-based architecture 
will help us move away from the batch-based approach we have right now.


- LSD
