Posted to infrastructure-dev@apache.org by Jukka Zitting <ju...@gmail.com> on 2008/03/01 13:06:11 UTC

[scm] Use case: Continuous integration

Hi,

Continuous integration tools would probably be worth a whole topic of
their own, but since they are related to version control I'm taking
them up within this scope as well.

Use case: Someone (either within or outside Apache) sets up a
continuous integration system and wants to get the latest sources from
the source repository. Optimally the system would automatically
compile, package, and test the sources after each commit, but hourly,
daily, or weekly builds would also be acceptable depending on the
scope of the tests and available computing resources.

Variants and implementation options:

1) Push-based CI: The SCM system would notify the CI system of all
source changes so that the system can start processing the changes as
soon as possible. Currently the best way to achieve this would
probably be to subscribe the CI system to the relevant -commits
mailing list, but other alternatives might also be possible. Assuming
there are enough computing resources either at Apache or an external
CI lab for such potentially high-frequency integrations, would the
associated load on our Subversion server be acceptable? If not, how
could we resolve such issues?
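For illustration, such a -commits subscriber could be a small script along these lines. This is only a sketch: the subject format ("svn commit: r1234 - /path") and the trigger_build() hook are assumptions, not anything the lists actually guarantee.

```python
# Sketch of a push-based trigger fed by -commits mail. The assumed subject
# format is "svn commit: rNNNN - /path"; trigger_build() is a hypothetical
# hook into whatever CI system consumes the notification.
import re
from email.parser import Parser
from typing import Optional

_REVISION_RE = re.compile(r"\br(\d+)\b")

def revision_from_commit_mail(raw_message: str) -> Optional[int]:
    """Pull the committed revision number out of a raw commit e-mail."""
    subject = Parser().parsestr(raw_message, headersonly=True).get("Subject", "")
    match = _REVISION_RE.search(subject or "")
    return int(match.group(1)) if match else None

def on_commit_mail(raw_message: str, trigger_build) -> None:
    """Hand each notification straight to the CI system's build trigger."""
    revision = revision_from_commit_mail(raw_message)
    if revision is not None:
        trigger_build(revision)
```

The point is that the CI system reacts within seconds of the mail arriving, without ever polling the Subversion server itself.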

2) Pull-based CI: The CI system regularly polls the source repository
for new changes. Again, assuming enough CI computing resources, what
would be the smallest acceptable polling interval from the perspective
of our Subversion server. Are there other considerations that such CI
systems should be aware of when accessing the source repository?
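For illustration, a polite poller might check the server's youngest revision before doing any real work. A sketch, assuming the svn command-line client is available and trigger_build is a hypothetical hook:

```python
# Poll the server's youngest revision with `svn info`, which is far cheaper
# for both ends than running a speculative checkout or update every cycle.
import re
import subprocess
import time

def parse_revision(svn_info_output: str) -> int:
    """Extract the "Revision: N" line from plain `svn info` output."""
    return int(re.search(r"^Revision: (\d+)$", svn_info_output,
                         re.MULTILINE).group(1))

def head_revision(repo_url: str) -> int:
    out = subprocess.run(["svn", "info", repo_url],
                         capture_output=True, text=True, check=True).stdout
    return parse_revision(out)

def poll_forever(repo_url: str, trigger_build, interval_seconds=3600):
    """Only start a build (and touch the working copy) when something changed."""
    last_built = head_revision(repo_url)
    while True:
        time.sleep(interval_seconds)
        head = head_revision(repo_url)
        if head > last_built:
            trigger_build(head)
            last_built = head
```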

Related to this and some of the other raised issues, would it be a
good idea to consider one or more read-only mirrors of our Subversion
repository. I'm not sure how feasible such mirroring would be with
current Subversion, but in the long run something like that seems more
scalable and fault-tolerant than upgrading a single svn server.

BR,

Jukka Zitting

Re: [scm] Use case: Continuous integration

Posted by Erik Abele <er...@codefaktor.de>.
On 01.03.2008, at 13:06, Jukka Zitting wrote:

> ...
> Related to this and some of the other raised issues, would it be a
> good idea to consider one or more read-only mirrors of our Subversion
> repository. I'm not sure how feasible such mirroring would be with
> current Subversion, but in the long run something like that seems more
> scalable and fault-tolerant than upgrading a single svn server.

This is planned for the very near future; I'd guess around the middle
of this year (needs some more h/w as well as svn 1.5)... though it'll
even be a read/write mirror ;)

Cheers,
Erik


Re: [scm] Use case: Continuous integration

Posted by Santiago Gala <sa...@gmail.com>.
El sáb, 01-03-2008 a las 16:03 +0200, Jukka Zitting escribió:
> Hi,
> 
> On Sat, Mar 1, 2008 at 3:52 PM, Santiago Gala <sa...@gmail.com> wrote:
> >  El sáb, 01-03-2008 a las 14:06 +0200, Jukka Zitting escribió:
> >  > 1) Push-based CI: [...]
> >
> >  There is currently a trend to have feeds associated with SCM servers.
> 
> Feeds are certainly useful, but in this case I don't see how they
> differ that much from doing an "svn update". You're right in that
> feeds are probably much easier to cache, but it's still basically a
> pull operation.
> 

If the traffic is big enough you are trading *one* computation of the
feed per commit against N polls for status -u or update. Not to mention
that waiting for a server response is more frustrating than having a
news reader tell you about commits.

> >  > there are enough computing resources either at Apache or an external
> >  > CI lab for such potentially high-frequency integrations, would the
> >  > associated load on our Subversion server be acceptable? If not, how
> >  > could we resolve such issues?
> >
> >  I'm not sure about the issue. Do you mean that the CI server would do
> >  checkouts and delete the data for each run? If this is the case, and
> >  this is the only case where I can see a potential high load, I'd say the
> >  alternative of having some sort of incrementally updated, clean,
> >  sort-of-distributed repository to pull from seems reasonable.
> 
> I recall complaints about some external CI systems putting too much
> load on our Subversion server. I'm not sure what the exact nature of
> the load is, but it seems clear that some guidelines would be useful.
> 
> BR,
> 
> Jukka Zitting
-- 
Santiago Gala
http://memojo.com/~sgala/blog/


Re: [scm] Use case: Continuous integration

Posted by Santiago Gala <sa...@gmail.com>.
El lun, 03-03-2008 a las 14:20 +0000, Steve Loughran escribió:

(...)
> 
> -Steve
> 
> (*) Trivia note, have you noticed that the code search tools focus on
> the code, not the commits. I don't need a search tool for code, I have
> an IDE for that - what I like to do is know what the person writing it
> was thinking at the time.
> 

This is a substantial part of what I meant with the last use case I sent
(code study/audit). A lot of the time I'm interested in searching and
browsing all commits that contained a string or touched a path (two
standard features in gitk), or in seeing how the activity of a code base
evolves over time.

It is very addictive once you get used to it. And given that a git
repository, including the whole history, is typically almost half the
size of a subversion working copy, it is economical too.

Regards

-- 
Santiago Gala
http://memojo.com/~sgala/blog/


Re: [scm] Use case: Continuous integration

Posted by Steve Loughran <st...@apache.org>.
Jukka Zitting wrote:
> Hi,

> Also, for the record when evaluating push vs. pull models: currently the
> entire ASF generates an average of about 15 commits an hour (much more
> at peak times). A normal project or codebase probably sees at most a
> commit or two per day on average.

Depends on your process. Here's how I commit @work

http://www.ohloh.net/projects/5150/contributors/22119081577892

-this is full-time work, not spare-time feature creep.
-we're allowed to do big refactorings, though it means all 20+ child
projects are loaded in an IDE with 1+GB of heap
-no merging of the day's work; every defect should have a limited set of
JIRA tags
-our CI server is running builds with functional tests every 30 minutes
-you are allowed to break the build, but you get to fix it or take that
test offline until you do

With a fast CI server you can get away with commit-and-wait-for-CI-news,
rather than the stricter policy of test-before-you-commit. Certainly I
make sure the tests of the bit I'm working on pass, but since it takes
10+ minutes to build and test, I delegate that wait to the CI server.

As a result, we're always changing and committing things. If you hold
back to end-of-day commits, you've reverted to nightly builds.

JIRA is another point; JIRA polls the repo pretty regularly too. I'd
hate to see what the SourceForge load is there, and I'd hate to lose
that JIRA integration, which is lovely for discovering why a file was
changed, and what is related(*).

1. How about having the SVN tools issue XMPP notifications rather than
any non-standard subscribe/notify protocol?

2. Ohloh does a commit log: http://www.ohloh.net/projects/5150/commits
  It would be nice to have something like that with links to JIRA defects.


-Steve

(*) Trivia note, have you noticed that the code search tools focus on
the code, not the commits. I don't need a search tool for code, I have an
IDE for that - what I like to do is know what the person writing it was
thinking at the time.

-- 
Steve Loughran                  http://www.1060.org/blogxter/publish/5
Author: Ant in Action           http://antbook.org/

Re: [scm] Use case: Continuous integration

Posted by Justin Erenkrantz <ju...@erenkrantz.com>.
On Mon, Mar 3, 2008 at 4:44 AM, Jukka Zitting <ju...@gmail.com> wrote:
>  I haven't looked at how "svn update" is currently implemented, but
>  AFAIUI there should be no inherent reason why the operation could not
>  be as cache-friendly as a feed request.

'svn info' against a URL is pretty efficient on both ends.  ('svn info
--xml' for double bonus points, I guess.)
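A sketch of such a check, assuming the svn client is on the PATH (this is illustrative, not anything Continuum-specific):

```python
# Compare the repository's last-changed revision against the last built
# revision, without ever crawling a local working copy.
import subprocess
import xml.etree.ElementTree as ET

def last_changed_revision(info_xml: str) -> int:
    """Read the <commit revision="..."> attribute from `svn info --xml` output."""
    return int(ET.fromstring(info_xml).find(".//commit").get("revision"))

def needs_build(repo_url: str, last_built: int) -> bool:
    xml_out = subprocess.run(["svn", "info", "--xml", repo_url],
                             capture_output=True, text=True, check=True).stdout
    return last_changed_revision(xml_out) > last_built
```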

However, 'svn update' *always* requires a local working-copy crawl -
so it's highly impacted by the local disk speed.  On our CI boxes, we
see most of the time spent doing the WC crawl in the update.  So,
*local* disk speed becomes a critical factor.  (When SVN moves to
non-severable WCs with centralized metadata, then update will be a
little faster, but may still require some form of WC crawl.)

I talked with Brett about this in the last few days and I think he's
hoping to implement a 'svn info' check in Continuum soon.  -- justin

Re: [scm] Use case: Continuous integration

Posted by Jukka Zitting <ju...@gmail.com>.
Hi,

On Mon, Mar 3, 2008 at 1:25 PM, Steve Loughran <st...@apache.org> wrote:
> Jukka Zitting wrote:
>  > Feeds are certainly useful, but in this case I don't see how they
>  > differ that much from doing an "svn update". You're right in that
>  > feeds are probably much easier to cache, but it's still basically a
>  > pull operation.
>
>  The difference is probably the load needed to generate the status. A
>  feed you GET with an ETag, and a not-modified response says your repo is
>  up to date. If you are running HTTP requests from CI servers in a
>  corporate system, the proxy server can cache the response. Now, when the
>  tree has changed, the update still has a cost.

I haven't looked at how "svn update" is currently implemented, but
AFAIUI there should be no inherent reason why the operation could not
be as cache-friendly as a feed request.

Optimizing svn update or providing feeds both seem like good ways to
reduce the load generated by pull-based CI systems.

Also, for the record when evaluating push vs. pull models: currently the
entire ASF generates an average of about 15 commits an hour (much more
at peak times). A normal project or codebase probably sees at most a
commit or two per day on average.

BR,

Jukka Zitting

Re: [scm] Use case: Continuous integration

Posted by Steve Loughran <st...@apache.org>.
Jukka Zitting wrote:
> Hi,
> 
> On Sat, Mar 1, 2008 at 3:52 PM, Santiago Gala <sa...@gmail.com> wrote:
>>  El sáb, 01-03-2008 a las 14:06 +0200, Jukka Zitting escribió:
>>  > 1) Push-based CI: [...]
>>
>>  There is currently a trend to have feeds associated with SCM servers.
> 
> Feeds are certainly useful, but in this case I don't see how they
> differ that much from doing an "svn update". You're right in that
> feeds are probably much easier to cache, but it's still basically a
> pull operation.

The difference is probably the load needed to generate the status. A
feed you GET with an ETag, and a not-modified response says your repo is
up to date. If you are running HTTP requests from CI servers in a
corporate system, the proxy server can cache the response. Now, when the
tree has changed, the update still has a cost.
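In code, that conditional GET might look roughly like this (a sketch using plain urllib; the feed URL and caller are whatever the CI system provides):

```python
# Conditional GET against a commit feed: an unchanged feed costs the server
# (or an intermediate proxy) only a 304 Not-Modified, not a feed rebuild.
import urllib.error
import urllib.request
from typing import Optional, Tuple

def conditional_headers(etag: Optional[str]) -> dict:
    """Replay the ETag from the previous fetch, if we have one."""
    return {"If-None-Match": etag} if etag else {}

def fetch_feed(url: str,
               etag: Optional[str]) -> Tuple[Optional[bytes], Optional[str]]:
    """Return (body, new_etag), or (None, old_etag) when nothing changed."""
    request = urllib.request.Request(url, headers=conditional_headers(etag))
    try:
        with urllib.request.urlopen(request) as response:
            return response.read(), response.headers.get("ETag")
    except urllib.error.HTTPError as error:
        if error.code == 304:        # not modified: the cheap, cacheable path
            return None, etag
        raise
```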


> 
>>  > there are enough computing resources either at Apache or an external
>>  > CI lab for such potentially high-frequency integrations, would the
>>  > associated load on our Subversion server be acceptable? If not, how
>>  > could we resolve such issues?
>>
>>  I'm not sure about the issue. Do you mean that the CI server would do
>>  checkouts and delete the data for each run? If this is the case, and
>>  this is the only case where I can see a potential high load, I'd say the
>>  alternative of having some sort of incrementally updated, clean,
>>  sort-of-distributed repository to pull from seems reasonable.
> 
> I recall complaints about some external CI systems putting too much
> load on our Subversion server. I'm not sure what the exact nature of
> the load is, but it seems clear that some guidelines would be useful.

Intel doing Harmony builds. Part of the problem is that the polling
puts a lot of CPU load on the server. I think CruiseControl also polled
to look for stability before a build.




-- 
Steve Loughran                  http://www.1060.org/blogxter/publish/5
Author: Ant in Action           http://antbook.org/

Re: [scm] Use case: Continuous integration

Posted by Jukka Zitting <ju...@gmail.com>.
Hi,

On Sat, Mar 1, 2008 at 3:52 PM, Santiago Gala <sa...@gmail.com> wrote:
>  El sáb, 01-03-2008 a las 14:06 +0200, Jukka Zitting escribió:
>  > 1) Push-based CI: [...]
>
>  There is currently a trend to have feeds associated with SCM servers.

Feeds are certainly useful, but in this case I don't see how they
differ that much from doing an "svn update". You're right in that
feeds are probably much easier to cache, but it's still basically a
pull operation.

>  > there are enough computing resources either at Apache or an external
>  > CI lab for such potentially high-frequency integrations, would the
>  > associated load on our Subversion server be acceptable? If not, how
>  > could we resolve such issues?
>
>  I'm not sure about the issue. Do you mean that the CI server would do
>  checkouts and delete the data for each run? If this is the case, and
>  this is the only case where I can see a potential high load, I'd say the
>  alternative of having some sort of incrementally updated, clean,
>  sort-of-distributed repository to pull from seems reasonable.

I recall complaints about some external CI systems putting too much
load on our Subversion server. I'm not sure what the exact nature of
the load is, but it seems clear that some guidelines would be useful.

BR,

Jukka Zitting

Re: [scm] Use case: Continuous integration

Posted by Santiago Gala <sa...@gmail.com>.
El sáb, 01-03-2008 a las 14:06 +0200, Jukka Zitting escribió:
> Hi,
> 
> Continuous integration tools would probably be worth a whole topic of
> their own, but since they are related to version control I'm taking
> them up within this scope as well.
> 
> Use case: Someone (either within or outside Apache) sets up a
> continuous integration system and wants to get the latest sources from
> the source repository. Optimally the system would automatically
> compile, package, and test the sources after each commit, but hourly,
> daily, or weekly builds would also be acceptable depending on the
> scope of the tests and available computing resources.
> 
> Variants and implementation options:
> 
> 1) Push-based CI: The SCM system would notify the CI system of all
> source changes so that the system can start processing the changes as
> soon as possible. Currently the best way to achieve this would
> probably be to subscribe the CI system to the relevant -commits
> mailing list, but other alternatives might also be possible. Assuming

There is currently a trend to have feeds associated with SCM servers.
This is fully included in gitweb and "hg serve", and requires a script
(bzr-feed, http://bzr.mfd-consult.dk/bzr-feed-global.atom ) for bzr. In
subversion I have used the free service at
http://subtlety.errtheblog.com/ to get something like the ASF public
repo feed: http://subtlety.errtheblog.com/O_o/2d.xml This is useful not
only for continuous integration, but also for better monitoring of
changes; see how Sam Ruby has integrated a planet (Venus) of repository
feeds in Planet Intertwingly, for instance. It would be great if our
subversion server could do that "cheaply", via
http://subversion.tigris.org/servlets/ReadMsg?listName=dev&msgNo=117974
or contrib/hook-scripts/svn2rss.py, and integrate this feed serving with
feed self-discovery or at a separate URL. Publicizing it could save a
fair amount of load on the repository.

> there are enough computing resources either at Apache or an external
> CI lab for such potentially high-frequency integrations, would the
> associated load on our Subversion server be acceptable? If not, how
> could we resolve such issues?

I'm not sure about the issue. Do you mean that the CI server would do
checkouts and delete the data for each run? If this is the case, and
this is the only case where I can see a potential high load, I'd say the
alternative of having some sort of incrementally updated, clean,
sort-of-distributed repository to pull from seems reasonable.

> 
> 2) Pull-based CI: The CI system regularly polls the source repository
> for new changes. Again, assuming enough CI computing resources, what
> would be the smallest acceptable polling interval from the perspective
> of our Subversion server. Are there other considerations that such CI
> systems should be aware of when accessing the source repository?
> 

Feeds are "polling", actually, or push via polling. We could publish our
own, static and with generation triggered from commit emails, so that
people can subscribe to it instead of polling the repository themselves.
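Generating such a static feed is cheap; a minimal sketch follows. The entry fields here are illustrative, not the format of svn2rss.py or any existing hook:

```python
# Render a batch of commits as a minimal static Atom feed that a plain web
# server can then hand out (and proxies can cache) with no svn involvement.
from xml.sax.saxutils import escape

def commits_to_atom(feed_title: str, commits) -> str:
    """commits: iterable of (revision, author, iso_date, log_message) tuples."""
    entries = []
    for revision, author, date, message in commits:
        entries.append(
            "<entry>"
            f"<title>r{revision}: {escape(message)}</title>"
            f"<id>urn:svn-revision:{revision}</id>"
            f"<updated>{date}</updated>"
            f"<author><name>{escape(author)}</name></author>"
            "</entry>")
    return ('<?xml version="1.0" encoding="utf-8"?>\n'
            '<feed xmlns="http://www.w3.org/2005/Atom">'
            f"<title>{escape(feed_title)}</title>"
            + "".join(entries) + "</feed>")
```

A commit hook (or a subscriber on the -commits list) would append the new entry and rewrite the flat file; everything downstream is plain static HTTP.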

> Related to this and some of the other raised issues, would it be a
> good idea to consider one or more read-only mirrors of our Subversion
> repository. I'm not sure how feasible such mirroring would be with
> current Subversion, but in the long run something like that seems more
> scalable and fault-tolerant than upgrading a single svn server.
> 
> BR,
> 
> Jukka Zitting
-- 
Santiago Gala
http://memojo.com/~sgala/blog/


Re: [scm] Use case: Continuous integration

Posted by sebb <se...@gmail.com>.
On 02/03/2008, Jukka Zitting <ju...@gmail.com> wrote:
> Hi,
>
>
>  On Sun, Mar 2, 2008 at 12:15 AM, sebb <se...@gmail.com> wrote:
>  >  Not entirely sure why one should care about temporary build failures,
>  >  unless one is trying to find out which committers are not adhering to
>  >  the rules.
>
>
> In many cases developer time is much more valuable than computing time,
>  so a CI system that can quickly and accurately point out a build
>  failure can easily save time and improve productivity.

Surely that is only of use if the build failure has not already been
resolved by a later commit?  For example, it's quite easy to
accidentally omit a new file from a commit; this will normally be
spotted quickly afterwards, and then the missing file can be committed.

Not ideal of course if someone else happens to update in between, but
not really a big problem either as it will be immediately obvious when
the CI system is checked and the later build is OK.

>  That may not be a key priority for the ASF itself, but it can be quite
>  valuable to many companies that have people actively working on Apache
>  projects. If they provide the required CI machinery, what can we do on
>  the SCM side to enable optimum usage of such resources?
>
>
>  >  Also, it seems to me that this may not scale well - the average time
>  >  between commits can easily exceed the time to perform a build.
>
>
> A reasonable system would of course queue the change notifications and
>  process them one at a time. The average commit frequency will hardly
>  overload a decent CI server, and any backlog acquired during commit
>  peaks will probably be soon resolved.
>
>  This is probably also something that should be taken into account with
>  frequently running pull-based CI servers as well. At least I've seen
>  complaints about multiple CI builds stacking up when the build time
>  exceeds the poll frequency.

Or exceeds the interval between pushed commit messages ...

>  BR,
>
>
>  Jukka Zitting
>

Re: [scm] Use case: Continuous integration

Posted by Jukka Zitting <ju...@gmail.com>.
Hi,

On Sun, Mar 2, 2008 at 12:15 AM, sebb <se...@gmail.com> wrote:
>  Not entirely sure why one should care about temporary build failures,
>  unless one is trying to find out which committers are not adhering to
>  the rules.

In many cases developer time is much more valuable than computing time,
so a CI system that can quickly and accurately point out a build
failure can easily save time and improve productivity.

That may not be a key priority for the ASF itself, but it can be quite
valuable to many companies that have people actively working on Apache
projects. If they provide the required CI machinery, what can we do on
the SCM side to enable optimum usage of such resources?

>  Also, it seems to me that this may not scale well - the average time
>  between commits can easily exceed the time to perform a build.

A reasonable system would of course queue the change notifications and
process them one at a time. The average commit frequency will hardly
overload a decent CI server, and any backlog acquired during commit
peaks will probably be soon resolved.
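One way to sketch that queueing, with backlog collapsing so that ten stacked-up notifications become a single build of the newest revision (run_build is a hypothetical hook into the CI system):

```python
# A build worker that drains its notification queue before each build, so a
# backlog accumulated during a commit peak collapses into one build.
import queue

def drain_to_latest(notifications: queue.Queue, first: int) -> int:
    """Fold everything currently queued into the newest revision number."""
    latest = first
    while True:
        try:
            latest = max(latest, notifications.get_nowait())
        except queue.Empty:
            return latest

def worker(notifications: queue.Queue, run_build) -> None:
    while True:
        revision = notifications.get()          # block until something arrives
        revision = drain_to_latest(notifications, revision)
        run_build(revision)                     # one build covers the backlog
```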

This is probably also something that should be taken into account with
frequently running pull-based CI servers as well. At least I've seen
complaints about multiple CI builds stacking up when the build time
exceeds the poll frequency.

BR,

Jukka Zitting

Re: [scm] Use case: Continuous integration

Posted by Steve Loughran <st...@apache.org>.
sebb wrote:
> On 01/03/2008, Jukka Zitting <ju...@gmail.com> wrote:

>>
>>  >  If there are 10 commits in as many minutes, does the CI system really
>>  >  need to build each one?
>>
>>
>> It would IMHO be very useful to have a clear indication of which one of
>>  those 10 commits actually broke the build.
> 
> Not entirely sure why one should care about temporary build failures,
> unless one is trying to find out which committers are not adhering to
> the rules.

What you could do is run intermittent builds, but when one fails, do a
binary search over all changes since the last good build to work out
which commit broke it.
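The binary search itself is only a few lines; a sketch where build_passes(rev) is a hypothetical hook that checks out revision rev and runs the build and tests:

```python
# Bisect the revision range between the last known-good build and the first
# known-bad one. Assumes the breakage is monotonic: once broken, it stays
# broken until someone fixes it.
def first_bad_revision(good: int, bad: int, build_passes) -> int:
    """Return the earliest revision in (good, bad] whose build fails."""
    while bad - good > 1:
        mid = (good + bad) // 2
        if build_passes(mid):
            good = mid          # still fine here, the breakage is later
        else:
            bad = mid           # already broken here, the breakage is earlier
    return bad
```

With ten commits between builds this needs only three or four extra build-and-test runs instead of ten.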


I'm also exploring some ideas of raising the notion of building to a new 
level, as sketched out in "The Future of Build Tools"

http://docs.google.com/Doc?id=dtrrw53_3wqsb9s

Imagine every developer having a CI tool running against their local 
code (or private branch) all the time. You save something, and after a 
bit of idleness the build and tests run. If you are writing dynamic web
sites, the tests could run all the time; you'd just hit reload on the
results page (or the reload could trigger the rebuild).

This works best with scripting languages, where there is no compile
step, just testing. I'm targeting Erlang first of all, as I'm pretty
unimpressed with the current Erlang build process.


-- 
Steve Loughran                  http://www.1060.org/blogxter/publish/5
Author: Ant in Action           http://antbook.org/

Re: [scm] Use case: Continuous integration

Posted by sebb <se...@gmail.com>.
On 01/03/2008, Jukka Zitting <ju...@gmail.com> wrote:
> Hi,
>
>
>  On Sat, Mar 1, 2008 at 3:40 PM, sebb <se...@gmail.com> wrote:
>  > On 01/03/2008, Jukka Zitting <ju...@gmail.com> wrote:
>
> >  >  Continuous integration tools would probably be worth a whole topic of
>  >  >  their own, but since they are related to version control I'm taking
>  >  >  them up within this scope as well.
>  >  >
>  >  >  Use case: Someone (either within or outside Apache) sets up a
>  >  >  continuous integration system and wants to get the latest sources from
>  >  >  the source repository. Optimally the system would automatically
>  >  >  compile, package, and test the sources after each commit, but hourly,
>  >
>  >  Related changes are not always packaged into a single commit.
>  >
>  >  Sometimes it is easier to use several commits; though hopefully each
>  >  one will be self-contained, i.e. will not break the build.
>
>
> Sure, but that's IMHO related to the sequence of changes use case, and
>  from a purist perspective such changes would probably be best handled
>  through a short-lived development branch. And ideally such changes
>  could then be merged back to trunk as an atomic commit that still
>  preserves the full incremental change history.
>
>  The exact commit and consistency rules are of course up to each
>  project to decide for themselves, but there are a number of projects
>  with a policy that no commit should break the build. The more
>  frequently the CI system runs, the less chance there is for another
>  developer to stumble on a broken build.
>
>
>  >  If there are 10 commits in as many minutes, does the CI system really
>  >  need to build each one?
>
>
> It would IMHO be very useful to have a clear indication of which one of
>  those 10 commits actually broke the build.

Not entirely sure why one should care about temporary build failures,
unless one is trying to find out which committers are not adhering to
the rules.

Also, it seems to me that this may not scale well - the average time
between commits can easily exceed the time to perform a build.

>  BR,
>
>
>  Jukka Zitting
>

Re: [scm] Use case: Continuous integration

Posted by Jukka Zitting <ju...@gmail.com>.
Hi,

On Sat, Mar 1, 2008 at 3:40 PM, sebb <se...@gmail.com> wrote:
> On 01/03/2008, Jukka Zitting <ju...@gmail.com> wrote:
>  >  Continuous integration tools would probably be worth a whole topic of
>  >  their own, but since they are related to version control I'm taking
>  >  them up within this scope as well.
>  >
>  >  Use case: Someone (either within or outside Apache) sets up a
>  >  continuous integration system and wants to get the latest sources from
>  >  the source repository. Optimally the system would automatically
>  >  compile, package, and test the sources after each commit, but hourly,
>
>  Related changes are not always packaged into a single commit.
>
>  Sometimes it is easier to use several commits; though hopefully each
>  one will be self-contained, i.e. will not break the build.

Sure, but that's IMHO related to the sequence of changes use case, and
from a purist perspective such changes would probably be best handled
through a short-lived development branch. And ideally such changes
could then be merged back to trunk as an atomic commit that still
preserves the full incremental change history.

The exact commit and consistency rules are of course up to each
project to decide for themselves, but there are a number of projects
with a policy that no commit should break the build. The more
frequently the CI system runs, the less chance there is for another
developer to stumble on a broken build.

>  If there are 10 commits in as many minutes, does the CI system really
>  need to build each one?

It would IMHO be very useful to have a clear indication of which one of
those 10 commits actually broke the build.

BR,

Jukka Zitting

Re: [scm] Use case: Continuous integration

Posted by sebb <se...@gmail.com>.
On 01/03/2008, Jukka Zitting <ju...@gmail.com> wrote:
> Hi,
>
>  Continuous integration tools would probably be worth a whole topic of
>  their own, but since they are related to version control I'm taking
>  them up within this scope as well.
>
>  Use case: Someone (either within or outside Apache) sets up a
>  continuous integration system and wants to get the latest sources from
>  the source repository. Optimally the system would automatically
>  compile, package, and test the sources after each commit, but hourly,

Related changes are not always packaged into a single commit.

Sometimes it is easier to use several commits; though hopefully each
one will be self-contained, i.e. will not break the build.

If there are 10 commits in as many minutes, does the CI system really
need to build each one?

>  daily, or weekly builds would also be acceptable depending on the
>  scope of the tests and available computing resources.
>
>  Variants and implementation options:
>
>  1) Push-based CI: The SCM system would notify the CI system of all
>  source changes so that the system can start processing the changes as
>  soon as possible. Currently the best way to achieve this would
>  probably be to subscribe the CI system to the relevant -commits
>  mailing list, but other alternatives might also be possible. Assuming
>  there are enough computing resources either at Apache or an external
>  CI lab for such potentially high-frequency integrations, would the
>  associated load on our Subversion server be acceptable? If not, how
>  could we resolve such issues?
>
>  2) Pull-based CI: The CI system regularly polls the source repository
>  for new changes. Again, assuming enough CI computing resources, what
>  would be the smallest acceptable polling interval from the perspective
>  of our Subversion server. Are there other considerations that such CI
>  systems should be aware of when accessing the source repository?
>
>  Related to this and some of the other raised issues, would it be a
>  good idea to consider one or more read-only mirrors of our Subversion
>  repository. I'm not sure how feasible such mirroring would be with
>  current Subversion, but in the long run something like that seems more
>  scalable and fault-tolerant than upgrading a single svn server.
>
>  BR,
>
>
>  Jukka Zitting
>