Posted to dev@subversion.apache.org by "Eric S. Raymond" <es...@thyrsus.com> on 2009/10/07 22:17:28 UTC

Three common failings in project hosting systems

One of the consequences of the berlios.de crash this weekend is
svncutter. Another is the following rant, revisiting some issues that
have been bothering me ever since I was told I had inadvertently
influenced the design of the original SourceForge ten years ago.  I'm
posting it here because of David Glasser's last reply to me; I think
the relevance will be clear.

===========================================================================

=== Data Jails ===

The worst problem with almost all current hosting sites is that
they're data jails. You can put data (the source code revision
history, mailing list address lists, bug reports) into them, but
getting a complete snapshot of that data back out often ranges from
painful to impossible.

Why is this an issue? Very practically, because hosting sites, even
well-established ones, sometimes go off the air. Any prudent project
lead should be thinking about how to recover if that happens, and how
to take periodic backups of critical project data.  But more generally,
it's *your data*.  You should own it.  If you can't push a button and
get a snapshot of your project state out of the site whenever you
want, you *don't* own it.

When berlios.de crashed on me, I was lucky: I had been preparing to
migrate GPSD off the site due to deteriorating performance, so I had
a Subversion dump file that was less than two weeks old.  I was
able to bring that up to date by translating commits from an
unofficial git mirror. I was doubly lucky in that the Mailman
administrative pages remained accessible even when the project
webspace and repositories had been 404 for two days.

But actually retrieving my mailing-list data was a hideous process
that involved screen-scraping HTML by hand, and I had no hope at all
of retrieving the bug tracker state.

This anecdote illustrates the most serious manifestations of the
data-jail problem.  Third-generation version-control systems (hg, git,
bzr, etc.) pretty much solve it for code repositories; every
checkout is a mirror.  But most projects have two other critical data
collections: their mailing-list state and their bug-tracker state.
And, on all sites I know of in late 2009, those are seriously jailed.

This is a problem that goes straight to the design of the software
subsystems used by these sites.  Some are generic: of these, the most
frequent single offender is 2.x versions of Mailman, the most widely
used mailing-list manager (the Mailman maintainers claim to have fixed
this in 3.0). Bug-trackers tend to be tightly tied to individual
hosting engines, and are even harder to dig data out of.  They also
illustrate the second major failing...

=== Unscriptability ===

All hosting-site suites are Web-centric, operated primarily or
entirely through a browser.  This solves many problems, but creates a
few as well.  One is that browsers, like GUIs in general, are badly
suited for stereotyped and repetitive tasks.  Another is that they
have poor accessibility for people with visual or motor-control
issues.

Here again the issues with version-control systems are relatively
minor, because all those in common use are driven by CLI tools 
that are easy to script.  Mailing lists don't present serious issues
either; the only operation on them that normally goes through the web 
is moderation of submissions, and the demands of that operation are
fairly well matched to a browser-style interface. 

But there are other common operations that need to be scriptable and
are generally not. A representative one is getting a list of open bug
reports to work on later - say, somewhere that your net connection is
spotty.  There is no reason this couldn't be handled by an email
autoresponder robot connected to the bug-tracker database, a feature
which would also improve tracker accessibility for the blind.
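
A minimal Python sketch of such an autoresponder, for illustration
only.  The `bugs` table, its columns, and the robot address are
assumptions made up for this example; no real tracker's schema is
implied.

```python
import sqlite3
from email.message import EmailMessage


def open_bugs_reply(conn, request, robot_addr="bugs-robot@example.org"):
    """Answer a request email with a plain-text list of open bugs.

    `conn` is a connection to a hypothetical tracker database with a
    table bugs(id, summary, status).
    """
    rows = conn.execute(
        "SELECT id, summary FROM bugs WHERE status = 'open' ORDER BY id"
    ).fetchall()

    reply = EmailMessage()
    reply["From"] = robot_addr
    reply["To"] = request["From"]
    reply["Subject"] = "Re: " + (request["Subject"] or "open bugs")

    # Plain text only, so the reply works offline and with screen readers.
    body = [f"{len(rows)} open bug report(s):", ""]
    body += [f"#{bug_id}: {summary}" for bug_id, summary in rows]
    reply.set_content("\n".join(body))
    return reply


if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE bugs (id INTEGER PRIMARY KEY, summary TEXT, status TEXT)"
    )
    conn.executemany(
        "INSERT INTO bugs (summary, status) VALUES (?, ?)",
        [("crash on startup", "open"), ("typo in manual", "closed")],
    )
    request = EmailMessage()
    request["From"] = "dev@example.org"
    request["Subject"] = "list open"
    print(open_bugs_reply(conn, request).get_content())
```

Delivery (reading the request from a mailbox, sending the reply) is
left out; it would sit on either side of this function.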

Another is shipping a software release.  This normally consists of
uploading product files in various shipping formats (source tarballs,
debs, RPMs, and the like) to the hosting site's download area, and
associating with them a bunch of metadata including such things as a
short-form release announcement, file-type or architecture tags for
the binary packages, MD5 signatures, and the like.

With the exception of the release announcement, there is really no
reason a human being should be sitting at a web browser to type in
this sort of thing. In fact there is an excellent reason a human
*shouldn't* do it by hand - it's exactly the sort of fiddly, tedious
semi-mechanical chore at which humans tend to make (and then miss)
finger errors because the brain is not fully engaged.

It would be better for the hosting system's release-registration logic
to accept a job card via email, said job card including all the 
release metadata and URLs pointing to the product files it should
gather for the release.  Each job card could be generated by a 
project-specific script that would take the parts that really need
human attention from a human and mechanically fill in the rest. This
would both minimize human error and improve accessibility.
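
As an illustration, a project-specific job-card generator could look
roughly like the Python sketch below.  The card layout (the
`Project:`, `File:`, `Type:` and `MD5:` fields), the robot address,
and the suffix-to-type table are all assumptions invented for this
example; no existing hosting system is known to accept this format.

```python
import hashlib
from email.message import EmailMessage
from pathlib import Path

# Hypothetical mapping from file suffix to a file-type tag.
FILE_TYPES = {".tar.gz": "source tarball", ".deb": "deb", ".rpm": "rpm"}


def file_type(name):
    """Guess a file-type tag from the file name."""
    for suffix, label in FILE_TYPES.items():
        if name.endswith(suffix):
            return label
    return "unknown"


def md5_of(path):
    """Return the hex MD5 digest of a file, read in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_job_card(project, version, announcement, files, base_url,
                   robot_addr, sender):
    """Build an email job card describing one release.

    Only `announcement` needs a human; the URLs, type tags and
    checksums are filled in mechanically from the product files.
    """
    msg = EmailMessage()
    msg["From"] = sender
    msg["To"] = robot_addr
    msg["Subject"] = f"release {project} {version}"

    lines = [f"Project: {project}", f"Version: {version}",
             f"Announcement: {announcement}", ""]
    for path in files:
        name = Path(path).name
        lines += [f"File: {base_url}/{name}",
                  f"Type: {file_type(name)}",
                  f"MD5: {md5_of(path)}",
                  ""]
    msg.set_content("\n".join(lines))
    return msg
```

A release script would call `build_job_card()` with the list of
built product files and hand the result to whatever mail transport
the project already uses.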

In general, a good question for hosting-system designers to be asking
themselves about each operation of the system would be "Do I provide a
way to remote-script this through an email robot or XML-RPC interface
or the like?" When the answer is "no", that's a bug that needs to be
fixed.
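
To make the XML-RPC option concrete, here is a small sketch using
only the Python standard library.  The `list_open_bugs` method name
and the shape of the bug records are hypothetical; a real hosting
system would define its own calls.

```python
import threading
import xmlrpc.client
from xmlrpc.server import SimpleXMLRPCServer


def start_tracker_rpc(bugs, host="127.0.0.1", port=0):
    """Expose a hypothetical list_open_bugs() call over XML-RPC.

    Returns the running server and the URL a client script should use.
    Port 0 asks the operating system for any free port.
    """
    server = SimpleXMLRPCServer((host, port), logRequests=False)

    def list_open_bugs():
        return [b for b in bugs if b["status"] == "open"]

    server.register_function(list_open_bugs)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    actual_port = server.server_address[1]
    return server, f"http://{host}:{actual_port}/"


if __name__ == "__main__":
    bugs = [
        {"id": 1, "summary": "crash on startup", "status": "open"},
        {"id": 2, "summary": "typo in manual", "status": "closed"},
    ]
    server, url = start_tracker_rpc(bugs)
    try:
        # This is all a remote script needs in order to drive the tracker.
        client = xmlrpc.client.ServerProxy(url)
        for bug in client.list_open_bugs():
            print(f"#{bug['id']}: {bug['summary']}")
    finally:
        server.shutdown()
        server.server_close()
```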

=== Poor support for immigration ===

The first (and in my opinion, most serious) failing I identified is
poor support for snapshotting and if necessary out-migrating a
project.  Most hosting systems do almost as badly at in-migrating a
project that already has a history, as opposed to one started from
nothing on the site.  

Even uploading an existing source code repository at the start of a
project (as opposed to starting with an empty one) is only spottily
supported. Just try, for example, to find a site that will let you upload
a mailbox full of archives from a pre-existing development list in
order to re-home it at the project's new development site.  

This is the flip side of the data-jail problem. It has some of the
same causes, and many of the same consequences too.  Because it makes 
re-homing projects unnecessarily difficult, it means that project
leads cannot respond effectively to hosting-site problems.  This 
creates a systemic brittleness in our development infrastructure.

===========================================================================

I believe in underpromising and overperforming, so I'm not going to
talk up any grand plans to fix this.  But I will say that I intend to
do more than talk.  And yesterday the project leaders of Savane, the
hosting system that powers gna.org and Savannah, read this and invited
me to join their project team.
-- 
		<a href="http://www.catb.org/~esr/">Eric S. Raymond</a>

Ideology, politics and journalism, which luxuriate in failure, are
impotent in the face of hope and joy.
	-- P. J. O'Rourke

------------------------------------------------------
http://subversion.tigris.org/ds/viewMessage.do?dsForumId=462&dsMessageId=2404722

Re: Three common failings in project hosting systems

Posted by David Glasser <gl...@davidglasser.net>.
On Wed, Oct 7, 2009 at 3:17 PM, Eric S. Raymond <es...@thyrsus.com> wrote:
> One of the consequences of the berlios.de crash this weekend is
> svncutter. Another is the following rant, revisiting some issues that
> have been bothering me ever since I was told I had inadvertently
> influenced the design of the original SourceForge ten years ago.  I'm
> posting it here because of David Glasser's last reply to me; I think
> the relevance will be clear.

[ Note: I'm a Google employee as well as a (semi-active) Subversion
developer; I work on our Project Hosting service in my 20% time.  I'm
making this post because I do truly believe as an individual that our
product is a good one and that it's exciting that we fulfill most of
your requirements below. I feel less qualified to discuss the other
prominent hosting sites out there, though many of them are excellent
as well. ]

Hi Eric.  Your message had better timing than you knew; see this, just
announced today:

http://googlecode.blogspot.com/2009/10/issue-tracker-data-api-for-project.html

as well as http://www.dataliberation.org/google/code-project-hosting
for an overview of the support for getting your data in and out of
Google's Project Hosting service.

>
> ===========================================================================
>
> === Data Jails ===
>
> The worst problem with almost all current hosting sites is that
> they're data jails. You can put data (the source code revision
> history, mailing list address lists, bug reports) into them, but
> getting a complete snapshot of that data back out often ranges from
> painful to impossible.
>
> Why is this an issue? Very practically, because hosting sites, even
> well-established ones, sometimes go off the air. Any prudent project
> lead should be thinking about how to recover if that happens, and how
> to take periodic backups of critical project data.  But more generally,
> it's *your data*.  You should own it.  If you can't push a button and
> get a snapshot of your project state out of the site whenever you
> want, you *don't* own it.
>
> When berlios.de crashed on me, I was lucky; I had been preparing to
> migrate GPSD off the site due to deteriorating performance; I had
> a Subversion dump file that was less than two weeks old.  I was
> able to bring that up to date by translating commits from an
> unofficial git mirror. I was doubly lucky in that the Mailman
> administrative pages remained accessible even when the project
> webspace and repositories had been 404 for two days.
>
> But actually retrieving my mailing-list data was a hideous process
> that involved screen-scraping HTML by hand, and I had no hope at all
> of retrieving the bug tracker state.
>
> This anecdote illustrates the most serious manifestations of the
> data-jail problem.  Third-generation version-control (hg, git, bzr,
> etc.) systems pretty much solve it for code repositories; every
> checkout is a mirror.  But most projects have two other critical data
> collections: their mailing-list state and their bug-tracker state.
> And, on all sites I know of in late 2009, those are seriously jailed.
>
> This is a problem that goes straight to the design of the software
> subsystems used by these sites.  Some are generic: of these, the most
> frequent single offender is 2.x versions of Mailman, the most widely
> used mailing-list manager (the Mailman maintainers claim to have fixed
> this in 3.0). Bug-trackers tend to be tightly tied to individual
> hosting engines, and are even harder to dig data out of.  They also
> illustrate the second major failing...

As you can see from the links above, Google Project Hosting now offers
an API to our issue tracker which you can use to write export/import
tools.  (We don't have a push-button tool that exports or imports an
entire project in one fell swoop now, but I'd be happy to add a link
to one to our Data Liberation page above if somebody wrote one!) As
you mention, exporting and importing Mercurial repositories is
trivial, and the svnsync tool provides a relatively straightforward
way of exporting or importing Subversion repositories. The site wiki
is hosted in Subversion or Mercurial and can be exported that way.
Downloads can be listed via a feed and, well, downloaded.
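
For the Subversion case, a backup script built on svnsync might look
like the Python sketch below.  The source URL and mirror directory
are placeholders, and the always-succeeding hook is the simplest
possible choice, suitable only for a private mirror.

```python
import subprocess
from pathlib import Path


def mirror_commands(source_url, mirror_dir):
    """Return the svnadmin/svnsync command lines for mirroring a repository."""
    mirror_url = Path(mirror_dir).resolve().as_uri()
    return [
        ["svnadmin", "create", str(mirror_dir)],
        ["svnsync", "initialize", mirror_url, source_url],
        ["svnsync", "synchronize", mirror_url],
    ]


def mirror_repository(source_url, mirror_dir):
    """Create a local mirror and copy every revision into it."""
    create, init, sync = mirror_commands(source_url, mirror_dir)
    subprocess.run(create, check=True)

    # svnsync keeps its bookkeeping in revision properties, so the
    # mirror needs a pre-revprop-change hook that permits the changes.
    hook = Path(mirror_dir) / "hooks" / "pre-revprop-change"
    hook.write_text("#!/bin/sh\nexit 0\n")
    hook.chmod(0o755)

    subprocess.run(init, check=True)
    subprocess.run(sync, check=True)


if __name__ == "__main__":
    # Placeholder URL and path; substitute the project's own.
    mirror_repository("http://svn.example.org/project", "project-mirror")
```

Re-running only the `svnsync synchronize` step later brings an
existing mirror up to date, which makes it a natural cron job.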

(Of course, moving from Google Code to a different platform would
still require translating issue metadata from one system to another,
translating wiki markup, etc.)

Mailing lists are another story.  Google Project Hosting doesn't
itself provide a mailing list service; it allows you to specify
mailing list addresses, set up commit mails to mailing lists, and so
on, and certainly many projects choose to use Google Groups for this,
but this is essentially outside of the scope of a project hosting site
in my opinion.  Choose the best mailing list provider for your
project's needs and your hosting site should be able to work with
that.

> === Unscriptability ===
>
> All hosting-site suites are Web-centric, operated primarily or
> entirely through a browser.  This solves many problems, but creates a
> few as well.  One is that browsers, like GUIs in general, are badly
> suited for stereotyped and repetitive tasks.  Another is that they
> have poor accessibility for people with visual or motor-control
> issues.
>
> Here again the issues with version-control systems are relatively
> minor, because all those in common use are driven by CLI tools
> that are easy to script.  Mailing lists don't present serious issues
> either; the only operation on them that normally goes through the web
> is moderation of submissions, and the demands of that operation are
> fairly well matched to a browser-style interface.
>
> But there are other common operations that need to be scriptable and
> are generally not. A representative one is getting a list of open bug
> reports to work on later - say, somewhere that your net connection is
> spotty.  There is no reason this couldn't be handled by an email
> autoresponder robot connected to the bug-tracker database, a feature
> which would also improve tracker accessibility for the blind.
>
> Another is shipping a software release.  This normally consists of
> uploading product files in various shipping formats (source tarballs,
> debs, RPMs, and the like) to the hosting site's download area, and
> associating with them a bunch of metadata including such things as a
> short-form release announcement, file-type or architecture tags for
> the binary packages, MD5 signatures, and the like.
>
> With the exception of the release announcement, there is really no
> reason a human being should be sitting at a web browser to type in
> this sort of thing. In fact there is an excellent reason a human
> *shouldn't* do it by hand - it's exactly the sort of fiddly, tedious
> semi-mechanical chore at which humans tend to make (and then miss)
> finger errors because the brain is not fully engaged.
>
> It would be better for the hosting system's release-registration logic
> to accept a job card via email, said job card including all the
> release metadata and URLs pointing to the product files it should
> gather for the release.  Each job card could be generated by a
> project-specific script that would take the parts that really need
> human attention from a human and mechanically fill in the rest. This
> would both minimize human error and improve accessibility.
>
> In general, a good question for hosting-system designers to be asking
> themselves about each operation of the system would be "Do I provide a
> way to remote-script this through an email robot or XML-RPC interface
> or the like?" When the answer is "no", that's a bug that needs to be
> fixed.

Google Project Hosting's issue tracker and downloads (releases) are scriptable.

> === Poor support for immigration ===
>
> The first (and in my opinion, most serious) failing I identified is
> poor support for snapshotting and if necessary out-migrating a
> project.  Most hosting systems do almost as badly at in-migrating a
> project that already has a history, as opposed to one started from
> nothing on the site.
>
> Even uploading an existing source code repository at start of a
> project (as opposed to starting with an empty one) is only spottily
> supported.

This is documented on our Data Liberation page (and FAQ).

> Just try, for example, to find a site that will let you upload
> a mailbox full of archives from a pre-existing development list in
> order to re-home it at the project's new development site.
>
> This is the flip side of the data-jail problem. It has some of the
> same causes, and many of the same consequences too.  Because it makes
> re-homing projects unnecessarily difficult, it means that project
> leads cannot respond effectively to hosting-site problems.  This
> creates a systemic brittleness in our development infrastructure.
>
> ===========================================================================
>
> I believe in underpromising and overperforming, so I'm not going to
> talk up any grand plans to fix this.  But I will say that I intend to
> do more than talk.  And yesterday the project leaders of Savane, the
> hosting system that powers gna.org and Savannah, read this and invited
> me to join their project team.

In conclusion, I think that Google Project Hosting currently fits all
of your wishes for a project hosting site, other than those related to
mailing lists, which it essentially defers by not offering a
software-project-specific mailing list hosting service.

--dave

-- 
glasser@davidglasser.net | langtonlabs.org | flickr.com/photos/glasser/

------------------------------------------------------
http://subversion.tigris.org/ds/viewMessage.do?dsForumId=462&dsMessageId=2408324