Posted to dev@subversion.apache.org by Eric Raymond <es...@snark.thyrsus.com> on 2010/11/09 17:17:50 UTC

Announcing reposurgeon, and requesting fast-import support.

Some months back I contributed svncutter to Subversion.  This was a tool
for doing surgery on dumpfiles intended to remove artifacts associated with
conversions from older VCSes.

My interest in tools for repository surgery has continued, and I recently
spotted an opportunity in the increasing use of git-fast-import streams
as a history-interchange format.  I have written what I believe is the first 
*native* application for fast-import streams, a repository editor I
call reposurgeon.

You can read the announcement here: http://esr.ibiblio.org/?p=2718

Project resource page with tarballs: http://www.catb.org/~esr/reposurgeon/

Freshmeat page: http://freshmeat.net/projects/reposurgeon

HTML manual: http://www.catb.org/~esr/reposurgeon/reposurgeon.html

Perhaps the most interesting thing about reposurgeon is that, by
design, it knows almost nothing about any individual VCS.  All it
counts on is the ability to get a fast-import dump from a repo and
then the ability to create a repo from the dump after the contents of
the import stream have been modified.

If you hadn't heard about this before, it's because the project is in 
alpha and only two weeks old.  Nevertheless, it is already good enough
for production use on git repositories. Operations supported include
editing of commit and tag metadata, deletion of commits, expunges of
file history, coalescing single-file commit cliques with identical
comments, and topological cut. The code is backed by an extensive
regression-test suite and fully documented.

I also have working support for bzr and hg, though the practical utility
of same is presently limited by unstable and poorly-supported export/import
tools. I'm working with a bzr dev to address this problem; better solutions
should be forthcoming within weeks, if not days.

Which brings me to my feature request.  Please add native support for
fast-export and fast-import to svndump.  This would be a good idea
in general, but my specific reason for wanting it is to enable
reposurgeon to edit Subversion repositories.

The export side is, of course, almost trivial.  Proof of concept under
MIT license is here: <http://c133.org/code/svn-fast-export.c>.  It
needs a bit of extension work around tags and branches; I won't
belabor the obvious (and easily solvable) issues with those.  There are
two more substantive ones:

1) Whatever merge-tracking hair you represent internally should be dumped
as 'merge' commit properties.

2) User commit properties (e.g. those not in the svn: namespace)
should be exported using the bzr properties extension, which
reposurgeon handles now and which seems likely to make it into git core at
some point.  Syntax:

   property <space> NAME <space> VALUE-LENGTH <space> VALUE LF

or, if the value is empty:

   property <space> NAME LF
 
NAME and VALUE are utf8-encoded.  The properties for each commit are sorted 
by the property name.

Also note that an import stream actually containing commit-property declarations
should have a line reading "feature commit-properties" before the first commit.
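
A minimal sketch of how an exporter might emit the property lines described above. The function names are hypothetical, and it assumes that VALUE-LENGTH counts the bytes of the UTF-8 encoded value and that property names contain no spaces; neither point is spelled out in this message.

```python
from typing import Mapping

# Line that should appear before the first commit when properties are used.
FEATURE_LINE = b"feature commit-properties\n"


def format_property(name: str, value: str) -> bytes:
    """Serialize one commit property as a single stream line.

    Assumption: VALUE-LENGTH is the byte length of the UTF-8 encoded value.
    """
    name_bytes = name.encode("utf-8")
    value_bytes = value.encode("utf-8")
    if not value_bytes:
        # Empty value: short form without length or value.
        return b"property " + name_bytes + b"\n"
    return b"property %s %d %s\n" % (name_bytes, len(value_bytes), value_bytes)


def format_commit_properties(props: Mapping[str, str]) -> bytes:
    """Serialize all properties of one commit, sorted by property name."""
    return b"".join(format_property(name, props[name]) for name in sorted(props))
```

For example, format_commit_properties({"reviewer": "alice", "flag": ""}) yields a "property flag" line followed by a "property reviewer 5 alice" line.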

The import side is less trivial, but given that you've already got internal
representations for merge-tracking it shouldn't be too difficult either.

I'd offer to do this, but I'm deliberately staying away from writing
export/import code myself, other than the implementations inside
reposurgeon. It will be better, long-term, if my reposurgeon
assumptions don't leak into other implementations; they ought to be
engineered from the fast-import stream documentation.  See the
definitive web page at:

<http://www.kernel.org/pub/software/scm/git/docs/git-fast-import.html>.

Finally, I will note that I think this feature could be significant
for Subversion's competitive posture. Because exporters are easy while
importers are more difficult, supporting import streams only with
exporters and only through sketchy third-party tools tends to
encourage migration to git while discouraging migration away from it.

Other VCSes, with bzr taking point, are positioning themselves as
destinations rather than places to leave by mainlining importers.  As
a friend of Subversion, I strongly recommend that it should do
likewise.
-- 
		<a href="http://www.catb.org/~esr/">Eric S. Raymond</a>

A human being should be able to change a diaper, plan an invasion,
butcher a hog, conn a ship, design a building, write a sonnet, balance
accounts, build a wall, set a bone, comfort the dying, take orders, give
orders, cooperate, act alone, solve equations, analyze a new problem,
pitch manure, program a computer, cook a tasty meal, fight efficiently,
die gallantly. Specialization is for insects.
	-- Robert A. Heinlein, "Time Enough for Love"

Re: Announcing reposurgeon, and requesting fast-import support.

Posted by Ramkumar Ramachandra <ar...@gmail.com>.
Hi,

[+CC: Daniel, for making me notice this email in the first place]

Eric Raymond writes:
> My interest in tools for repository surgery has continued, and I recently
> spotted an opportunity in the increasing use of git-fast-import streams
> as a history-interchange format.  I have written what I believe is the first 
> *native* application for fast-import streams, a repository editor I
> call reposurgeon.

Cool tool!

> Which brings me to my feature request.  Please add native support for
> fast-export and fast-import to svndump.  This would be a good idea
> in general, but my specific reason for wanting it is to enable
> reposurgeon to edit Subversion repositories.
> 
> The export side is, of course, almost trivial.  Proof of concept under
> MIT license is here: <http://c133.org/code/svn-fast-export.c>.  It
> needs a bit of extension work around tags and branches; I won't
> belabor the obvious (and easily solvable) issues with those.  There are
> two more substantive ones:

With svnrdump (merged into Subversion trunk in subversion/svnrdump)
and svn-fe (merged into git.git in contrib/svn-fe + vcs-svn/), it's
possible to produce a fast-import stream from a remote repository
without the need for any local mirroring. Unfortunately, svnrdump can
only produce a deltified dumpfile v3, and the patch series that adds
dumpfile v3 support to svn-fe hasn't been merged into git.git yet; you
can pick up the branch `dumpfilev3` from David's repository
<http://github.com/barrbrain/git> though.
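
A rough sketch of chaining the tools named above from a script. It assumes svnrdump, an svn-fe build that understands dumpfile v3, and git are all on PATH, and that the working directory is an already-initialized git repository; build_pipeline and run_pipeline are hypothetical helper names.

```python
import subprocess
from typing import List, Optional


def build_pipeline(repo_url: str) -> List[List[str]]:
    """Return the command line for each stage: dump, convert, import."""
    return [
        ["svnrdump", "dump", repo_url],  # remote repository -> dumpfile on stdout
        ["svn-fe"],                      # dumpfile on stdin -> fast-import stream
        ["git", "fast-import"],          # fast-import stream -> git object store
    ]


def run_pipeline(stages: List[List[str]], cwd: Optional[str] = None) -> List[int]:
    """Run the stages connected by pipes and return their exit codes."""
    procs = []
    prev_stdout = None
    for index, argv in enumerate(stages):
        is_last = index == len(stages) - 1
        proc = subprocess.Popen(
            argv,
            stdin=prev_stdout,
            stdout=None if is_last else subprocess.PIPE,
            cwd=cwd,
        )
        if prev_stdout is not None:
            # Drop our copy so the upstream stage sees a closed pipe
            # if the downstream stage exits early.
            prev_stdout.close()
        prev_stdout = proc.stdout
        procs.append(proc)
    return [proc.wait() for proc in procs]
```

A call such as run_pipeline(build_pipeline(url), cwd=target_repo) would then stream the history without a local mirror, subject to the dumpfile v3 caveat above.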

> 1) Whatever merge-tracking hair you represent internally should be dumped
> as 'merge' commit properties.
> 
> 2) User commit properties (e.g. those not in the svn: namespace)
> should be exported using the bzr properties extension, which
> reposurgeon handles now and which seems likely to make it into git core at
> some point.  Syntax:
> 
>    property <space> NAME <space> VALUE-LENGTH <space> VALUE LF
> 
> or, if the value is empty:
> 
>    property <space> NAME LF
>  
> NAME and VALUE are utf8-encoded.  The properties for each commit are sorted 
> by the property name.
> 
> Also note that an import stream actually containing commit-property declarations
> should have a line reading "feature commit-properties" before the first commit.

Actually, the objective of svn-fe is to produce a conformant
fast-import stream (so Git can import it into its object store): some
information is lost in the process. Does reposurgeon require all the
information, or can it operate on the stream that svn-fe produces?

That brings us to another point: a fast-import stream is probably not
the most faithful representation of a Subversion repository; I think
a dumpfile v3 fits that bill better. Subversion already supports native
export/import of this format: svnadmin (dump|load) when mirrored
locally and svnrdump (dump|load) when it's not :)

> The import side is less trivial, but given that you've already got internal
> representations for merge-tracking it shouldn't be too difficult either.
> <http://www.kernel.org/pub/software/scm/git/docs/git-fast-import.html>.

svn-fe already supports converting a dumpfile v3 to a fast-import
stream. Getting it to do the reverse shouldn't be too hard; we are
already working on it :)

> Finally, I will note that I think this feature could be significant
> for Subversion's competitive posture. Because exporters are easy while
> importers are more difficult, supporting import streams only with
> exporters and only through sketchy third-party tools tends to
> encourage migration to git while discouraging migration away from it.

Are you happy with having a combination of svnrdump and svn-fe for
this, or do you think Subversion should natively support fast-import?
I don't think it'll be very difficult to support natively, but it's
kind of a hack because Subversion already has so much infrastructure
to deal with dumpfile v3.

P.S. I'm one of the students who did a GSoC project with Git this
summer. If you recall, you even commented on the proposal I posted to
the Git list :) svnrdump and svn-fe are the products of that same
project.

-- Ram

Re: Announcing reposurgeon, and requesting fast-import support.

Posted by Eric Raymond <es...@thyrsus.com>.
Julian Foad <ju...@wandisco.com>:
> Very cool.  I wonder how practical it will be for doing various
> "obliterate" tasks on large repositories.

The main overhead is, as you might imagine, the parse time for the .fi
file.  And, I admit, it can be painfully slow.  Loading up the git
repository takes 40 minutes on my PC.  I would be more concerned about
this if the operations reposurgeon supports weren't unusual, generally
one-time procedures.

But, as is, the tool works, and I'm trying to follow the "Make it
work, make it right, *then* make it fast" heuristic.  Even though one
colleague has argued with a straight face that I *shouldn't* speed-tune
it - he thinks repo surgery is so risky and potentially shady that
it's good for using the tool to require sustained attention and an 
effort of will.

> > NAME and VALUE are utf8-encoded.  The properties for each commit are sorted 
> > by the property name
> 
> Ah, so the format doesn't support arbitrary 'binary' property values?  I
> guess we can seek a way to work around that.

Indeed. Base-64 encoding is our friend :-).
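
A small sketch of that workaround: binary property values are turned into ASCII text before being written as a property VALUE and decoded again on import. How a reader would tell that a given value is Base-64 is not settled in this thread, so that convention is left out; the function names are hypothetical.

```python
import base64


def encode_binary_value(raw: bytes) -> str:
    """Turn arbitrary bytes into ASCII text that is safe as a utf8 VALUE."""
    return base64.b64encode(raw).decode("ascii")


def decode_binary_value(text: str) -> bytes:
    """Recover the original bytes; raises an error on non-Base-64 input."""
    return base64.b64decode(text, validate=True)
```

Round-tripping is lossless: decode_binary_value(encode_binary_value(raw)) returns raw for any byte string.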
-- 
		<a href="http://www.catb.org/~esr/">Eric S. Raymond</a>

Re: Announcing reposurgeon, and requesting fast-import support.

Posted by Julian Foad <ju...@wandisco.com>.
On Tue, 2010-11-09, Eric Raymond wrote:
> Some months back I contributed svncutter to Subversion.  This was a tool
> for doing surgery on dumpfiles intended to remove artifacts associated with
> conversions from older VCSes.
> 
> My interest in tools for repository surgery has continued, and I recently
> spotted an opportunity in the increasing use of git-fast-import streams
> as a history-interchange format.  I have written what I believe is the first 
> *native* application for fast-import streams, a repository editor I
> call reposurgeon.

Very cool.  I wonder how practical it will be for doing various
"obliterate" tasks on large repositories.  ("Obliterate" can mean quite
a range of different things, including remove files, or set their
content to empty, in certain revs or rev ranges.)  There is still demand
for on-line obliteration (to be performed while the untouched parts of
the repository remain accessible) but that is very difficult to achieve
(and I'm stalled), and I hope this option for off-line editing may be
able to take some of the pressure off.  Peter S just reminded me that
some obliterate tasks involve only a few recent revisions, so we wonder
if it is practical to dump and edit and re-load only the last few
revisions, if we assume that we can make Subversion forget the last N
revisions. 


> You can read the announcement here: http://esr.ibiblio.org/?p=2718
> 
> Project resource page with tarballs: http://www.catb.org/~esr/reposurgeon/
> 
> Freshmeat page: http://freshmeat.net/projects/reposurgeon
> 
> HTML manual: http://www.catb.org/~esr/reposurgeon/reposurgeon.html
> 
> Perhaps the most interesting thing about reposurgeon is that, by
> design, it knows almost nothing about any individual VCS.  All it
> counts on is the ability to get a fast-import dump from a repo and
> then the ability to create a repo from the dump after the contents of
> the import stream have been modified.
> 
> If you hadn't heard about this before, it's because the project is in 
> alpha and only two weeks old.  Nevertheless, it is already good enough
> for production use on git repositories. Operations supported include
> editing of commit and tag metadata, deletion of commits, expunges of
> file history, coalescing single-file commit cliques with identical
> comments, and topological cut. The code is backed by an extensive
> regression-test suite and fully documented.
> 
> I also have working support for bzr and hg, though the practical utility
> of same is presently limited by unstable and poorly-supported export/import
> tools. I'm working with a bzr dev to address this problem; better solutions
> should be forthcoming within weeks, if not days.
> 
> Which brings me to my feature request.  Please add native support for
> fast-export and fast-import to svndump.  This would be a good idea
> in general, but my specific reason for wanting it is to enable
> reposurgeon to edit Subversion repositories.
> 
> The export side is, of course, almost trivial.  Proof of concept under
> MIT license is here: <http://c133.org/code/svn-fast-export.c>.  It
> needs a bit of extension work around tags and branches; I won't
> belabor the obvious (and easily solvable) issues with those.  There are
> two more substantive ones:
> 
> 1) Whatever merge-tracking hair you represent internally should be dumped
> as 'merge' commit properties.
> 
> 2) User commit properties (e.g. those not in the svn: namespace)
> should be exported using the bzr properties extension, which
> reposurgeon handles now and which seems likely to make it into git core at
> some point.  Syntax:
> 
>    property <space> NAME <space> VALUE-LENGTH <space> VALUE LF
> 
> or, if the value is empty:
> 
>    property <space> NAME LF
>  
> NAME and VALUE are utf8-encoded.  The properties for each commit are sorted 
> by the property name

Ah, so the format doesn't support arbitrary 'binary' property values?  I
guess we can seek a way to work around that.

> Also note that an import stream actually containing commit-property declarations
> should have a line reading "feature commit-properties" before the first commit.
> 
> The import side is less trivial, but given that you've already got internal
> representations for merge-tracking it shouldn't be too difficult either.
> 
> I'd offer to do this, but I'm deliberately staying away from writing
> export/import code myself, other than the implementations inside
> reposurgeon. It will be better, long-term, if my reposurgeon
> assumptions don't leak into other implementations; they ought to be
> engineered from the fast-import stream documentation.  See the
> definitive web page at:
> 
> <http://www.kernel.org/pub/software/scm/git/docs/git-fast-import.html>.
> 
> Finally, I will note that I think this feature could be significant
> for Subversion's competitive posture. Because exporters are easy while
> importers are more difficult, supporting import streams only with
> exporters and only through sketchy third-party tools tends to
> encourage migration to git while discouraging migration away from it.
> 
> Other VCSes, with bzr taking point, are positioning themselves as
> destinations rather than places to leave by mainlining importers.  As
> a friend of Subversion, I strongly recommend that it should do
> likewise.


Thanks.

- Julian